### Bayes’ Theorem is stated as:

**P(h|d) = (P(d|h) * P(h)) / P(d)**

Where:

- **P(h|d)** is the probability of hypothesis h given the data d. This is called the posterior probability.
- **P(d|h)** is the probability of data d given that the hypothesis h was true.
- **P(h)** is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
- **P(d)** is the probability of the data (regardless of the hypothesis).

You can see that we are interested in calculating the posterior probability P(h|d) from the prior probability P(h), the likelihood P(d|h), and the probability of the data P(d).
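As a small worked example, the formula can be evaluated directly. The numbers below are hypothetical and chosen only for illustration, and `posterior` is a helper name introduced here, not something from a library. P(d) is obtained from the law of total probability over h and not-h.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Return P(h|d) = P(d|h) * P(h) / P(d)."""
    return likelihood_d_given_h * prior_h / evidence_d

# Hypothetical numbers, for illustration only.
p_h = 0.01              # P(h): prior probability of the hypothesis
p_d_given_h = 0.9       # P(d|h): probability of the data if h is true
p_d_given_not_h = 0.05  # P(d|not h): probability of the data if h is false

# P(d) via the law of total probability over h and not-h.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(round(p_h_given_d, 4))  # 0.1538
```

Even with a high likelihood P(d|h), the small prior keeps the posterior low, which is why the prior cannot be ignored in general.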

---

**After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability**. This is the most probable hypothesis and may formally be called the **maximum a posteriori (MAP)** hypothesis.

This can be written as:

**MAP(h) = max(P(h|d))**

or

**MAP(h) = max((P(d|h) * P(h)) / P(d))**

or

**MAP(h) = max(P(d|h) * P(h))**

P(d) is a normalizing term that turns the product P(d|h) * P(h) into a proper probability. Because it is the same for every hypothesis, we can drop it when we only want to know which hypothesis is most probable.

**Put another way: for a fixed input X, we want to know which class y has the larger probability. P(X) is the same for every class y, so we can remove it.**
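The following sketch checks this numerically. The priors and likelihoods are hypothetical, and P(d) is computed as the sum of the unnormalized scores, which assumes the three hypotheses are mutually exclusive and cover all cases.

```python
# Hypothetical priors P(h) and likelihoods P(d|h) for three hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.3}

# Unnormalized scores P(d|h) * P(h).
scores = {h: likelihoods[h] * priors[h] for h in priors}

# P(d) is the sum of the unnormalized scores over all hypotheses
# (assumes the hypotheses are mutually exclusive and exhaustive).
p_d = sum(scores.values())
posteriors = {h: s / p_d for h, s in scores.items()}

map_unnormalized = max(scores, key=scores.get)
map_normalized = max(posteriors, key=posteriors.get)
print(map_unnormalized, map_normalized)  # h2 h2
```

Dividing every score by the same positive constant changes the values but not their ranking, so both versions select the same hypothesis.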

---

If we have an **equal** number of instances in each class in our training data, then the **probability of each class (i.e. P(h)) will be equal**. Again, this would be a constant term in our equation and we could drop it so that we end up with:

**MAP(h) = max(P(d|h))**

---

## Types of Naive Bayes

#### Multinomial Naive Bayes:

This is mostly used for document classification problems, i.e. deciding whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

#### Bernoulli Naive Bayes:

This is similar to multinomial naive Bayes, but the predictors are boolean variables. The features that we use to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
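The difference between the two feature representations can be sketched with a toy document. The vocabulary and the document below are hypothetical, and only the feature extraction step is shown, not the classifier itself.

```python
from collections import Counter

vocabulary = ["goal", "match", "vote", "election"]  # hypothetical vocabulary
document = "goal goal match vote"                   # hypothetical document

counts = Counter(document.split())

# Multinomial features: how many times each vocabulary word occurs.
multinomial_features = [counts[word] for word in vocabulary]

# Bernoulli features: whether each vocabulary word occurs at all.
bernoulli_features = [int(counts[word] > 0) for word in vocabulary]

print(multinomial_features)  # [2, 1, 1, 0]
print(bernoulli_features)    # [1, 1, 1, 0]
```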

#### Gaussian Naive Bayes:

When the predictors take continuous values rather than discrete ones, we assume that these values are sampled from a Gaussian distribution.

![](https://miro.medium.com/max/844/1*AYsUOvPkgxe3j1tEj2lQbg.gif)

Since the way the values are present in the dataset changes, the formula for conditional probability changes to

![](https://miro.medium.com/max/1576/1*0If5Mey7FnW_RktMM5BkaQ.png)

# Gaussian Naive Bayes:

Naive Bayes can be extended to **real-valued** attributes, most commonly by **assuming a Gaussian distribution.**

This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.

## Representation for Gaussian Naive Bayes

Above, we calculated the probabilities for input values for each class using a frequency. With real-valued inputs, we can calculate the mean and standard deviation of input values (x) for each class to summarize the distribution.

This means that in addition to the probabilities for each class, we must also store the mean and standard deviations for each input variable for each class.

## Learn a Gaussian Naive Bayes Model From Data

This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.

**mean(x) = 1/n * sum(x)**

Where n is the number of instances and x are the values for an input variable in your training data.

We can calculate the standard deviation using the following equation:

**standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))**

This is the square root of the average squared difference of each value of x from the mean value of x, where n is the number of instances, sqrt() is the square root function, sum() is the sum function, xi is a specific value of the x variable for the i’th instance and mean(x) is described above, and ^2 is the square.
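A minimal sketch of this learning step, assuming the training data is a list of numeric rows with a parallel list of class labels. The function names and the toy data are hypothetical. The standard deviation divides by n, as in the description above.

```python
from collections import defaultdict
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    # Population standard deviation: divides by n, not n - 1.
    mu = mean(values)
    return sqrt(sum((v - mu) ** 2 for v in values) / len(values))

def summarize_by_class(rows, labels):
    """Return {label: [(mean, sd) for each input column]}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [(mean(col), stdev(col)) for col in zip(*group)]
        for label, group in grouped.items()
    }

# Hypothetical toy data: two input variables, two classes.
rows = [[1.0, 10.0], [3.0, 14.0], [6.0, 20.0], [8.0, 24.0]]
labels = ["a", "a", "b", "b"]

summaries = summarize_by_class(rows, labels)
print(summaries["a"])  # [(2.0, 1.0), (12.0, 2.0)]
print(summaries["b"])  # [(7.0, 1.0), (22.0, 2.0)]
```

These per-class `(mean, sd)` pairs, together with the class probabilities, are everything the model needs to store.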

## Make Predictions With a Gaussian Naive Bayes Model

Probabilities of new x values are calculated using the Gaussian Probability Density Function (PDF).

When making predictions these parameters can be plugged into the Gaussian PDF with a new input for the variable, and in return the Gaussian PDF will provide an estimate of the probability of that new input value for that class.

**pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x - mean)^2 / (2 * sd^2)))**

Where pdf(x, mean, sd) is the Gaussian PDF, sqrt() is the square root, mean and sd are the mean and standard deviation calculated above, PI is the numerical constant, exp() is Euler's number e raised to the given power, and x is the input value for the input variable.
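The density function translates almost directly into Python. This is a sketch using only the standard library, and `gaussian_pdf` is a name chosen here for illustration.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, sd):
    """Gaussian probability density of x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * sd ** 2))
    return exponent / (sqrt(2 * pi) * sd)

at_mean = gaussian_pdf(1.0, 1.0, 1.0)     # density at the mean, about 0.3989
one_sd_away = gaussian_pdf(2.0, 1.0, 1.0)  # one sd from the mean, about 0.2420
print(at_mean, one_sd_away)
```

Note that the result is a density, not a probability, so it can exceed 1 when the standard deviation is small. That is fine for comparing classes, since only the relative sizes of the scores matter.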

We can then plug in the probabilities into the equation above to make predictions with real-valued inputs.

For example, adapting one of the above calculations with numerical values for weather and car:

**go-out = P(pdf(weather)|class=go-out) * P(pdf(car)|class=go-out) * P(class=go-out)**
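The calculation above can be sketched for all classes at once. The per-class means and standard deviations, the priors, the class name `stay-home`, and the input values are all hypothetical. The scores are unnormalized, which is enough to pick the most probable class.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, sd):
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sqrt(2 * pi) * sd)

def class_scores(x, summaries, priors):
    """Return {label: P(class) * product of pdf(x_i | class)} (unnormalized)."""
    scores = {}
    for label, params in summaries.items():
        score = priors[label]
        for value, (mean, sd) in zip(x, params):
            score *= gaussian_pdf(value, mean, sd)
        scores[label] = score
    return scores

# Hypothetical per-class (mean, sd) for two numeric inputs: weather and car.
summaries = {
    "go-out": [(25.0, 3.0), (1.0, 0.5)],
    "stay-home": [(10.0, 4.0), (0.0, 0.5)],
}
priors = {"go-out": 0.5, "stay-home": 0.5}

scores = class_scores([24.0, 0.8], summaries, priors)
prediction = max(scores, key=scores.get)
print(prediction)  # go-out
```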

# How to Best Prepare Your Data for Naive Bayes

#### Categorical Inputs: 
Naive Bayes assumes label attributes such as binary, categorical or nominal.

#### Gaussian Inputs: 
If the input variables are real-valued, a Gaussian distribution is assumed. In that case, the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean).
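One simple way to apply such a rule to a single input variable is sketched below. The function name, the default threshold of 3 standard deviations, and the sample values are assumptions made for illustration.

```python
from math import sqrt

def remove_outliers(values, max_sd=3.0):
    """Keep values within max_sd population standard deviations of the mean."""
    mu = sum(values) / len(values)
    sd = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sd == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mu) <= max_sd * sd]

# Hypothetical sample: twenty typical values and one extreme value.
values = [10.0] * 10 + [12.0] * 10 + [100.0]
filtered = remove_outliers(values)
print(len(values), len(filtered))  # 21 20
```

Because the outlier itself inflates the mean and standard deviation, this rule only catches extreme values when the sample is large enough. On very small samples no value can lie 3 standard deviations from the mean.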

#### Classification Problems: 
Naive Bayes is a classification algorithm suitable for binary and multiclass classification.

#### Log Probabilities: 
The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
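The underflow is easy to reproduce. In the sketch below, the number of features and the likelihood values are hypothetical. The direct product collapses to zero in double precision, while the sum of logs stays representable.

```python
from math import log

# Hypothetical: 400 feature likelihoods of 1e-3 each, plus a class prior.
likelihoods = [1e-3] * 400
prior = 0.5

# Direct multiplication underflows to exactly 0.0 in double precision.
product = prior
for p in likelihoods:
    product *= p

# Summing logs keeps the score representable.
log_score = log(prior) + sum(log(p) for p in likelihoods)

print(product)    # 0.0
print(log_score)  # a large negative number, roughly -2763.8
```

Since the logarithm is monotonically increasing, the class with the largest log score is also the class with the largest product, so the prediction is unchanged.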

#### Kernel Functions: 
Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used such as a variety of kernel density functions.
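As a rough illustration, a Gaussian kernel density estimate can stand in for the single Gaussian PDF of one input variable within one class. The function name, the sample values, and the bandwidth are hypothetical, and choosing a good bandwidth is a separate problem not covered here.

```python
from math import exp, pi, sqrt

def kde_density(x, samples, bandwidth):
    """Gaussian kernel density estimate of x from the training samples."""
    total = 0.0
    for s in samples:
        z = (x - s) / bandwidth
        total += exp(-0.5 * z * z) / sqrt(2 * pi)
    return total / (len(samples) * bandwidth)

# Hypothetical bimodal training values for one input variable in one class.
samples = [0.9, 1.0, 1.2, 4.8, 5.0, 5.1]
bandwidth = 0.5  # assumed smoothing width

density_near_mode = kde_density(1.0, samples, bandwidth)
density_between_modes = kde_density(3.0, samples, bandwidth)
print(density_near_mode > density_between_modes)  # True
```

A single Gaussian fitted to these samples would place its peak near 3.0, between the two clusters, which is exactly where the data is absent. The kernel estimate follows the two clusters instead.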

#### Update Probabilities: 
When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
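For Gaussian inputs, one way to do this without revisiting old data is to keep running statistics. The sketch below uses Welford's online algorithm, which is a technique brought in here for illustration rather than something described in the text above. The class name and the sample values are hypothetical.

```python
from math import sqrt

class RunningStats:
    """Running mean and population standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def sd(self):
        return sqrt(self.m2 / self.n) if self.n else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

# Mean is 5.0 and population sd is 2.0, up to floating-point rounding.
print(stats.mean, stats.sd)
```

One such object per input variable per class, plus a count of instances per class for the priors, is enough to keep the model current as new rows arrive.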

# Reference

[Naive Bayes for Machine Learning](https://machinelearningmastery.com/naive-bayes-for-machine-learning/)

[Naive Bayes Classifier From Scratch in Python](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)

[Naive Bayes Classifier](https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c)

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the iris dataset
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# train the model on the training set
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# make predictions on the testing set
y_pred = gnb.predict(X_test)

# compare actual response values (y_test) with predicted response values (y_pred)
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
```

```
Gaussian Naive Bayes model accuracy(in %): 95.0
```
