In [1]:
from scipy import stats
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Bayes Theorem of Conditional Probability
Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability.

Recall that marginal probability is the probability of an event, irrespective of other random variables. If the random variable is independent of the others, it is simply the probability of the event. If it depends on other variables, the marginal probability is obtained by summing the joint probability of the event over all outcomes of those other variables, which is called the sum rule.

 - __Marginal Probability__: The probability of an event irrespective of the outcomes of other random variables, e.g. P(A).
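
As a minimal sketch of the sum rule, the snippet below marginalises a small joint probability table. The table values and the names `joint`, `p_x` and `p_y` are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y):
# rows are the outcomes of X, columns are the outcomes of Y.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

# Sum rule: marginalise by summing the joint over the other variable.
p_x = joint.sum(axis=1)  # P(X), summed over the outcomes of Y
p_y = joint.sum(axis=0)  # P(Y), summed over the outcomes of X

print('P(X) =', p_x)
print('P(Y) =', p_y)
```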
 
The joint probability is the probability of two (or more) simultaneous events, often described in terms of events A and B from two dependent random variables, e.g. X and Y. The joint probability is often summarized as just the outcomes, e.g. A and B.

 - __Joint Probability__: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

The conditional probability is the probability of one event given the occurrence of another event, often described in terms of events A and B from two dependent random variables e.g. X and Y.

 - __Conditional Probability__: Probability of one (or more) events given the occurrence of another event, e.g. P(A given B) or P(A | B).

The joint probability can be calculated using the conditional probability; for example:

$P(A, B) = P(A | B) * P(B)$

This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:

$P(A, B) = P(B, A)$

The conditional probability can be calculated using the joint probability; for example:

$P(A | B) = P(A, B) / P(B)$

The conditional probability is not symmetrical; for example:

$P(A | B) \neq P(B | A)$
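
The product rule and the asymmetry of the conditional probability can be checked numerically. This is a small sketch on a hypothetical joint table; the numbers and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical joint table P(A=i, B=j): rows index A, columns index B.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])
p_a = joint.sum(axis=1)  # marginal P(A)
p_b = joint.sum(axis=0)  # marginal P(B)

# Conditional probability from the joint: P(A | B) = P(A, B) / P(B)
p_a0_given_b0 = joint[0, 0] / p_b[0]
# The reverse conditional: P(B | A) = P(A, B) / P(A)
p_b0_given_a0 = joint[0, 0] / p_a[0]

# Product rule: both conditionals recover the same joint probability.
joint_from_a_given_b = p_a0_given_b0 * p_b[0]
joint_from_b_given_a = p_b0_given_a0 * p_a[0]

print('P(A=0 | B=0) = %.3f' % p_a0_given_b0)
print('P(B=0 | A=0) = %.3f' % p_b0_given_a0)
```

The two conditionals differ because they divide the same joint probability by different marginals.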

We are now up to speed with marginal, joint and conditional probability. If you would like more background on these fundamentals, see the tutorial:

 - __[A Gentle Introduction to Joint, Marginal, and Conditional Probability](https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/)__

# An Alternate Way To Calculate Conditional Probability

Now, there is another way to calculate the conditional probability.

Specifically, one conditional probability can be calculated using the other conditional probability; for example:

$P(A|B) = P(B|A) * P(A) / P(B)$

The reverse is also true; for example:

$P(B|A) = P(A|B) * P(B) / P(A)$

This alternate approach of calculating the conditional probability is useful either when the joint probability is challenging to calculate (which is most of the time), or when the reverse conditional probability is available or easy to calculate.
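
A minimal sketch of this calculation is shown below. The helper name `bayes_rule` and the input values are assumptions for illustration; the values were picked so that they are consistent with a joint probability P(A, B) = 0.10.

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) from the reverse conditional and the two marginals."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: P(A) = 0.30, P(B) = 0.40, P(A, B) = 0.10
p_a = 0.30
p_b = 0.40
p_b_given_a = 0.10 / 0.30  # P(B|A) = P(A, B) / P(A)

p_a_given_b = bayes_rule(p_b_given_a, p_a, p_b)
print('P(A|B) = %.3f' % p_a_given_b)
```

The result matches what the joint probability would give directly, P(A, B) / P(B), without the joint being passed to the function.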

This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes Theorem, named for Reverend Thomas Bayes, who is credited with first describing it. It is grammatically correct to refer to it as Bayes’ Theorem (with the apostrophe), but it is common to omit the apostrophe for simplicity.

 - __Bayes Theorem__: Principled way of calculating a conditional probability without the joint probability.

It is often the case that we do not have access to the denominator directly, e.g. P(B).

We can calculate it in an alternative way; for example:

$P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)$

Substituting this expansion for P(B) gives a formulation of Bayes Theorem that does not need P(B) directly:

$P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))$

Note: the brackets are required, because the whole sum is the denominator; it is simply the expansion of P(B) given above.

As such, if we have P(A), then we can calculate P(not A) as its complement; for example:

$P(not A) = 1 - P(A)$

Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for example:

$P(B|not A) = 1 - P(not B|not A)$
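
The expanded form, together with the two complements, can be sketched as a single function. The name `bayes_theorem`, its parameter names and the input values below are hypothetical and only serve to illustrate the arithmetic.

```python
def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
    """Return P(A|B) when P(B) is not available directly."""
    # complements
    p_not_a = 1.0 - p_a
    p_b_given_not_a = 1.0 - p_not_b_given_not_a
    # expansion of the denominator: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    # Bayes Theorem
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.90, P(not B|not A) = 0.95
result = bayes_theorem(p_a=0.01, p_b_given_a=0.90, p_not_b_given_not_a=0.95)
print('P(A|B) = %.3f%%' % (result * 100))
```

With a small prior P(A), the posterior stays fairly low even though P(B|A) is high, because the P(B|not A) * P(not A) term dominates the denominator.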

Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the meaning of the terms in the equation.

# Naming the Terms in the Theorem

The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.

It can be helpful to think about the calculation from these different perspectives, as this makes it easier to map your problem onto the equation.

Firstly, in general, the result P(A|B) is referred to as the posterior probability and P(A) is referred to as the prior probability.

 - __P(A|B)__: Posterior probability.
 - __P(A)__: Prior probability.

Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.

 - __P(B|A)__: Likelihood.
 - __P(B)__: Evidence.

This allows Bayes Theorem to be restated as:

Posterior = Likelihood * Prior / Evidence

We can make this clear with a smoke and fire case.

> __What is the probability that there is fire given that there is smoke?__

Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the Evidence:

$P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)$

You can imagine the same situation with rain and clouds.
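
As a worked illustration of the smoke and fire case, assume some made-up values: P(Fire) = 0.01, P(Smoke|Fire) = 0.9 and P(Smoke) = 0.1. These numbers are assumptions chosen only to show the arithmetic, not measurements:

$P(Fire|Smoke) = 0.9 * 0.01 / 0.1 = 0.09$

Under these assumed values, seeing smoke raises the probability of fire from the prior of 1% to a posterior of 9%.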

Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at a scenario where we can calculate it.

# Naive Bayes Classifier

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

Applied directly, Bayes Theorem requires P(X1, X2, …, Xn | class), in which each input variable may depend on all of the other variables. This is a cause of complexity in the calculation. We can remove this dependence and treat each input variable as independent of the others, given the class.

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

This means that we calculate P(data|class) for each input variable separately and multiply the results together, for example:

$P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class) / P(data)$

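We can also drop the probability of observing the data, as it is a constant for all classes. The result is then proportional to the posterior rather than equal to it, which is enough to choose the most likely class, for example:

$P(class | X1, X2, …, Xn) \propto P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class)$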
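
Dropping P(data) rescales every class by the same constant, so the predicted class is unchanged. The sketch below checks this on two hypothetical class scores; the values in `scores` are made up for illustration.

```python
import numpy as np

# Hypothetical values of P(X1|class) * P(X2|class) * P(class) for two classes.
scores = np.array([0.00348, 0.00001])

# Dividing by the sum of the scores plays the role of P(data)
# and turns the scores into probabilities that sum to one.
posteriors = scores / scores.sum()

# The most likely class is the same before and after normalising.
predicted_from_scores = int(np.argmax(scores))
predicted_from_posteriors = int(np.argmax(posteriors))

print('posteriors:', posteriors)
print('predicted class:', predicted_from_posteriors)
```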

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.

The word “naive” is French and typically has a diaeresis (umlaut) over the “i”, which is commonly left out for simplicity, and “Bayes” is capitalized as it is named for Reverend Thomas Bayes.

For tutorials on how to implement Naive Bayes from scratch in Python see:

 - __[How to Develop a Naive Bayes Classifier from Scratch in Python](https://machinelearningmastery.com/classification-as-conditional-probability-and-the-naive-bayes-algorithm/)__
 
## Original Version

In [10]:
# example of preparing and making a prediction with a naive bayes model
from sklearn.datasets import make_blobs
from scipy.stats import norm
from numpy import mean
from numpy import std

# fit a probability distribution to a univariate data sample
def fit_distribution(data):
    # estimate parameters
    mu = mean(data)
    sigma = std(data)
    # fit distribution
    dist = norm(mu, sigma)
    return dist

# calculate the independent conditional probability
def probability(X, prior, dist1, dist2):
    return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# sort data into classes
Xy0 = X[y == 0]
y0 = y[y==0]
Xy1 = X[y == 1]
y1 = y[y==1]
# calculate priors
priory0 = len(Xy0) / len(X)
priory1 = len(Xy1) / len(X)
# create PDFs for y==0
distX1y0 = fit_distribution(Xy0[:, 0])
distX2y0 = fit_distribution(Xy0[:, 1])
# create PDFs for y==1
distX1y1 = fit_distribution(Xy1[:, 0])
distX2y1 = fit_distribution(Xy1[:, 1])
# score one example from each class against the class 0 model
X1sample, y1sample = Xy0[0], y0[0]
X2sample, y2sample = Xy1[0], y1[0]

# unnormalised score P(X1|y=0) * P(X2|y=0) * P(y=0) for each sample
X1py0 = probability(X1sample, priory0, distX1y0, distX2y0)
X2py0 = probability(X2sample, priory0, distX1y0, distX2y0)

# the "Truth" value is abs(y - 1): it is 1 when the sample's label is 0, else 0
print('X1 -> P(y=0 | %s) = %.3f' % (X1sample, X1py0*100))
print('X1 -> Truth: y=%d' % abs(y1sample - 1))
print('\nX2 -> P(y=0 | %s) = %.3f' % (X2sample, X2py0*100))
print('X2 -> Truth: y=%d' % abs(y2sample - 1))

X1 -> P(y=0 | [-0.79415228  2.10495117]) = 0.348
X1 -> Truth: y=1

X2 -> P(y=0 | [-9.15155186 -4.81286449]) = 0.000
X2 -> Truth: y=0
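
The cell above scores samples against class 0 only. A sketch of the full two-class comparison, using the same dataset and the same per-feature normal fits, might look like the following. The helper names `fit_class` and `score` are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_blobs

# same synthetic dataset as in the cell above
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

def fit_class(X_class, n_total):
    # class prior plus one normal distribution per feature
    prior = len(X_class) / n_total
    dists = [norm(X_class[:, j].mean(), X_class[:, j].std())
             for j in range(X_class.shape[1])]
    return prior, dists

def score(x, prior, dists):
    # unnormalised score: P(class) * product of the per-feature densities
    return prior * np.prod([d.pdf(v) for d, v in zip(dists, x)])

# fit one model per class, then score the first sample under both
models = {label: fit_class(X[y == label], len(X)) for label in (0, 1)}
scores = np.array([score(X[0], *models[label]) for label in (0, 1)])

# normalise so the two values sum to one, then pick the larger
posteriors = scores / scores.sum()
predicted = int(np.argmax(posteriors))

print('posteriors:', posteriors)
print('predicted: y=%d, truth: y=%d' % (predicted, y[0]))
```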


## Non-supervised Version

In [4]:
# example of scoring samples against a single class (y==0), with a normal fit and with a KDE
from sklearn.datasets import make_blobs
from scipy.stats import norm, gaussian_kde
from numpy import mean
from numpy import std

# fit a probability distribution to a univariate data sample
def fit_distribution(data):
    # estimate parameters
    mu = mean(data)
    sigma = std(data)
    # fit distribution
    dist = norm(mu, sigma)
    return dist

# calculate the independent conditional probability
def probability(X, prior, dist1, dist2):
    return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

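# calculate the independent conditional probability from two fitted KDEs
def probability_kde(X, prior, dist1, dist2):
    # gaussian_kde returns a length-1 array for a single point, so take element 0
    return prior * dist1(X[0])[0] * dist2(X[1])[0]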

# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# sort data into classes
Xy0 = X[y == 0]
y0 = y[y==0]
Xy1 = X[y == 1]
y1 = y[y==1]
# calculate priors
priory0 = len(Xy0) / len(X)
# create PDFs for y==0
seed_kde1 = gaussian_kde(Xy0[:, 0])
seed_kde2 = gaussian_kde(Xy0[:, 1])
distX1y0 = fit_distribution(Xy0[:, 0])
distX2y0 = fit_distribution(Xy0[:, 1])
# classify one example
X1sample, y1sample = Xy0[0], y0[0]
X2sample, y2sample = Xy1[0], y1[0]

X1py0 = probability(X1sample, priory0, distX1y0, distX2y0)
X2py0 = probability(X2sample, priory0, distX1y0, distX2y0)

# the "Truth" value is abs(y - 1): it is 1 when the sample's label is 0, else 0
print('\nUsing Normal Distribution')
print('X1 -> P(y=0 | %s) = %.3f' % (X1sample, X1py0*100))
print('X1 -> Truth: y=%d' % abs(y1sample - 1))
print('\nX2 -> P(y=0 | %s) = %.3f' % (X2sample, X2py0*100))
print('X2 -> Truth: y=%d' % abs(y2sample - 1))

# repeat the scoring with the KDE-based densities
X1py0 = probability_kde(X1sample, priory0, seed_kde1, seed_kde2)
X2py0 = probability_kde(X2sample, priory0, seed_kde1, seed_kde2)

print('\nUsing KDE')
print('X1 -> P(y=0 | %s) = %.3f' % (X1sample, X1py0*100))
print('X1 -> Truth: y=%d' % abs(y1sample - 1))
print('\nX2 -> P(y=0 | %s) = %.3f' % (X2sample, X2py0*100))
print('X2 -> Truth: y=%d' % abs(y2sample - 1))


Using Normal Distribution
X1 -> P(y=0 | [-0.79415228  2.10495117]) = 0.348
X1 -> Truth: y=1

X2 -> P(y=0 | [-9.15155186 -4.81286449]) = 0.000
X2 -> Truth: y=0

Using KDE
X1 -> P(y=0 | [-0.79415228  2.10495117]) = 0.622
X1 -> Truth: y=1

X2 -> P(y=0 | [-9.15155186 -4.81286449]) = 0.000
X2 -> Truth: y=0
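
The normal fit and the KDE give different scores because the KDE does not assume a bell-shaped density. The sketch below illustrates the difference on a hypothetical bimodal sample (not the notebook's data), where a single normal curve is a poor description.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
# hypothetical sample with two well-separated modes at -2 and +2
sample = np.concatenate([rng.normal(-2.0, 0.5, 200),
                         rng.normal(2.0, 0.5, 200)])

kde = gaussian_kde(sample)
fitted_norm = norm(sample.mean(), sample.std())

# density estimates at the trough between the two modes
kde_at_zero = kde(0.0)[0]             # gaussian_kde returns one value per query point
norm_at_zero = fitted_norm.pdf(0.0)   # the normal fit peaks near the sample mean

print('KDE density at 0:    %.4f' % kde_at_zero)
print('normal density at 0: %.4f' % norm_at_zero)
```

The normal fit places its peak between the modes, where there is little data, while the KDE assigns a low density there.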


## Final Function

In [5]:
def kde_naive_bayes(X, kde_dict, prior):
    # calculate the independent conditional probability for every row of X
    L, _ = np.shape(X)
    prob = np.ones(L) * prior

    for i in range(L):
        for col in kde_dict:
            # gaussian_kde returns a length-1 array for a single point, so take element 0
            prob[i] *= kde_dict[col](X[i, col])[0]
    return prob

def kde_dictionary(X):
    # fit one univariate KDE per feature column, keyed by column index
    _, W = np.shape(X)
    return {col: gaussian_kde(X[:, col]) for col in range(W)}

seed_kde_dict = kde_dictionary(Xy0)

print(np.shape(kde_naive_bayes(Xy0,seed_kde_dict,priory0)))
print(kde_naive_bayes(Xy0,seed_kde_dict,priory0).mean())
print(np.shape(kde_naive_bayes(Xy1,seed_kde_dict,priory0)))
print(kde_naive_bayes(Xy1,seed_kde_dict,priory0).mean())
print(np.shape(kde_naive_bayes(X,seed_kde_dict,priory0)))
print(kde_naive_bayes(X,seed_kde_dict,priory0).mean())

(50,)
0.050993807837646836
(50,)
4.23308571425288e-74
(100,)
0.02549690391882341
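
Scores as small as the 4.2e-74 above show how quickly a product of densities shrinks, and with more features it can underflow to zero. One common workaround, sketched here as an assumption rather than as part of the original notebook, is to sum log densities instead. The function name `kde_naive_bayes_log` and the random training data are hypothetical; `gaussian_kde.logpdf` is the SciPy method used.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_naive_bayes_log(X, kde_dict, prior):
    # log(prior) + sum over features of the log density, for every row of X
    log_prob = np.full(X.shape[0], np.log(prior))
    for col, kde in kde_dict.items():
        log_prob += kde.logpdf(X[:, col])
    return log_prob

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 2))  # hypothetical sample for a single class

# one univariate KDE per feature column, keyed by column index
kde_dict = {col: gaussian_kde(train[:, col]) for col in range(train.shape[1])}

log_scores = kde_naive_bayes_log(train, kde_dict, prior=0.5)

# for comparison: the same quantity computed as a direct product
direct = 0.5 * kde_dict[0](train[:, 0]) * kde_dict[1](train[:, 1])

print(np.shape(log_scores))
print(np.allclose(np.exp(log_scores), direct))
```

Because the logarithm is monotonic, comparing log scores between classes selects the same class as comparing the products.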
