## Probability for Machine Learning

> Notes based on blog by Jason Brownlee: https://machinelearningmastery.com/bayes-theorem-for-machine-learning/

---

### Basic concepts
<img src="https://i.imgur.com/l3GDwZR.jpeg" width="600" height="700"/>
<img src="https://i.imgur.com/eX5ELeX.jpeg" width="600" height="700"/>


### Worked Example for Calculating Bayes Theorem


Scenario: Consider a human population that may or may not have cancer (Cancer is True or False) and a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. a mammogram for detecting breast cancer.

<img src="https://i.imgur.com/MZ7lHIA.jpeg" width="600" height="800"/>
<img src="https://i.imgur.com/QdgcEA7.jpeg" width="700" height="400"/>


The calculation suggests that if a patient tests positive with this test, there is only a 0.339% chance that they actually have cancer.

The example also shows that the calculation of the conditional probability requires enough information.

For example, if we have the values used in Bayes Theorem already, we can use them directly.

This is rarely the case, and we typically have to calculate the bits we need and plug them in, as we did in this case. In our scenario we were given 3 pieces of information: the base rate (P(cancer)), the sensitivity (or true positive rate, P(+ve | cancer)), and the specificity (or true negative rate, P(-ve | not cancer)).

- Sensitivity: 85% of people with cancer will get a positive test result.
- Base Rate: 0.02% of people have cancer.
- Specificity: 95% of people without cancer will get a negative test result.


### Bayes Theorem in Python

```python
# A: cancer, B: test; we need P(A|B)

def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
    """
    Inputs:
    p_a: base rate
    p_b_given_a: TPR: sensitivity
    p_not_b_given_not_a : TNR: specificity

    Returns P(A|B)
    """
    # complements: P(not A) and the false positive rate P(B|not A)
    p_not_a = 1 - p_a
    p_b_given_not_a = 1 - p_not_b_given_not_a

    # total probability of a positive test, P(B)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a

    # Bayes Theorem
    p_a_given_b = (p_b_given_a * p_a) / p_b

    return p_a_given_b

p_a = 0.0002
p_b_given_a = 0.85
p_not_b_given_not_a = 0.95

result = bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a)
print('P(A|B) = %.3f%%' % (result * 100))
```

```
P(A|B) = 0.339%
```


### Binary Classifier Terminology


It may be helpful to think about the cancer test example in terms of the common terms from binary (two-class) classification, i.e. where notions of specificity and sensitivity come from.

![](https://i.imgur.com/0nKYJTi.png)

Recall that in a previous section we calculated the false positive rate as the complement of the true negative rate, or FPR = 1.0 – TNR.

Some of these rates have special names, for example:

- Sensitivity = TPR
- Specificity = TNR

We can map these rates onto familiar terms from Bayes Theorem (A: cancer B: test):

- P(B|A): True Positive Rate (TPR).
- P(not B|not A): True Negative Rate (TNR).
- P(B|not A): False Positive Rate (FPR).
- P(not B|A): False Negative Rate (FNR).

We can also map the base rates for the condition (class) and the treatment (prediction) onto familiar terms from Bayes Theorem:

- P(A): Probability of a Positive Class (PC).
- P(not A): Probability of a Negative Class (NC).
- P(B): Probability of a Positive Prediction (PP).
- P(not B): Probability of a Negative Prediction (NP).

Now, let’s consider Bayes Theorem using these terms:

```
P(A|B) = P(B|A) * P(A) / P(B)

P(A|B) = (TPR * PC) / PP

Where we often cannot calculate P(B), so we use an alternative:

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = TPR * PC + FPR * NC

```
Now, let’s look at our scenario of cancer and a cancer detection test.

- True Positive Rate (TPR): 85%
- False Positive Rate (FPR): 5% (1-TNR)
- True Negative Rate (TNR): 95%
- False Negative Rate (FNR): 15% (1-TPR)

Let’s also review what we know about base rates:

- Positive Class (PC): 0.02%
- Negative Class (NC): 99.98%


```
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = TPR * PC + FPR * NC
P(B) = 85% * 0.02% + 5% * 99.98%
P(B) = 5.016%


P(A|B) = P(B|A) * P(A) / P(B)
P(A|B) = TPR * PC / PP
P(A|B) = 85% * 0.02% / 5.016%
P(A|B) = 0.339%
```

#### Conditional probabilities in terms of precision and recall

When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are.

Mapping this onto the cancer example:

- B: the test returned positive.
- A: the patient actually has cancer.

So P(A|B) tells us how valid the positive results are.

precision = TP/(TP+FP) = TPR*PC/PP = P(A|B), as we have seen earlier.

recall = TP/(TP+FN) = TPR*PC/PC = P(B|A) * P(A)/P(A) = P(B|A). Given that cancer is present, this is the probability that the test detects it. In the search example: given that 60 relevant pages exist, it is the fraction of them that are returned.
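
The link between the confusion-matrix counts and the conditional probabilities can be checked with a minimal sketch. It uses the rates from the cancer scenario; the population size of 1,000,000 is an arbitrary assumption chosen only to make the expected counts easy to read.

```python
# Expected confusion-matrix counts for the cancer-test scenario.
# The population size is an assumption for illustration, not part of the scenario.
population = 1_000_000
pc = 0.0002   # base rate, P(A)
tpr = 0.85    # sensitivity, P(B|A)
tnr = 0.95    # specificity, P(not B|not A)

positives = population * pc          # people with cancer
negatives = population - positives   # people without cancer

tp = tpr * positives   # cancer, test positive
fn = positives - tp    # cancer, test negative
tn = tnr * negatives   # no cancer, test negative
fp = negatives - tn    # no cancer, test positive

precision = tp / (tp + fp)   # equals P(A|B)
recall = tp / (tp + fn)      # equals P(B|A), i.e. the TPR

print(f"TP={tp:.0f} FP={fp:.0f} FN={fn:.0f} TN={tn:.0f}")
print(f"precision = {precision:.3%}")
print(f"recall    = {recall:.1%}")
```

The precision should agree with the 0.339% obtained from Bayes Theorem, and the recall with the 85% sensitivity, because the population size cancels out of both ratios.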


### How Bayes Theorem fits in MAP and MLE

Bayes Theorem is a useful tool in applied machine learning. It provides a way of thinking about the relationship between data and a model.

A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. In this way, a model can be thought of as a hypothesis about the relationships in the data, such as the relationship between input (X) and output (y). The practice of applied machine learning is the testing and analysis of different hypotheses (models) on a given dataset.

Bayes Theorem provides a probabilistic model to describe the relationship between data (D) and a hypothesis (h); for example:

P(h|D) = P(D|h) * P(h) / P(D)

Breaking this down, it says that the probability of a given hypothesis holding or being true given some observed data can be calculated as the probability of observing the data given the hypothesis multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

Under this framework, each piece of the calculation has a specific name; for example:

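- P(h|D): Posterior probability of the hypothesis (the thing we want to calculate).
- P(h): Prior probability of the hypothesis.
- P(D|h): Likelihood of the data given the hypothesis.
- P(D): Evidence, the probability of the data regardless of the hypothesis.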

If we have some prior domain knowledge about the hypothesis, this is captured in the prior probability. If we don’t, then all hypotheses may have the same prior probability.

With the other terms held fixed, if the probability of observing the data P(D) increases, then the probability of the hypothesis holding given the data P(h|D) decreases. Conversely, if the probability of the hypothesis P(h) and the probability of observing the data given the hypothesis P(D|h) increase, the probability of the hypothesis holding given the data P(h|D) increases.

The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, … in H) being true given the observed data.

Searching for the hypothesis with the maximum posterior probability is called maximum a posteriori estimation, or MAP for short.

Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation to give the simplified unnormalized estimate as follows:

`h_MAP = argmax {h in H} P(h|D) = argmax {h in H} P(D|h) * P(h)`

If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, and this term too will be a constant and can be removed from the calculation to give the following:

`h_ML = argmax {h in H} P(D|h)`

In the MAP framework we want the hypothesis with the maximum posterior probability amongst all possible hypotheses. With a uniform prior, that hypothesis coincides with the maximum likelihood hypothesis.


That is, the goal is to locate a hypothesis that best explains the observed data. Fitting models like linear regression for predicting a numerical value, and logistic regression for binary classification can be framed and solved under the MAP probabilistic framework. This provides an alternative to the more common maximum likelihood estimation (MLE) framework.
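
To make the effect of keeping or dropping the prior concrete, here is a minimal sketch over a small discrete hypothesis space. Everything in it is an assumption for illustration: the three hypotheses are candidate values for a coin's probability of heads, and the prior and the observed data are made up.

```python
# Hypothetical setup: three candidate hypotheses about a coin's P(heads).
hypotheses = [0.3, 0.5, 0.7]
prior = {0.3: 0.1, 0.5: 0.8, 0.7: 0.1}   # assumed prior P(h)
heads, flips = 7, 10                     # assumed observed data D

def likelihood(h):
    """P(D|h) up to the binomial coefficient, which is the same for every h."""
    return h ** heads * (1 - h) ** (flips - heads)

# Unnormalized posterior P(D|h) * P(h) for each hypothesis.
unnormalized = {h: likelihood(h) * prior[h] for h in hypotheses}

# Normalizing constant, proportional to P(D); identical for every h,
# so it does not change which hypothesis wins.
evidence = sum(unnormalized.values())
posterior = {h: u / evidence for h, u in unnormalized.items()}

h_map = max(hypotheses, key=lambda h: unnormalized[h])  # uses the prior
h_ml = max(hypotheses, key=likelihood)                  # ignores the prior

print("MAP hypothesis:", h_map)
print("ML hypothesis: ", h_ml)
```

With these made-up numbers the strong prior on a fair coin outweighs the data, so the MAP hypothesis and the maximum likelihood hypothesis differ. Replacing the prior with a uniform one makes the two agree.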

There are two probabilistic frameworks that underlie many different machine learning algorithms.
- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

The objective of both of these frameworks in the context of machine learning is to locate the hypothesis that best accounts for the training dataset.
MAP answers the question: What is the most probable hypothesis given the training data? MLE answers a closely related one: Under which hypothesis is the training data most probable?

Both approaches frame the problem of fitting a model as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

Given the simplification of Bayes Theorem to a proportional quantity, we can use it to estimate the hypothesis and parameters (theta) that best explain our dataset (X), stated as:

`P(theta | X) ∝ P(X | theta) * P(theta)`

Maximizing this quantity over a range of theta solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the mode of the distribution). As such, this technique is referred to as “maximum a posteriori estimation,” or MAP estimation for short, and sometimes simply “maximum posterior estimation.”

`maximize P(X | theta) * P(theta)`
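
As a rough sketch of maximizing `P(X | theta) * P(theta)` over a range of theta, the example below does a simple grid search for a Bernoulli parameter. The data (7 heads in 10 flips) and the Beta(2, 2) prior are assumptions chosen for illustration, and working in log space is a common numerical convenience rather than something the notes prescribe.

```python
import math

heads, flips = 7, 10   # assumed observed data X
a, b = 2.0, 2.0        # assumed Beta(a, b) prior on theta

def log_likelihood(theta):
    """log P(X|theta) for Bernoulli data, dropping the constant binomial coefficient."""
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)

def log_prior(theta):
    """log P(theta) for a Beta(a, b) prior, dropping the normalizing constant."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

# Candidate values of theta on a grid that avoids log(0) at the endpoints.
grid = [i / 1000 for i in range(1, 1000)]

theta_mle = max(grid, key=log_likelihood)
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))

print(f"MLE estimate: {theta_mle:.3f}")
print(f"MAP estimate: {theta_map:.3f}")
```

For this particular prior the posterior mode has the closed form (heads + a - 1) / (flips + a + b - 2) = 8/12, so the grid search should land close to 0.667, while the maximum likelihood estimate is heads / flips = 0.7. The prior pulls the estimate towards 0.5.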

Now that we are familiar with the MAP framework, we can take a closer look at the related concept of the Bayes optimal classifier.

#### Bayes Optimal Classifier

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example, given the training dataset. This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function. Specifically, the Bayes optimal classifier answers the question - What is the most probable classification of the new instance given the training data?

Compare the two questions. MAP asks: What is the most probable hypothesis given the training data? The Bayes optimal classifier asks: What is the most probable classification of the new instance given the training data? Rather than seeking the single most probable hypothesis (model), we are interested in making a specific prediction.

The equation below demonstrates how to calculate the conditional probability of a candidate classification (vj) for a new instance, given the training data (D) and a space of hypotheses (H).

`P(vj | D) = sum {hi in H} P(vj | hi) * P(hi | D)`

Where vj is a candidate classification for the new instance, H is the set of hypotheses for classifying the instance, hi is a given hypothesis, P(vj | hi) is the probability of vj according to hypothesis hi, and P(hi | D) is the posterior probability of the hypothesis hi given the data D.

For example, suppose we have 3 hypotheses, h1, h2 and h3, and a candidate class label vj:

`P(vj|D) = P(vj|h1)*P(h1|D) + P(vj|h2)*P(h2|D) + P(vj|h3)*P(h3|D)`

Here vj is a candidate class label, so P(vj|D) gives the probability of that class. Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.
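
A small numeric sketch may help show why this differs from simply using the MAP hypothesis. The posterior values and the per-hypothesis predictions below are hypothetical numbers chosen for illustration, and each hypothesis is assumed to predict one label with certainty.

```python
# Assumed posterior probabilities of three hypotheses, P(hi|D).
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Assumed P(vj|hi): here each hypothesis predicts a single label with certainty.
p_v_given_h = {
    "h1": {"pos": 1.0, "neg": 0.0},
    "h2": {"pos": 0.0, "neg": 1.0},
    "h3": {"pos": 0.0, "neg": 1.0},
}

labels = ["pos", "neg"]

# P(vj|D) = sum over hi of P(vj|hi) * P(hi|D)
p_v_given_d = {
    v: sum(p_v_given_h[h][v] * posterior_h[h] for h in posterior_h)
    for v in labels
}

# Bayes optimal classification: the label with the highest P(vj|D).
bayes_optimal = max(p_v_given_d, key=p_v_given_d.get)

# For comparison: the prediction of the single MAP hypothesis.
h_map = max(posterior_h, key=posterior_h.get)
map_prediction = max(p_v_given_h[h_map], key=p_v_given_h[h_map].get)

print("P(v|D):", p_v_given_d)
print("Bayes optimal prediction:", bayes_optimal)
print("MAP hypothesis prediction:", map_prediction)
```

In this made-up case h1 is the most probable single hypothesis and predicts `pos`, but the combined weight of h2 and h3 makes `neg` the more probable classification, so the two approaches disagree.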

> Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique, on average.
In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.


We have to let that sink in.

It is a big deal.

It means that any other algorithm that operates on the same data, the same set of hypotheses, and same prior probabilities cannot outperform this approach, on average. Hence the name “optimal classifier.”

Although the classifier makes optimal predictions, it is not perfect given the uncertainty in the training data and incomplete coverage of the problem domain and hypothesis space. As such, the model will make errors. These errors are often referred to as Bayes errors.

__Bayes Error__: The minimum possible error that can be made when making predictions.

Because of the computational cost of this optimal strategy, we instead can work with direct simplifications of the approach.

Naive Bayes: Assume that variables in the input data are conditionally independent.



### Bayes Theorem for Classification

Classification is a predictive modeling problem that involves assigning a label to a given input data sample. The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample, for example:

- P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data. This calculation can be performed for each class in the problem and the class that is assigned the largest probability can be selected and assigned to the input data.

__In practice, it is very challenging to calculate full Bayes Theorem for classification.__

The priors for the class and the data are easy to estimate from a training dataset, if the dataset is suitably representative of the broader problem.

Estimating the conditional probability of the observation given the class, P(data|class), is not feasible unless the number of examples is extraordinarily large, e.g. large enough to effectively estimate the probability distribution for all different possible combinations of values. This is almost never the case; we will not have sufficient coverage of the domain.

As such, the direct application of Bayes Theorem also becomes intractable, especially as the number of variables or features (n) increases.
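
To get a feel for how quickly this grows, the sketch below counts the free parameters needed to tabulate P(data|class) directly, compared with treating the features as conditionally independent given the class. The helper names and the restriction to discrete features with a fixed number of values are assumptions made for illustration.

```python
def full_joint_params(n_features, n_classes, n_values=2):
    """Free parameters needed to tabulate P(data|class) with no independence assumption."""
    # One probability per combination of feature values, minus one per class
    # because each class-conditional distribution must sum to 1.
    return n_classes * (n_values ** n_features - 1)

def independent_params(n_features, n_classes, n_values=2):
    """Free parameters when features are treated as conditionally independent given the class."""
    return n_classes * n_features * (n_values - 1)

# Two classes, binary features.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n, 2), independent_params(n, 2))
```

The first count doubles with every added binary feature, while the second grows only linearly, which is the motivation for the simplification described next.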






### Naive Bayes Classifier

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

Applying Bayes Theorem in full requires P(data|class) for the joint combination of all input variables, which allows each input variable to depend upon all other variables. This is a cause of complexity in the calculation. We can remove this dependence and treat each input variable as independent of the others, given the class.

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

This means that we calculate P(data|class) for each input variable separately and multiply the results together, for example:

`P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class) / P(data)`

We can also drop the probability of observing the data as it is a constant for all calculations, leaving a score that is proportional to the posterior, for example:

`P(class | X1, X2, …, Xn) ∝ P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class)`

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.
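
The calculation can be sketched from scratch for categorical features by counting. The toy dataset, the function names `fit` and `predict_scores`, and the lack of smoothing are all assumptions for illustration; a practical implementation would usually add smoothing so that an unseen feature value does not force a score to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (outlook, windy, play). The last column is the class.
rows = [
    ("sunny", "no", "yes"),
    ("sunny", "yes", "no"),
    ("rainy", "yes", "no"),
    ("rainy", "no", "yes"),
    ("sunny", "no", "yes"),
    ("rainy", "yes", "no"),
    ("sunny", "yes", "yes"),
    ("rainy", "no", "no"),
]

def fit(rows):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        features, label = row[:-1], row[-1]
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_scores(features, class_counts, feature_counts):
    """Return P(X1|class) * ... * P(Xn|class) * P(class) for each class."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # P(class)
        for i, value in enumerate(features):
            score *= feature_counts[label][i][value] / count  # P(Xi|class)
        scores[label] = score
    return scores

class_counts, feature_counts = fit(rows)
scores = predict_scores(("sunny", "no"), class_counts, feature_counts)
prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores restores P(class|data) under the naive assumption.
normalized = {c: s / sum(scores.values()) for c, s in scores.items()}

print(scores)
print(normalized)
print("prediction:", prediction)
```

The unnormalized scores are enough to pick the class, which is why P(data) can be dropped. Normalizing is only needed when the probabilities themselves are of interest.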

The word “naive” is French and typically has a diaeresis (umlaut) over the “i”, which is commonly left out for simplicity, and “Bayes” is capitalized as it is named for Reverend Thomas Bayes.



