# Bayesian Learning

Bayesian reasoning provides a probabilistic approach to 
inference. It is based on the assumption that the quantities of interest 
are governed by probability distributions and that optimal 
decisions can be made by reasoning about these 
probabilities together with observed data.

Bayesian learning methods are relevant to our study of 
machine learning for two reasons:
 - Bayesian learning algorithms that calculate explicit probabilities 
for hypotheses, such as the naive Bayes classifier, are among the 
most practical approaches to certain types of learning problems.
 - Bayesian methods provide a useful perspective for understanding many 
learning algorithms that do not explicitly manipulate 
probabilities.

## Features of Bayesian learning methods

 - Each observed training example can incrementally
decrease or increase the estimated probability that a
hypothesis is correct.
 - Prior knowledge can be combined with observed data
to determine the final probability of a hypothesis.
 - Bayesian methods can accommodate hypotheses that
make probabilistic predictions.
 - New instances can be classified by combining the
predictions of multiple hypotheses, weighted by their
probabilities.
 - Even in cases where Bayesian methods prove
computationally intractable, they can provide a
standard of optimal decision making against which
other practical methods can be measured.



# Bayes Theorem
If $E_1, E_2, \ldots, E_n$ are $n$ mutually disjoint events with $P(E_i) \neq 0 ~ \forall i$, then for any event $A$ that is a subset of $\bigcup_{i=1}^{n} E_i$ with $P(A) > 0$:

$\large P(E_i|A) = \frac{P(E_i)P(A|E_i)}{\sum_{j = 1}^{n}P(E_j)P(A|E_j)} = \frac{P(E_i)P(A|E_i)}{P(A)}$

Here $P(A) = \sum_{j = 1}^{n}P(E_j)P(A|E_j)$ follows from the theorem of total probability.
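As a quick numeric check of the formula, the following minimal sketch computes the posteriors $P(E_i|A)$ from a list of priors and likelihoods. The function name `bayes_posterior` and the numbers (three machines and their defect rates) are hypothetical and chosen only for illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(E_i | A) for every i, given P(E_i) and P(A | E_i)."""
    # Numerators of Bayes theorem: P(E_i) * P(A | E_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: theorem of total probability
    p_a = sum(joint)
    if p_a == 0:
        raise ValueError("P(A) must be positive")
    return [j / p_a for j in joint]


# Hypothetical example: machines E_1, E_2, E_3 produce 50%, 30%, 20% of
# all items, with defect rates of 1%, 2%, 3%. A is "the item is defective".
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]

posteriors = bayes_posterior(priors, likelihoods)
for i, p in enumerate(posteriors, start=1):
    print(f"P(E_{i} | A) = {p:.4f}")
```

The posteriors sum to 1 because the denominator is exactly the sum of the numerators.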

A <b>prior probability</b> is the initial probability assigned to an event 
before any additional information is obtained.

A <b>posterior probability</b> is a probability that has been 
revised using additional information obtained later.


Bayes theorem can also be stated in terms of a hypothesis $h$ and training data $D$:

$\large P(h|D) = \frac{P(D|h)P(h)}{P(D)}$

 - $P(h)$ - prior probability of hypothesis h
 - $P(D)$ - prior probability of training data D
 - $P(h|D)$ - posterior probability of h given D
 - $P(D|h)$ - probability of D given h (the likelihood)

# Choosing Hypothesis
Generally we want the most probable hypothesis given the training data.

<b>Maximum a posteriori hypothesis</b> $h_{MAP}$

$$\large h_{MAP} = \underset{h \in H}{\operatorname{arg\,max}} ~ P(h|D) $$
$$\large \qquad \quad = \underset{h \in H}{\operatorname{arg\,max}} ~\frac{P(D|h)P(h)}{P(D)}$$
$$\large \qquad \quad = \underset{h \in H}{\operatorname{arg\,max}}~ {P(D|h)P(h)}$$

The last step drops $P(D)$ because it is a constant that does not depend on $h$.

If we assume that every hypothesis is equally probable a priori, $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$, we can simplify further and choose the <b>maximum likelihood (ML)</b> hypothesis:

$$\large h_{ML} = \underset{h_i \in H}{\operatorname{arg\,max}}~ P(D|h_i)$$
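The difference between the two choices shows up whenever the prior is far from uniform. The sketch below compares them on a two-hypothesis problem; the helper names `h_map` and `h_ml` and all probabilities are assumptions made up for illustration, where $D$ is a single positive test result.

```python
def h_map(hypotheses, prior, likelihood):
    """Hypothesis maximizing P(D|h) * P(h); P(D) is omitted as a constant."""
    return max(hypotheses, key=lambda h: likelihood[h] * prior[h])


def h_ml(hypotheses, likelihood):
    """Hypothesis maximizing P(D|h) alone (equivalent to a uniform prior)."""
    return max(hypotheses, key=lambda h: likelihood[h])


# Hypothetical numbers: D is "the test came back positive".
hypotheses = ["disease", "healthy"]
prior = {"disease": 0.008, "healthy": 0.992}     # P(h)
likelihood = {"disease": 0.98, "healthy": 0.03}  # P(D | h)

map_choice = h_map(hypotheses, prior, likelihood)
ml_choice = h_ml(hypotheses, likelihood)
print("MAP hypothesis:", map_choice)
print("ML hypothesis: ", ml_choice)
```

With these numbers the two criteria disagree: the likelihood alone favors `disease`, while the small prior on `disease` makes `healthy` the MAP hypothesis.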

# Brute Force MAP Learning Algorithm
 1. For each hypothesis $h$ in $H$, calculate the posterior probability $\large P(h|D) = \frac{P(D|h)P(h)}{P(D)}$
 2. Output the hypothesis $\large h_{MAP} = \underset{h \in H}{\operatorname{arg\,max}} ~ P(h|D)$
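The two steps can be sketched directly for a finite hypothesis space. This is a minimal illustration, not a reference implementation: the function `brute_force_map`, the coin-bias hypotheses, and the flip sequence are all hypothetical, and $P(D)$ is obtained by summing $P(D|h)P(h)$ over $H$.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return (h_MAP, posteriors) for a finite list of hypotheses.

    prior(h) gives P(h); likelihood(data, h) gives P(D|h).
    """
    # Step 1: posterior of every hypothesis via Bayes theorem.
    unnormalized = [likelihood(data, h) * prior(h) for h in hypotheses]
    p_data = sum(unnormalized)  # P(D) by total probability
    if p_data == 0:
        raise ValueError("P(D) is zero: no hypothesis explains the data")
    posteriors = [u / p_data for u in unnormalized]
    # Step 2: output the hypothesis with the highest posterior.
    best_index = max(range(len(hypotheses)), key=lambda i: posteriors[i])
    return hypotheses[best_index], posteriors


def coin_likelihood(data, h):
    """P(D|h) for independent flips, where h is the probability of heads."""
    p = 1.0
    for flip in data:
        p *= h if flip == "H" else 1 - h
    return p


# Hypothetical setup: three candidate coin biases with a uniform prior.
biases = [0.2, 0.5, 0.8]
flips = "HHTH"

best, posteriors = brute_force_map(
    biases, lambda h: 1 / len(biases), coin_likelihood, flips
)
print("h_MAP:", best)
print("posteriors:", [round(p, 4) for p in posteriors])
```

The cost grows linearly with $|H|$, which is why this approach is only practical for small hypothesis spaces.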
 
Let us make the following assumptions:
1. The training data $D$ is noise free (i.e., $d_i = c(x_i)$).
2. The target concept $c$ is contained in the hypothesis space $H$.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.

What values should we specify for $P(h)$?

$P(h) = \frac{1}{|H|}$ for all h in H

What choice shall we make for $P(D|h)$?

$P(D|h)$ is the probability of observing the target values $D = \langle d_1, \ldots, d_m \rangle$ for the fixed set of instances $\langle x_1, \ldots, x_m \rangle$, given a world in which hypothesis $h$ holds.


$$
\large P(D|h) = 
\begin{cases}
1 & \text{if } d_i = h(x_i) \text{ for all } d_i \text{ in } D \\
0 & \text{otherwise}
\end{cases}
$$

In other words, the probability of data $D$ given hypothesis $h$ is 1 if $D$ is consistent with $h$ and 0 otherwise.

If $h$ is inconsistent with $D$:

$$P(h|D) = \frac{0 \cdot P(h)}{P(D)} = 0$$

The posterior probability of a hypothesis inconsistent with D is zero. 

Now consider the case where $h$ is consistent with $D$:

$$
P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)}
$$
where, by the theorem of total probability, $P(D) = \sum_{h_i \in H} P(D|h_i)P(h_i)$
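To see these quantities on a concrete case, the sketch below evaluates $P(D)$ and $P(h|D)$ numerically under the three assumptions above. The hypothesis space (threshold concepts $h_t(x) = [x \geq t]$ for $t = 0, \ldots, 5$) and the two training examples are hypothetical choices made only for illustration.

```python
from fractions import Fraction

# Hypothetical hypothesis space: h_t(x) is True exactly when x >= t.
hypotheses = {t: (lambda x, t=t: x >= t) for t in range(6)}

# Noise-free training data as (x_i, d_i) pairs (illustrative values).
data = [(1, False), (4, True)]


def likelihood(h, data):
    """P(D|h): 1 if h agrees with every label in D, 0 otherwise."""
    return 1 if all(h(x) == d for x, d in data) else 0


# Uniform prior: P(h) = 1 / |H|
prior = Fraction(1, len(hypotheses))

# P(D) = sum over h_i in H of P(D|h_i) * P(h_i)
p_data = sum(likelihood(h, data) * prior for h in hypotheses.values())

# Posterior of every hypothesis via Bayes theorem.
posterior = {
    t: likelihood(h, data) * prior / p_data for t, h in hypotheses.items()
}

print("P(D) =", p_data)
for t, p in posterior.items():
    print(f"t={t}: P(h|D) = {p}")
```

Here only the thresholds 2, 3, and 4 are consistent with the data, so they share the posterior mass equally while every inconsistent hypothesis gets zero.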
