# Bayes 
See lecture slides 03.  
See Duda & Hart section 2.6.  

## Bayes Rule
$P(y=y_i|X) = P(X|y_i) P(y_i) / P(X)$  

Prob of y=yi given X = Prob of X given yi * Prob of yi / Prob of X over all y   
Posterior prob = Class conditional prob * Prior prob / Marginal prob    
Posterior prob = Likelihood of model * Prior prob / Prob of X independent of y    

P(covid|cough) = P(cough|covid) * P(covid) / P(cough)   
P(rain|clouds) = P(clouds|rain) * P(rain) / P(clouds)   

Each X is a vector of attributes.  
Each yi is potentially the label for X. 

#### Denominator is a normalizer
Denominator = $P(X) = \sum_i [ P(X|y_i) P(y_i)]$  

The denominator normalizes the result to a probability in range 0 to 1.  
The marginal probability of X means the probability of X regardless of y.    
The numerator has the likelihood of X under one yi, but
the denominator has the likelihood of X under all y.   
Being Bayesian, we weight each P(X|y) by the corresponding P(y).  

The denominator is the same for every class.  
The denominator can be ignored by a classifier.  
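
A small worked example may make the normalizer concrete. The numbers below are invented purely for illustration (they are not real covid statistics), and the variable names are hypothetical. The sketch computes the numerator for each class, sums the numerators to get P(X), and checks that the winning class is the same with or without the division.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"covid": 0.05, "no_covid": 0.95}             # P(y)
likelihood_cough = {"covid": 0.60, "no_covid": 0.10}   # P(cough | y)

# Numerator of Bayes rule for each class: P(X|y) * P(y)
unnormalized = {y: likelihood_cough[y] * priors[y] for y in priors}

# Denominator: P(X) = sum over classes of P(X|y) * P(y)
evidence = sum(unnormalized.values())

# Posterior P(y|X); the values now sum to 1
posterior = {y: unnormalized[y] / evidence for y in priors}

# A classifier only needs the argmax, which the division does not change
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

print(posterior)
print(best_unnormalized == best_posterior)
```

With these made-up inputs, a cough raises the probability of covid above its prior, yet "no_covid" remains the most probable class.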

#### Derivation
The joint probability of X and y equals the conditional probability
of X given y, times the marginal probability of y.  
$P(X,y) = P(X|y)P(y)$   

By symmetry, you can interchange X and y, so  
$P(X,y) = P(X|y)P(y) = P(y|X)P(X)$   

Thus,    
$P(y|X)P(X)= P(X|y)P(y)$   

Now just divide both sides by P(X)   
$P(y|X) = P(X|y)P(y)/P(X)$
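
The derivation can be checked numerically on a tiny joint table. The table below is hypothetical (invented probabilities for the rain/clouds example), and the helper names are assumptions made for this sketch. Both conditionals are computed directly from the joint, then compared with the Bayes rule expression.

```python
# Hypothetical joint distribution P(X, y); the four entries sum to 1.
joint = {
    ("clouds", "rain"): 0.25,
    ("clouds", "dry"): 0.25,
    ("clear", "rain"): 0.05,
    ("clear", "dry"): 0.45,
}

def p_x(x):
    """Marginal P(X=x): sum the joint over every y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y(y):
    """Marginal P(y): sum the joint over every X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    """P(X|y) = P(X,y) / P(y)"""
    return joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    """P(y|X) = P(X,y) / P(X)"""
    return joint[(x, y)] / p_x(x)

# Left side: posterior read directly from the joint table
lhs = p_y_given_x("rain", "clouds")
# Right side: Bayes rule, P(X|y) P(y) / P(X)
rhs = p_x_given_y("clouds", "rain") * p_y("rain") / p_x("clouds")

print(lhs, rhs)
```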

## Bayesian models

### Bayesian models are generative
To train a model, ignore the denominator.  
Estimate P(y=yi) based on the label frequencies in the training data.  
Learn P(X|y) from the training data, given a yi for each xi.   
By learning a probability distribution, we learn a generating function.  
This makes the model generative.  

### Bayesian models can work on continuous features

Choice 1: Discretize features by binning them, and apply Bayes formula.  
Must be careful, as binning strategy matters.
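
To see why the binning strategy matters, here is a minimal sketch that assumes equal-width bins (one of several possible strategies; the notes do not prescribe one). The function name, the range 60 to 80, and the sample values are all hypothetical.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the index (0 .. n_bins-1) of the equal-width bin holding value."""
    width = (high - low) / n_bins
    index = int((value - low) / width)
    # Clamp values that fall at or beyond the edges of the range
    return min(max(index, 0), n_bins - 1)

temps = [61.0, 64.0, 66.0, 79.0]   # made-up feature values

# Two bins: the first three values look identical to the model
coarse = [equal_width_bin(t, 60.0, 80.0, 2) for t in temps]

# Four bins: 66.0 now lands in a different bin than 61.0 and 64.0
fine = [equal_width_bin(t, 60.0, 80.0, 4) for t in temps]

print(coarse)
print(fine)
```

The same data yields different categorical features, and therefore different estimates of P(X|y), depending on the number of bins.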

Choice 2: Learn parameters for a distribution.  
Assume each feature has a normal distribution of values, per class.  
For feature j and class i:  
$P(X_j=x_j|Y=y_i) $  
$= \mathcal{N}(x_j; \mu_{ij},\sigma_{ij}^2)$  
$= \frac{1}{\sqrt{2 \pi \sigma_{ij}^2}} e^{\frac{-1}{2}[\frac{x_j-\mu_{ij}}{\sigma_{ij}}]^2}$

Estimate the mean and variance, per feature per class, from the $n$ training instances of that class.  
$\mu_{ij} = \sum(x_{ij})/n$  
$\sigma_{ij}^2 = \frac{\sum(x_{ij}-\mu_{ij})^2}{(n-1)}$    

During Bayesian inference, 
given an X,
use the above to compute likelihood P(X|yi).  
Then multiply by the prior P(yi).  
Repeat for each class yi.  
Predict the class with the highest probability.
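
The training and inference steps above can be sketched for a single continuous feature. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `fit`, `gaussian_pdf`, and `predict` are hypothetical, the toy data is invented, and every class needs at least two training instances so the n-1 divisor is non-zero.

```python
import math

def fit(xs, ys):
    """Estimate prior, mean, and sample variance of one feature for each class."""
    model = {}
    for label in sorted(set(ys)):
        values = [x for x, y in zip(xs, ys) if y == label]
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
        model[label] = {"prior": n / len(xs), "mean": mean, "var": var}
    return model

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def predict(model, x):
    """Score each class by likelihood * prior and return the best label."""
    scores = {
        label: gaussian_pdf(x, p["mean"], p["var"]) * p["prior"]
        for label, p in model.items()
    }
    return max(scores, key=scores.get)

# Toy training data (made up): one feature value and one label per instance
xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
ys = ["a", "a", "a", "b", "b", "b"]

model = fit(xs, ys)
print(predict(model, 1.1))
print(predict(model, 2.9))
```

The scores are unnormalized posteriors, since the denominator P(X) is skipped.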

## Naive Bayes Classifier
NBC is: Probabilistic. Generative. Predictive.  

See Duda & Hart 2.6

Not enough data to represent every feature combination? Assume independent features.    
Addresses problem of exponential complexity but insufficient data.   

Without the assumption,    
$P(X|y_i) = P(X_1, X_2, \dots, X_d|y_i)$, a full joint distribution 
that needs a probability for every feature combination.   
With the assumption,    
$P(X|y_i) = \prod_j P(X_j|y_i)$    

Assumes independence of features, which is unlikely.    
But NBC is fairly robust to violations of the assumption.  
Robust to irrelevant features -- they affect every class equally.  

Robust to missing feature values -- just skip them while computing products.  
Assume only a few features are missing per test instance.   
Assume "missingness" is random i.e. not correlated to the class.
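
A minimal sketch of the product rule with missing values skipped. The probability tables, feature names, and class names are hypothetical, invented for this illustration. A missing value is represented here as `None` (an assumption of this sketch), and its factor is simply left out of the product.

```python
# Hypothetical per-feature conditional probabilities P(X_j = value | y)
cond = {
    "spam": {"has_link": {True: 0.8, False: 0.2},
             "all_caps": {True: 0.6, False: 0.4}},
    "ham":  {"has_link": {True: 0.3, False: 0.7},
             "all_caps": {True: 0.1, False: 0.9}},
}
priors = {"spam": 0.4, "ham": 0.6}   # hypothetical P(y)

def score(x, label):
    """P(y) * product over observed features of P(X_j | y)."""
    total = priors[label]
    for feature, value in x.items():
        if value is None:       # missing value: leave this factor out
            continue
        total *= cond[label][feature][value]
    return total

def classify(x):
    """Return the class with the highest unnormalized posterior."""
    return max(priors, key=lambda label: score(x, label))

print(classify({"has_link": True, "all_caps": True}))
print(classify({"has_link": None, "all_caps": False}))
```

Skipping a factor for every class keeps the comparison fair, which is why the method relies on the missingness being unrelated to the class.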

Feature correlation undermines the assumption.  
But to negatively affect predictions, 
features need to be correlated differently in different classes.   
For example, price correlates to volume for milk but not caviar.   
This is unlikely to escape your notice.   
Thus, Naive Bayes is somewhat robust to violations of the assumption. 

### Pseudocounts for NB with categorical features
The training data may have zero instances of category c in some feature j in class i.  
If we use zero probability,
this entire product becomes zero, and so does the posterior for class i:  
$P(X|y_i) = \prod_j P(X_j|y_i)$

Solution 0: Frequentist estimate without adjustment (can be zero)  
$P(X_i = c | y) = (n_{cy}) / (n_{y})$ 

Solution 1: Laplace estimate increases the numerator and denominator  
$P(X_i = c | y) = (n_{cy} + p) / (n_{y} + p v_i)$   
where  
$n_{cy}$ = number of class=y instances where feature i has value c  
$n_{y}$ = number of instances in class y  
$p$ = a pseudocount, usually 1   
$v_i$ = number of distinct values of feature i, so $p v_i$ = total pseudocounts  

Solution 2: m-estimate (generalizes Laplace with p=1: set $p = 1/v_i$ and $m = v_i$)  
$P(X_i = c | y) = (n_{cy} + mp) / (n_{y} + m)$   
where  
$n_{cy}$ = number of class=y instances where feature i has value c  
$n_{y}$ = number of instances in class y  
$p$ = a prior guess of the non-zero probability of value c   
$m$ = hyper-parameter weight for p representing confidence in p  
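
The three estimates can be compared side by side. This is a sketch with hypothetical function names and invented counts. In `laplace_estimate` the denominator adds one pseudocount `p` per distinct value, which is just `n_y + v` for the usual `p = 1`.

```python
def frequency_estimate(n_cy, n_y):
    """Solution 0: raw relative frequency (can be zero)."""
    return n_cy / n_y

def laplace_estimate(n_cy, n_y, v, p=1):
    """Solution 1: add pseudocount p to each of the v distinct values."""
    return (n_cy + p) / (n_y + p * v)

def m_estimate(n_cy, n_y, m, p):
    """Solution 2: blend the observed counts with a prior guess p, weighted by m."""
    return (n_cy + m * p) / (n_y + m)

# Hypothetical counts: a feature with 3 distinct values, 10 instances in class y
counts = [0, 4, 6]
n_y = sum(counts)
v = len(counts)

raw = [frequency_estimate(c, n_y) for c in counts]
smoothed = [laplace_estimate(c, n_y, v) for c in counts]
blended = [m_estimate(c, n_y, m=v, p=1 / v) for c in counts]

print(raw)        # first entry is zero
print(smoothed)   # no zeros, still sums to 1
print(blended)    # equals the Laplace result when m = v and p = 1/v
```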

## Bayesian Decision Theory
From Bayes rule: $P(y=y_i|X) = P(X|y_i) P(y_i) / P(X)$    
Ignore the denominator (it is the same for every class).  
Use the log (it is monotonic, so the ranking of classes is unchanged).   
Use natural log to cancel the e in the Gauss pdf.  
Result:

$g_i(x) = ln(p(x|y_i)) + ln(p(y_i))$    

In the two-class case, our model is a dichotomizer.  

Decision rule:   
Classify x as class i if $g_i(x) > g_j(x)$  

Equivalently, classify x as class i if $g(x) > 0$ where $g(x) = g_i(x) - g_j(x)$  

Decision boundary:   
Surface where $g_i(x) = g_j(x)$ (a hyperplane when the discriminants are linear in x)  

### Multivariate Gaussian   
$\bar X \sim \mathcal{N} (\bar \mu,\Sigma)$ with density  
$p(\bar X) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (\bar X - \bar \mu)^{T} \Sigma^{-1} (\bar X - \bar \mu)\right]$  
   
Dichotomizer:   
$g_i(x) = ln(p(x|y_i)) + ln(p(y_i))$    

Use M for the squared Mahalanobis distance:   
$M = [(x - \mu)^{T} \Sigma^{-1} (x - \mu)]$

Substitute a Gaussian for the PDF p(x|y), where $\mu$, $\Sigma$, and M are those of class i:   
$g_i(x) = [ln(\mathcal{N})] + ln(p(y_i))$  
$g_i(x) = [ln(\operatorname{numerator}) - ln(\operatorname{denominator}) + \operatorname{exponent}] + ln(p(y_i))$  
$g_i(x) = ln(1)-ln(|\Sigma|^{1/2} (2\pi)^{D/2}) + \frac{-1}{2}M  + ln(p(y_i))$   
$g_i(x) = 0 -\frac{1}{2}ln(|\Sigma|) -\frac{D}{2}ln(2\pi) -M/2 + ln(p(y_i))$   
The term $-\frac{D}{2}ln(2\pi)$ is the same for every class i, so ignore it.   
$g_i(x) = -\frac{1}{2}ln(|\Sigma|) -M/2 + ln(p(y_i))$
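
As a sanity check on the derivation, the sketch below evaluates the simplified discriminant and the full log of likelihood times prior for two hypothetical 2-D classes. The means, covariance matrices, priors, and function names are all invented for illustration, and the 2x2 determinant and inverse are hand-coded to keep the example free of dependencies. The two quantities should differ by the dropped constant, -(D/2) ln(2 pi), for every class.

```python
import math

def det2(S):
    """Determinant of a 2x2 matrix."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

def inv2(S):
    """Inverse of a 2x2 matrix."""
    d = det2(S)
    return [[S[1][1] / d, -S[0][1] / d],
            [-S[1][0] / d, S[0][0] / d]]

def mahalanobis_sq(x, mu, S):
    """M = (x - mu)^T S^{-1} (x - mu) for 2-D vectors."""
    Si = inv2(S)
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (Si[0][0] * d0 + Si[0][1] * d1)
            + d1 * (Si[1][0] * d0 + Si[1][1] * d1))

def discriminant(x, mu, S, prior):
    """g_i(x) = -1/2 ln|S| - M/2 + ln P(y_i), constant term dropped."""
    return (-0.5 * math.log(det2(S))
            - 0.5 * mahalanobis_sq(x, mu, S)
            + math.log(prior))

def log_joint(x, mu, S, prior):
    """ln p(x|y_i) + ln P(y_i) using the full Gaussian density with D = 2."""
    D = 2
    log_pdf = (-0.5 * math.log(det2(S))
               - (D / 2) * math.log(2 * math.pi)
               - 0.5 * mahalanobis_sq(x, mu, S))
    return log_pdf + math.log(prior)

# Hypothetical class parameters
classes = {
    "y0": {"mu": [0.0, 0.0], "S": [[1.0, 0.5], [0.5, 2.0]], "prior": 0.7},
    "y1": {"mu": [2.0, 1.0], "S": [[2.0, -0.3], [-0.3, 1.0]], "prior": 0.3},
}
x = [1.0, 0.5]

# The gap is the same constant for every class, so the ranking is unchanged
gaps = {k: log_joint(x, **c) - discriminant(x, **c) for k, c in classes.items()}
by_g = max(classes, key=lambda k: discriminant(x, **classes[k]))
by_joint = max(classes, key=lambda k: log_joint(x, **classes[k]))

print(gaps)
print(by_g == by_joint)
```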

#### Case 1: Features are independent (no covariance), features have same variance 
Graphical interpretation in 2D:   
every PDF (circle) has different mean but the same radius.
Boundary is a line perpendicular to the segment between the circle centers.

The decision boundary is a hyperplane orthogonal to the line between the means.   
The placement of the hyperplane depends on the priors.  

From above, we have:  
$g_i(x) = -\frac{1}{2}ln(|\Sigma|) -M/2 + ln(p(y_i))$

$\Sigma$ is the same for every class i, so ignore the first term of $g_i(x)$.  

Features are independent and share one variance $\sigma^2$,
so the covariance matrix is diagonal:   
$\Sigma = \sigma^2 I$  

Also $\Sigma^{-1} = \frac{1}{\sigma^2} I$   

So, the squared Mahalanobis distance M,   
$M = [(x - \mu_i)^{T} \Sigma^{-1} (x - \mu_i)]$

reduces to a scaled squared Euclidean distance:   
$M = \frac{(x-\mu_i)^2}{\sigma^2}$  
where $(x-\mu_i)^2$ is shorthand for $(x-\mu_i)^{T}(x-\mu_i)$.

and the second term is   
$-\frac{(x-\mu_i)^2}{2\sigma^2}$   

So 
$g_i(x) = -\frac{(x-\mu_i)^2}{2\sigma^2} + ln(p(y_i))$   
$g_i(x) = -\frac{x^2-2x\mu_i+\mu_i^2}{2\sigma^2} + ln(p(y_i))$   

But $x^2$ is independent of class i so it can be ignored.  
$g_i(x) = -\frac{-2x\mu_i+\mu_i^2}{2\sigma^2} + ln(p(y_i))$   
$g_i(x) = x\frac{\mu_i}{\sigma^2} - \frac{\mu_i^2}{2\sigma^2} + ln(p(y_i))$   
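$g_i(x) = w_i x + b_i + c_i$   
where $w_i = \mu_i/\sigma^2$, $b_i = -\mu_i^2/(2\sigma^2)$, and $c_i = ln(p(y_i))$.  
Thus, $g_i(x)$ is a linear discriminant
where $w_i$ is the weight vector 
and $b_i + c_i$ is the bias or threshold.   
The prior (last term: ln(p(yi))) merely shifts the decision plane along the line between the means.  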

If x is midway between the means of class i and class j,
and both classes have equal variance,
then it is equally likely to have come from either class,
and the decision depends entirely on the priors (the last term).

The lecture slides arrive at the same result by different math.  
With equal priors, ln(P1/P2) = ln(1) = 0 erases the prior-dependent term,
leaving the boundary at half the sum of the means,
i.e. at the midpoint between them.

Note.   
An extreme prior can push the boundary past one of the means.
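
The effect of the priors on the Case 1 boundary can be sketched for a scalar x. Setting the two linear discriminants equal and solving for x gives the `boundary` function below; the function names and all numeric values are hypothetical, chosen for illustration.

```python
import math

def g(x, mu, sigma2, prior):
    """Linear discriminant: x*mu/sigma^2 - mu^2/(2 sigma^2) + ln P(y)."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + math.log(prior)

def boundary(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Solve g_i(x) = g_j(x) for scalar x."""
    midpoint = (mu_i + mu_j) / 2
    shift = sigma2 * math.log(prior_i / prior_j) / (mu_i - mu_j)
    return midpoint - shift

mu_i, mu_j, sigma2 = 0.0, 4.0, 1.0   # hypothetical means and shared variance

# Equal priors: the boundary sits at the midpoint of the means
x_equal = boundary(mu_i, mu_j, sigma2, 0.5, 0.5)

# Class i is more probable a priori: the boundary moves toward mu_j
x_skewed = boundary(mu_i, mu_j, sigma2, 0.9, 0.1)

# Extreme prior: the boundary lands beyond mu_j
x_extreme = boundary(mu_i, mu_j, sigma2, 0.9999, 0.0001)

print(x_equal, x_skewed, x_extreme)
```

In the extreme case, even a point sitting exactly on the mean of class j is assigned to class i.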

#### Case 2: Every class has the same covariance matrix 
Graphical interpretation in 2D:   
every PDF (ellipse) has different mean but same size and orientation.
If the ellipses are "tilted" right, so is the decision boundary.

Same as case 1:   
The decision boundary is still linear: a line, plane, or hyperplane.   
Boundary placement depends on the priors.    
An extreme prior can push the boundary past one of the means.

Different from case 1:   
The plane may be "tilted" i.e. not orthogonal to the line between the means.   
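
The tilt can be illustrated numerically. With a shared covariance matrix, the difference of the two discriminants is linear in x, and the normal vector of the boundary works out to $\Sigma^{-1}(\mu_i - \mu_j)$; this is the standard result for this case rather than something derived in these notes. The sketch below uses hypothetical 2-D means and covariances and checks whether that normal is parallel to the line between the means (a zero 2-D cross product means parallel, so the boundary is orthogonal to that line).

```python
def inv2(S):
    """Inverse of a 2x2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def boundary_normal(mu_i, mu_j, S):
    """w = S^{-1} (mu_i - mu_j): normal vector of the linear boundary."""
    Si = inv2(S)
    d = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]
    return [Si[0][0] * d[0] + Si[0][1] * d[1],
            Si[1][0] * d[0] + Si[1][1] * d[1]]

def cross2(u, v):
    """2-D cross product; zero when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]

mu_i, mu_j = [0.0, 0.0], [2.0, 0.0]          # hypothetical means
means_line = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]

# Case 1 style covariance (identity): normal is parallel to the means line
w_case1 = boundary_normal(mu_i, mu_j, [[1.0, 0.0], [0.0, 1.0]])

# Case 2 style covariance (correlated features): normal is rotated
w_case2 = boundary_normal(mu_i, mu_j, [[1.0, 0.8], [0.8, 1.0]])

tilt_case1 = cross2(w_case1, means_line)
tilt_case2 = cross2(w_case2, means_line)

print(tilt_case1, tilt_case2)
```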

#### Case 3: Features have arbitrary variance and covariance 
Graphical interpretation in 2D:   
every PDF (ellipse) has different mean, different size, different orientation.

The decision boundary is quadratic in x (a hyperquadric),
for example a hypersphere, hyperellipsoid, or hyperparaboloid.

For two PDFs with same mean but different variance,
the contour lines are concentric.
Points close in are likely from the PDF with low variance,
but points far out are likely from the PDF with high variance.
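
The concentric case can be sketched in one dimension. Both hypothetical classes below share a mean of 0 and have equal priors, but one has variance 1 and the other variance 9 (invented values). The discriminant keeps the log-variance term because it now differs per class. The boundary is a pair of points at plus and minus `boundary`, not a single hyperplane.

```python
import math

def g(x, mu, var, prior):
    """1-D discriminant: -1/2 ln(var) - (x - mu)^2 / (2 var) + ln P(y)."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(prior)

# Hypothetical classes: (mean, variance, prior)
classes = {"narrow": (0.0, 1.0, 0.5), "wide": (0.0, 9.0, 0.5)}

def classify(x):
    """Return the class whose discriminant is largest at x."""
    return max(classes, key=lambda name: g(x, *classes[name]))

# Setting the two discriminants equal gives x^2 (1 - 1/9) = ln(9)
boundary = math.sqrt(math.log(9.0) / (1.0 - 1.0 / 9.0))

print(classify(0.5))     # close in
print(classify(5.0))     # far out
print(classify(-5.0))    # far out on the other side
print(boundary)
```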

## Bayesian Belief Network
Also called Bayesian Network.   
From Duda & Hart.   
Also [Jason Brownlee](https://machinelearningmastery.com/introduction-to-bayesian-belief-networks/)

This network model tries to get around the Naive assumption.   
What if features are not independent?   
Take a middle road and let certain chosen features depend on others.

P(a) is independent. P(b) is independent.    
P(d) depends on b: P(d|b).     
P(c) depends on a, d: P(c|a,d).   

The BN must be designed with domain expertise.   
A BN is a probabilistic graphical model, like an HMM, whose structure is a directed acyclic graph (DAG).   
The graph models conditional independence (no edges) and dependence (edges).   
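
For the four-node example above, the graph implies the factorization P(a,b,c,d) = P(a) P(b) P(d|b) P(c|a,d). The sketch below encodes it with plain dictionaries for binary variables. Every probability in the tables is hypothetical, and this is hand-rolled illustration code, not the API of any Bayesian network library.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
# p_d_given_b[b][d] = P(d | b)
p_d_given_b = {True: {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}
# p_c_given_ad[(a, d)][c] = P(c | a, d)
p_c_given_ad = {
    (True, True): {True: 0.95, False: 0.05},
    (True, False): {True: 0.5, False: 0.5},
    (False, True): {True: 0.4, False: 0.6},
    (False, False): {True: 0.1, False: 0.9},
}

def joint(a, b, c, d):
    """P(a, b, c, d) as the product of one factor per node."""
    return p_a[a] * p_b[b] * p_d_given_b[b][d] * p_c_given_ad[(a, d)][c]

states = [True, False]

# A valid factorization sums to 1 over all 16 assignments
total = sum(joint(a, b, c, d) for a, b, c, d in product(states, repeat=4))

# Marginal P(d=True), obtained by summing out a, b, and c
p_d_true = sum(joint(a, b, c, True) for a, b, c in product(states, repeat=3))

print(total, p_d_true)
```

The tables hold 1 + 1 + 2 + 4 = 8 free parameters, versus 15 for an unrestricted joint over four binary variables.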

Python implementation: PyMC3 on Theano.