# Bayes 

## Bayes Rule
$P(y=y_i|X) = P(X|y_i) P(y_i) / P(X)$  

One data point X is represented by a vector of attributes.  
The ground truth label y takes one of several classes, indexed by i and each denoted yi.  
In words, the formula says:   
Prob of yi given X = Prob of X given yi * Prob of yi / Prob of X over all y   
In other words:  
Prob that X is in class yi = Prob of X when yi is the class * Prob of class yi / Prob of X independent of y   
In other words:  
Posterior probability = Class conditional probability * Prior probability / Marginal probability  

#### Normalizer
Denominator = $P(X) = \sum_i [ P(X|y_i) P(y_i)]$  

The denominator normalizes the result to a probability in range 0 to 1.  
The marginal probability of X means the probability of X regardless of y.    
The numerator has the likelihood of X under one yi, but
the denominator has the likelihood of X under all y.   
Remember to weight each P(X|y) by the corresponding P(y).  
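
As a small numeric illustration of the normalizer, the sketch below computes posteriors for two hypothetical classes (`spam`, `ham`). The priors and likelihoods are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def posterior(likelihoods, priors):
    """Return P(y_i | X) for every class, given P(X | y_i) and P(y_i)."""
    # Numerator of Bayes rule for each class: P(X | y_i) * P(y_i)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Denominator: P(X) is the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers, for illustration only
priors = {"spam": 0.25, "ham": 0.75}
likelihoods = {"spam": 0.5, "ham": 0.125}  # P(X | y_i) for one observed X

post = posterior(likelihoods, priors)
print(post)
```

Note that `ham` has the larger prior but `spam` ends up with the larger posterior, because the likelihood of X under `spam` is four times higher.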

#### Derivation
The joint probability of X and y equals the conditional probability
of X given y, times the marginal probability of y.  
$P(X,y) = P(X|y)P(y)$   

By symmetry, you can interchange X and y, so  
$P(X,y) = P(X|y)P(y) = P(y|X)P(X)$   
or  
$P(X|y)P(y) = P(y|X)P(X)$   

Now just divide both sides by P(X)...   
$P(y|X) = P(X|y)P(y)/P(X)$
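
The derivation can be checked numerically. The sketch below assumes a small, made-up joint table over two X values and two classes, then computes P(y|X) once through Bayes rule and once directly from the joint.

```python
# Hypothetical joint distribution P(X, y); the four entries sum to 1
joint = {
    ("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
    ("x1", "y0"): 0.4, ("x1", "y1"): 0.2,
}


def marginal(index, value):
    """Sum the joint over the other variable (index 0 is X, index 1 is y)."""
    return sum(p for key, p in joint.items() if key[index] == value)


def bayes_posterior(x, y):
    """P(y | X) via Bayes rule: P(X | y) P(y) / P(X)."""
    p_y = marginal(1, y)
    p_x = marginal(0, x)
    p_x_given_y = joint[(x, y)] / p_y
    return p_x_given_y * p_y / p_x


def direct_posterior(x, y):
    """P(y | X) from the definition of conditional probability: P(X, y) / P(X)."""
    return joint[(x, y)] / marginal(0, x)


for x, y in joint:
    print(x, y, bayes_posterior(x, y), direct_posterior(x, y))
```

Both columns agree up to floating-point rounding, which is what the identity $P(X|y)P(y) = P(y|X)P(X)$ predicts.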

## Bayesian models

### Bayesian models are generative
To train a model, ignore the denominator.  
Estimate P(y=yi) based on the label frequencies in the training data.  
Learn P(X|y) from the training data, given a yi for each xi.   
By learning a probability distribution, we learn a generating function.  
This makes the model generative.  
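
A minimal sketch of this training step, assuming a single categorical feature and toy data invented for the example. The helper names `fit` and `sample` are hypothetical. Sampling a new (x, y) pair from the learned distributions is what "generative" means in practice.

```python
import random
from collections import Counter, defaultdict


def fit(xs, ys):
    """Estimate P(y) and P(x | y) by counting, for one categorical feature."""
    n = len(ys)
    class_counts = Counter(ys)
    priors = {c: k / n for c, k in class_counts.items()}
    value_counts = defaultdict(Counter)
    for x, y in zip(xs, ys):
        value_counts[y][x] += 1
    likelihoods = {
        c: {v: k / class_counts[c] for v, k in counts.items()}
        for c, counts in value_counts.items()
    }
    return priors, likelihoods


def sample(priors, likelihoods, rng):
    """Generate one (x, y) pair: draw y from P(y), then x from P(x | y)."""
    y = rng.choices(list(priors), weights=list(priors.values()))[0]
    dist = likelihoods[y]
    x = rng.choices(list(dist), weights=list(dist.values()))[0]
    return x, y


# Toy training data (hypothetical)
xs = ["red", "red", "blue", "blue", "blue", "red"]
ys = ["a", "a", "a", "b", "b", "b"]
priors, likelihoods = fit(xs, ys)
print(priors)
print(likelihoods)
print(sample(priors, likelihoods, random.Random(0)))
```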

### Bayesian models can work on continuous features

Choice 1: Discretize features by binning them, and apply Bayes formula.  
Must be careful, as binning strategy matters.
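
To see why the binning strategy matters, here is a sketch of equal-width binning, which is only one of several possible strategies. The function name and the sample heights are assumptions made for the example.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0 .. n_bins - 1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in one bin
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


heights = [150.0, 155.0, 160.0, 170.0, 180.0, 190.0]
print(equal_width_bins(heights, 2))
print(equal_width_bins(heights, 4))
```

With 2 bins, 155 and 160 share a category; with 4 bins they do not. The categorical probabilities that Bayes formula sees therefore change with the bin count.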

Choice 2: Learn parameters for a distribution.  
Assume each feature has a normal distribution of values, per class.  
For feature j and class i:  
$P(X_j=x_j|Y=y_i) $  
$= \mathcal{N}(x_j; \mu_{ij},\sigma_{ij})$  
$= \frac{1}{\sqrt{2 \pi \sigma_{ij}^2}} e^{-\frac{1}{2}[\frac{x_j-\mu_{ij}}{\sigma_{ij}}]^2}$

Estimate the mean and standard deviation of each feature, per class, from the training data.  
With the sums running over the $n_i$ training instances of class i:  
$\mu_{ij} = \sum(x_{ij})/n_i$  
$\sigma_{ij} = \sqrt{ \sum(x_{ij}-\mu_{ij})^2/(n_i-1) }$

During Bayesian inference, 
given an X,
use the above to compute likelihood P(X|yi).  
Then multiply by the prior P(yi).  
Repeat for each class yi.  
Predict the class with the highest probability.
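
The fit and inference steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes each class has at least two training rows and non-zero variance, and it treats features as independent within a class, so per-feature likelihoods are multiplied. It sums logs instead of multiplying probabilities to avoid underflow, which does not change the predicted class. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Return per-class priors and per-class, per-feature (mean, std) pairs."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    priors, params = {}, {}
    for label, rows in by_class.items():
        n = len(rows)
        priors[label] = n / len(X)
        stats = []
        for column in zip(*rows):
            mu = sum(column) / n
            # Sample variance with n - 1 in the denominator (needs n >= 2)
            var = sum((v - mu) ** 2 for v in column) / (n - 1)
            stats.append((mu, math.sqrt(var)))
        params[label] = stats
    return priors, params


def log_gaussian(x, mu, sigma):
    """Log of the normal pdf with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def predict(x, priors, params):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    scores = {
        label: math.log(priors[label])
        + sum(log_gaussian(v, mu, s) for v, (mu, s) in zip(x, params[label]))
        for label in priors
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical): two features, two classes
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y = ["a", "a", "a", "b", "b", "b"]
priors, params = fit_gaussian_nb(X, y)
print(predict([1.1, 2.1], priors, params))
print(predict([3.1, 3.9], priors, params))
```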

## Naive Bayes Classifier
Probabilistic.  
Generative.  
Predictive.  
Robust to irrelevant features -- they affect every class equally.  
Robust to missing feature values -- just skip them while computing products.  
(Assume only a few features are missing per test instance
and missingness of a feature is not correlated to the class.)

Assumes independence of features.  
Fairly robust to violations of the assumption.  

Addresses the problem of exponential complexity with insufficient data:   
Estimating $P(X|y_i)$ directly requires a probability for every joint combination of feature values, and the number of combinations grows exponentially with the number of features.     

Solution: naively assume the features of X are independent.  
With this assumption,    
$P(X|y_i) = \prod_j P(X_j|y_i)$

Feature correlation can undermine the assumption.  
But to negatively affect predictions, 
features need to be correlated differently in different classes.   
Thus, Naive Bayes is somewhat robust to violations of the assumption. 
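
To make the complexity argument concrete, the sketch below counts the free parameters needed to model $P(X|y_i)$ with and without the independence assumption. It assumes every feature is categorical with the same number of values. The function names are hypothetical.

```python
def full_joint_params(n_features, n_values, n_classes):
    """Free parameters to tabulate P(X | y) with no independence assumption."""
    # One probability per joint combination, minus one because they sum to 1
    return n_classes * (n_values ** n_features - 1)


def naive_params(n_features, n_values, n_classes):
    """Free parameters when P(X | y) factorizes into per-feature terms."""
    return n_classes * n_features * (n_values - 1)


for d in (5, 10, 20):
    print(d, full_joint_params(d, 2, 2), naive_params(d, 2, 2))
```

With 20 binary features and 2 classes, the full table needs about two million probabilities while the naive model needs 40, which is why the naive model can be estimated from far less data.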

### Pseudocounts for NB with categorical features
The training data may have zero instances of category c in some feature j in class i.  
If we use zero probability,
this entire product becomes zero, as does the posterior for class i:  
$P(X|y_i) = \prod_j P(X_j|y_i)$

Solution 0: Frequentist estimate without adjustment (can be zero)  
$P(X_j = c | y) = (n_{cy}) / (n_{y})$ 

Solution 1: Laplace estimate increases the numerator and denominator  
$P(X_j = c | y) = (n_{cy} + p) / (n_{y} + p v_j)$   
where  
$n_{cy}$ = number of class=y instances where feature j has value c  
$n_{y}$ = number of instances in class y  
$p$ = a pseudocount, usually 1   
$v_j$ = number of distinct values of feature j  

Solution 2: m-estimate (a smoothed estimate that blends the observed frequency with a prior guess)  
$P(X_j = c | y) = (n_{cy} + mp) / (n_{y} + m)$   
where  
$n_{cy}$ = number of class=y instances where feature j has value c  
$n_{y}$ = number of instances in class y  
$p$ = a prior guess of the non-zero probability of value c   
$m$ = hyper-parameter weight for p representing confidence in p  
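
A short sketch of both smoothed estimates, using made-up counts. The Laplace denominator is written as $n_y + p \cdot v$ so that the estimates sum to 1 over the v values for any pseudocount; with the usual p = 1 this is simply $n_y + v$. The function and argument names are hypothetical.

```python
def laplace_estimate(n_cy, n_y, v, p=1):
    """(n_cy + p) / (n_y + p * v): add p pseudocounts to each of the v values."""
    return (n_cy + p) / (n_y + p * v)


def m_estimate(n_cy, n_y, prior, m):
    """(n_cy + m * prior) / (n_y + m): blend the counts with a prior guess."""
    return (n_cy + m * prior) / (n_y + m)


# Hypothetical counts: value c never appears among 10 instances of class y,
# and the feature has 3 distinct values.
print(laplace_estimate(0, 10, 3))
print(m_estimate(0, 10, prior=1 / 3, m=3))
```

Both calls give 1/13 rather than zero. Choosing a uniform prior guess of 1/v with weight m = v makes the m-estimate coincide with the Laplace estimate for p = 1.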

## Bayesian Decision Theory


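From Bayes rule: $P(y=y_i|X) = P(X|y_i) P(y_i) / P(X)$    
Ignore the denominator (it is the same for every class, so it does not change which class scores highest).  
Take the log (monotonic, so the ranking of the classes is preserved).   
Use the natural log so that it cancels the exponential in the Gaussian pdf.  
This gives the discriminant function: $g_i(x) = \ln p(x|y_i) + \ln p(y_i)$    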

Decision rule:   
In the two-class case, our model is a dichotomizer.  
Classify x as class i if $g_i(x) > g_j(x)$  
Equivalently, classify x as class i if $g(x) > 0$ where $g(x) = g_i(x) - g_j(x)$  

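Decision boundary:   
The surface where $g_i(x) = g_j(x)$.  
With Gaussian class models, it is a hyperplane when the classes share the same covariance matrix and a quadratic surface otherwise.  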
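
The decision rule can be sketched for a single feature with Gaussian class models. The class parameters below are hypothetical, chosen with equal variances and equal priors so that the boundary falls midway between the two means.

```python
import math


def discriminant(x, mu, sigma, prior):
    """g_i(x) = ln p(x | y_i) + ln p(y_i) for a univariate Gaussian class model."""
    log_likelihood = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                      - 0.5 * ((x - mu) / sigma) ** 2)
    return log_likelihood + math.log(prior)


def dichotomizer(x, class_i, class_j):
    """g(x) = g_i(x) - g_j(x); positive means class i, negative means class j."""
    return discriminant(x, *class_i) - discriminant(x, *class_j)


# Hypothetical class models: (mean, standard deviation, prior)
class_i = (0.0, 1.0, 0.5)
class_j = (2.0, 1.0, 0.5)

for x in (0.5, 1.0, 1.5):
    print(x, dichotomizer(x, class_i, class_j))
```

For these parameters g(x) simplifies to 2 - 2x, so it is positive below x = 1, zero at x = 1 (the decision boundary), and negative above.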

Multivariate Gaussian:   
$p(\bar X) = \mathcal{N} (\bar X; \bar \mu,\Sigma)=\frac{1}{\sqrt{|\Sigma|} (2\pi)^{D/2}} \exp[-\frac{1}{2} (\bar X - \bar \mu)^{T}\Sigma^{-1}(\bar X - \bar \mu)]$  
where D is the number of features and $|\Sigma|$ is the determinant of the covariance matrix.  
   
Discriminant function:   
$g_i(x) = \ln p(x|y_i) + \ln p(y_i)$    

Substitute Gauss (where $M = (x - \bar \mu_i)^{T}\Sigma_i^{-1}(x - \bar \mu_i)$ is the squared Mahalanobis distance):   
$g_i(x)=\ln(\mathcal{N}) + \ln p(y_i)$  
$= \ln(1)-\ln(|\Sigma_i|^{1/2} (2\pi)^{D/2}) - \frac{1}{2}M  + \ln p(y_i)$   
$= 0 -\frac{1}{2}\ln|\Sigma_i| -\frac{D}{2}\ln(2\pi) -\frac{M}{2} + \ln p(y_i)$
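
The final expression can be evaluated term by term. The sketch below restricts itself to two features so that the determinant and inverse of the covariance matrix can be written out by hand. The function name and the class parameters are hypothetical.

```python
import math


def discriminant_2d(x, mu, cov, prior):
    """g_i(x) for a 2-D Gaussian class model with a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Squared Mahalanobis distance: dx^T inv dx
    m = sum(dx[r] * inv[r][s] * dx[s] for r in range(2) for s in range(2))
    dim = 2
    return (-0.5 * math.log(det)
            - 0.5 * dim * math.log(2 * math.pi)
            - 0.5 * m
            + math.log(prior))


identity = [[1.0, 0.0], [0.0, 1.0]]
x = (0.5, 0.5)
g_i = discriminant_2d(x, (0.0, 0.0), identity, 0.5)
g_j = discriminant_2d(x, (2.0, 2.0), identity, 0.5)
print("class i" if g_i > g_j else "class j")
```

Because both classes here share the same covariance and prior, the determinant, $2\pi$, and prior terms cancel in $g_i - g_j$, and the point is assigned to the class whose mean is closer in Mahalanobis distance.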

