# Bayesian classification

$$
\newcommand\given[1][]{\:#1\mid\:}
\def\*#1{\mathbf{#1}}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
$$




Suppose we observe data $\*x = (x_1, ..., x_m)$ of $m$ "features" and we want to assign probabilities to each of $K$ different classes $\{C_1, ..., C_K\}$.

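Bayes' theorem lets us obtain the class probabilities $p(C_k \given \*x)$ by inverting the class-conditional probabilities $p(\*x \given C_k)$: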

$$
p(C_k \given \*x) = \frac{p(C_k) p(\*x \given C_k)}{p(\*x)}
$$

Recall the plain-English names for these terms:

$$
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
$$
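
As a minimal numeric sketch of the formula above, assume three classes with made-up priors and made-up likelihood values for a single observation. All numbers here are illustrative and do not come from any dataset.

```python
import numpy as np

# Hypothetical values for K = 3 classes (illustrative only).
prior = np.array([0.5, 0.3, 0.2])          # p(C_k)
likelihood = np.array([0.10, 0.40, 0.25])  # p(x | C_k) for one observed x

joint = prior * likelihood                 # numerator of Bayes' theorem
evidence = joint.sum()                     # p(x) = sum_k p(C_k) p(x | C_k)
posterior = joint / evidence               # p(C_k | x)

print(posterior, posterior.sum())
```

The evidence is just the sum of the numerators over all classes, which is why the posterior sums to one.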

### Bayes classification

We can construct a classifier from this conditional probability with a decision rule. One common rule is to choose the hypothesis (class) that is most probable. This is known as the "maximum a posteriori" or MAP decision rule. Applying this gives us a **Bayes classifier**: a function that assigns the class label $\hat{y} = C_k$, where $k$ is given by:

$$
\argmax_{k \in \{1, ..., K\}} p(C_k) p(\*x \given C_k)
$$



What about the evidence term in the denominator?
It is irrelevant to the classification decision: it is the same for all classes $C_k$, so it does not change which class maximizes the posterior.
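
The following sketch checks this numerically in log space, which is how such quantities are usually computed in practice. The priors and likelihood values are hypothetical.

```python
import numpy as np

# Hypothetical values: 4 observations (rows), 3 classes (columns).
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_lik = np.log(np.array([
    [0.10, 0.40, 0.25],
    [0.30, 0.05, 0.05],
    [0.02, 0.02, 0.30],
    [0.20, 0.30, 0.10],
]))

log_joint = log_prior + log_lik                  # log p(C_k) + log p(x | C_k)
map_unnormalized = log_joint.argmax(axis=1)

# Subtracting log p(x) shifts every entry of a row by the same constant,
# so the position of the row maximum cannot change.
log_evidence = np.log(np.exp(log_joint).sum(axis=1, keepdims=True))
log_posterior = log_joint - log_evidence
map_normalized = log_posterior.argmax(axis=1)

print(map_unnormalized, map_normalized)
```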

### Uncertainty estimates

Notice that the MAP decision rule **throws away information**. The most probable class is only one outcome of the Bayesian formulation of the classification problem above. We can also read off the probabilities of the other classes, the odds (or log odds) of the data belonging to one class rather than another, etc. We will come back to this below.
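
To illustrate what is lost, here is a small sketch with a hypothetical posterior in which two classes are almost equally probable. The MAP rule reports only the winner, while the log odds and the entropy show how close the call was.

```python
import numpy as np

# Hypothetical posterior over three classes for one observation.
posterior = np.array([0.48, 0.45, 0.07])

map_class = posterior.argmax()                        # all that the MAP rule reports
log_odds_0_vs_1 = np.log(posterior[0] / posterior[1])  # near 0: classes 0 and 1 are close
entropy = -(posterior * np.log(posterior)).sum()       # in nats; higher means less certain

print(map_class, log_odds_0_vs_1, entropy)
```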

### Modelling the class likelihoods

How do we go about modelling the class likelihoods $p(\* x \given C_k)$?

In relatively low-dimensional problems, we can infer these using **density estimation** (for example, a histogram-smoothing procedure like kernel density estimation).
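
A minimal sketch of this idea using scikit-learn's `KernelDensity`, fitted to synthetic one-dimensional samples that stand in for the training rows of a single class. The data and the bandwidth value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic 1-D samples standing in for the training rows of one class C_k.
X_class = rng.normal(loc=2.0, scale=0.5, size=(200, 1))

# The bandwidth is a tuning parameter; 0.2 is an arbitrary illustrative choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X_class)

# score_samples returns log-densities, i.e. an estimate of log p(x | C_k).
x_query = np.array([[2.0], [5.0]])
log_density = kde.score_samples(x_query)
print(log_density)
```

The query point near the bulk of the samples receives a much higher log-density than the one far away.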

If we have many features (if $m$ is large), our models for the likelihoods $p(\*x \given C_k)$ will be high-dimensional. Even with "big data" ($n \approx 10^{12}$ or $n \approx 10^{15}$), the number of data points falling in each hypercube of feature space decreases exponentially with $m$, so direct density estimation becomes unreliable.
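
A quick back-of-the-envelope sketch, assuming each feature is split into 10 bins (an arbitrary choice), so that feature space is divided into $10^m$ hypercubes:

```python
# Assumed setup: 10 bins per feature, hence 10**m hypercubes in total.
n = 10**12              # number of data points ("big data")
bins_per_feature = 10

# Average number of data points per hypercube for several values of m.
avg_points = {m: n / bins_per_feature**m for m in (2, 6, 12, 20)}

for m, avg in avg_points.items():
    print(f"m = {m:2d}: about {avg:g} points per hypercube")
```

Under this assumption, twelve features already leave about one point per hypercube on average, and with twenty features almost every hypercube is empty.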

### Exercise: a full Bayesian classifier based on scikit-learn

Here we will sketch an implementation of a full Bayesian model for classification. This would suit low-dimensional continuous or discrete problems.

### Density estimation

Scikit-Learn has a few algorithms for density estimation:
- kernel density
- Gaussian mixtures
- etc.

Any of these can provide one part of the full Bayesian model: the likelihood.

In [None]:
from sklearn.neighbors import KernelDensity

In [None]:
kde = KernelDensity()

In [None]:
kde.fit(X)

In [None]:
kde.__class__.__bases__

In [None]:
from sklearn.base import BaseEstimator

### Exercise: look at the help for sklearn's `BaseEstimator` interface

## Classification

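Classifiers in scikit-learn have a different interface from density estimators: `fit` takes the class labels `y` as well as `X`, and `predict` returns a class label for each row.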

In [None]:
from sklearn.linear_model import LogisticRegression   # a classifier!

In [None]:
def non_special(method_name):
    return not(method_name.startswith('__') and method_name.endswith('__'))

list(filter(non_special, dir(LogisticRegression)))

In [None]:
clf = LogisticRegression()

In [None]:
clf.fit(X, y)

In [None]:
LogisticRegression.predict_proba??

In [None]:
clf.predict_log_proba(X).shape

In [None]:
X.shape

In [None]:
clf.predict??

In [None]:
clf.predict(X)

In [None]:
LogisticRegression.score??

In [None]:
from sklearn.svm import SVC

In [None]:
SVC.score??

In [None]:
# In recent scikit-learn releases BaseNB is private: import `_BaseNB as BaseNB` instead.
from sklearn.naive_bayes import GaussianNB, BaseNB

In [None]:
BaseNB??

In [None]:
from sklearn.utils.multiclass import unique_labels
from sklearn.utils import check_array, check_X_y

In [None]:
u = unique_labels([1, 2, 1, 3, 5])

In [None]:
find([1, 2, 1, 3, 5], u)
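
The `find` helper called above is not defined anywhere in this notebook. The following is one possible sketch of it, under the assumption that it should return, for each label, the positions at which that label occurs. Both the signature and the behaviour are guesses made for illustration.

```python
import numpy as np

def find(values, labels):
    """Hypothetical helper: for each label, the positions in `values` equal to it."""
    values = np.asarray(values)
    return [np.flatnonzero(values == label) for label in labels]

u = np.array([1, 2, 3, 5])   # what unique_labels([1, 2, 1, 3, 5]) returns
indices = find([1, 2, 1, 3, 5], u)
print(indices)
```

Indexing a feature matrix with `X[indices[c]]` would then select the rows belonging to the `c`-th class.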

## Extended exercise: implement a `BayesClassifier` class that inherits from `BaseNB`

### Solution skeleton

In [None]:
import numpy as np

class BayesClassifier(BaseNB):
    r"""
    p(C | X) \propto p(X | C) p(C)

    Performs classification using MAP and
    also uncertainty estimation based on the full model.

    Has two components:
    1. density_estimator: p(X | C). A scikit-learn estimator class.
       This will be instantiated for each class.
       It must have a 'score_samples' method that returns log-densities.

    2. priors: ndarray of length len(C), ordered like the sorted class labels.
    """
    def __init__(self, density_estimator, priors):
        assert hasattr(density_estimator, 'score_samples')
        self.density_estimator = density_estimator
        self.priors = priors

    def fit(self, X, y):
        """
        Fit a density model for each class.
        """
        X, y = check_X_y(X, y, ensure_min_samples=2, estimator=self)
        self.classes_ = unique_labels(y)
        self.class_prior_ = np.asarray(self.priors, dtype=float)
        assert len(self.class_prior_) == len(self.classes_)

        self.conditional_models_ = []
        for c in self.classes_:
            rows = X[y == c]                  # training rows with label c
            est = self.density_estimator()    # a fresh model for this class
            est.fit(rows)
            self.conditional_models_.append(est)
        return self

    def _check_X(self, X):
        # Needed by the private _BaseNB base class in recent scikit-learn releases.
        return check_array(X)

    def _joint_log_likelihood(self, X):
        """Compute the unnormalized posterior log probability of X

        I.e. ``log P(c) + log P(x|c)`` for all rows x of X, as an array of
        shape [n_samples, n_classes].

        Input is passed to _joint_log_likelihood as-is by predict,
        predict_proba and predict_log_proba.
        """
        X = check_array(X)
        joint_log_likelihood = []
        for i in range(np.size(self.classes_)):
            log_p_c = np.log(self.class_prior_[i])
            log_p_x_given_c = self.conditional_models_[i].score_samples(X)
            joint_log_likelihood.append(log_p_c + log_p_x_given_c)

        joint_log_likelihood = np.array(joint_log_likelihood).T
        return joint_log_likelihood

### Exercise: Try this out with e.g. the iris data or diabetes data

In [2]:
from sklearn.datasets import load_iris, load_diabetes

iris = load_iris()

In [None]:
print(iris['DESCR'])

In [None]:
iris.keys()

In [None]:
X = iris.data
y = iris.target

In [None]:
iris['feature_names']
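
One way to try the idea on the iris data without relying on any base class is sketched below: fit one `KernelDensity` per class, add the log-priors, and apply the MAP rule. The empirical priors, the bandwidth of 0.5 and the variable names (`models`, `log_joint`, `log_post`) are assumptions made for this sketch, not part of the exercise solution.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

iris = load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# Empirical class frequencies as priors p(C_k) (an assumption; any prior could be used).
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# One density model per class; bandwidth 0.5 is an arbitrary illustrative value.
models = [KernelDensity(bandwidth=0.5).fit(X[y == c]) for c in classes]

# Columns hold log p(C_k) + log p(x | C_k); shape (n_samples, n_classes).
log_joint = np.column_stack(
    [lp + m.score_samples(X) for lp, m in zip(log_prior, models)]
)

# Normalize each row to get log-posteriors, then apply the MAP rule.
log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
y_hat = classes[log_joint.argmax(axis=1)]

print("training accuracy:", (y_hat == y).mean())
```

The accuracy printed here is measured on the training data, so it is optimistic. A held-out split would give a fairer estimate.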

In [None]:
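# fetch_mldata has been removed from scikit-learn; use the bundled loader imported above.
# Note: its target is continuous, so it must be binned before use as class labels.
diabetes = load_diabetes()

In [None]:
diabetes.keys()

In [None]:
print(diabetes['feature_names'])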

### Simplifying assumption for higher dimensions: "Naive" Bayes

A family of classifiers known as **Naive Bayes** simplifies the likelihood model (at the expense of its expressive power) by making the (strong) simplifying assumption of conditional independence:

$$
p(\*x \given C_k) \approx \prod_{i=1}^m p(x_i \given C_k)
$$

In addition, further simplifying assumptions are sometimes made for various problems:
- that discrete binary values (booleans) are generated from **Bernoulli** trials.
- that discrete values representing frequencies of events are distributed according to a **multinomial** distribution (a generalization of the Bernoulli and binomial distributions).
- that continuous values associated with each class are **normally** distributed ("Gaussian").
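
As a sketch of the Gaussian case, the conditional-independence assumption means each feature gets its own one-dimensional normal model, and the class log-likelihood is the sum of the per-feature log-densities. The data below is synthetic, and `gaussian_log_pdf` is a helper written for this illustration.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log-density of independent normal distributions, evaluated elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
# Synthetic training rows for one class C_k with m = 3 features.
X_class = rng.normal(loc=[0.0, 1.0, -2.0], scale=[1.0, 0.5, 2.0], size=(500, 3))

# Fit each feature separately: one mean and one variance per feature.
mean = X_class.mean(axis=0)
var = X_class.var(axis=0)

x = np.array([0.1, 0.9, -1.5])
# Conditional independence turns the product over features into a sum of logs.
log_lik = gaussian_log_pdf(x, mean, var).sum()
print(log_lik)
```

Only $2m$ parameters per class are estimated here, compared with the $m + m(m+1)/2$ needed for a full multivariate Gaussian.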


#### Recall what these assumptions mean:

These assumptions correspond to certain choices of **prior information** expressed as feature constraints in exponential-family models:

- Bernoulli: constraint $E(K) = p$ for sample space $k \in \{0,1\}$.
- Gaussian: constraints on $E(X)$ and $E(X^2)$
- etc.
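
In general, the maximum-entropy distribution subject to constraints $E[f_j(X)] = c_j$ has the exponential-family form below, where the $\lambda_j$ are Lagrange multipliers fixed by the constraints and $Z$ normalizes the distribution. The symbols $f_j$, $c_j$, $\lambda_j$ and $Z$ are notation introduced here for illustration.

$$
p(x) = \frac{1}{Z(\lambda)} \exp\left( \sum_j \lambda_j f_j(x) \right)
$$

Taking $f_1(x) = x$ and $f_2(x) = x^2$ on the real line gives the Gaussian; taking $f(k) = k$ on $\{0, 1\}$ gives the Bernoulli distribution.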

See https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution