In [2]:
import numpy as np
import matplotlib.pyplot as plt

# Show that the classification problem leads to P(Ck|x) = softmax with a(k) = ln P(x|Ck)P(Ck)
# Discuss generative (model the process that generates the data) vs discriminative (logistic regression here) classification
# Show that the number of parameters to fit is not the same in each case (quadratic in the dimension for generative)

# General approach
---

* Discriminative
* Generative

# Generative
---

In the generative approach, we try to find an appropriate data generation process for the different classes we are interested in:

&emsp; $p(x|C_k), \forall k$.

Then we use these **forward probabilities** (probability of the effect given the cause) to compute the **backward probabilities** (probability of the cause given the effect; in our case, the probability that point $x$ belongs to class $C_k$), using Bayes' formula:

&emsp; $p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_i p(x|C_i)p(C_i)}$

Then we can take the $C_k$ with the highest probability, which corresponds to MAP (Maximum A Posteriori) and not ML (Maximum Likelihood). A full Bayesian treatment would also include a prior on the parameters (discussed later).

### Motivation

The motivation for doing this is that the forward probabilities are usually:

* more stable than backward probabilities, which depend on other factors such as the prior of each class
* more intuitive for humans than backward probabilities, because they correspond to the causal direction

For instance, in the case of diagnosing a disease:

* forward probabilities are the false positive and false negative rates of the diagnostic tool
* backward probabilities give the probability of having the disease given the result of the diagnostic
* backward probabilities (evidence => cause) are what we want, but they depend on the prior $P(disease)$
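
The dependence on the prior can be made concrete with a small sketch. The sensitivity, false positive rate and priors below are hypothetical numbers chosen for illustration, and `posterior_disease` is a helper name introduced here, not something defined elsewhere in this notebook.

```python
def posterior_disease(prior, sensitivity, false_positive_rate):
    """P(disease | positive result), computed with Bayes' formula."""
    # Forward probabilities: P(positive | disease) and P(positive | healthy)
    joint_disease = sensitivity * prior
    joint_healthy = false_positive_rate * (1.0 - prior)
    return joint_disease / (joint_disease + joint_healthy)

# Hypothetical diagnostic tool: same forward probabilities, two different priors
rare = posterior_disease(prior=0.001, sensitivity=0.99, false_positive_rate=0.05)
common = posterior_disease(prior=0.2, sensitivity=0.99, false_positive_rate=0.05)

print(f"P(disease) = 0.1% -> P(disease | positive) = {rare:.3f}")
print(f"P(disease) = 20%  -> P(disease | positive) = {common:.3f}")
```

The forward probabilities are identical in both calls; only the prior changes, and the backward probability changes with it.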

### Disadvantages

The big disadvantage is that forward probabilities can be quite complex to estimate and usually require many more parameters. For instance, if $x$ is $M$-dimensional, fitting a Gaussian for a single class requires:

* up to $\frac{M(M+1)}{2}$ parameters for the covariance matrix
* up to $M$ parameters for the mean

In comparison, a simple two-class logistic regression model only needs to fit about $M$ parameters (one weight per dimension, plus a bias).
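
To get a feel for the gap, here is a rough count of parameters for both approaches. It is a sketch under stated assumptions: one full covariance matrix per class unless `shared_covariance=True`, $K - 1$ free class priors, and a bias term counted for logistic regression. Both function names are made up for this illustration.

```python
def gaussian_generative_params(M, K, shared_covariance=False):
    """Parameters of K Gaussian class-conditional densities in M dimensions."""
    mean_params = K * M                    # one mean vector per class
    cov_params_one = M * (M + 1) // 2      # a symmetric M x M matrix
    cov_params = cov_params_one if shared_covariance else K * cov_params_one
    prior_params = K - 1                   # the class priors sum to 1
    return mean_params + cov_params + prior_params

def logistic_regression_params(M, K=2):
    """Parameters of a (multiclass) logistic regression, bias included."""
    # K - 1 weight vectors are enough: one class serves as the reference
    return (K - 1) * (M + 1)

for M in (2, 10, 100):
    print(M, gaussian_generative_params(M, K=2), logistic_regression_params(M))
```

The generative count grows quadratically with $M$, while the discriminative count grows linearly.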

### Softmax, and Linear models

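Following our generative perspective, assume that $p(x|C_k)$ is Gaussian: $p(x|C_k) = \frac{1}{(2 \pi)^{D/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $D$ is the dimension of $x$ (called $M$ above), and $\mu$ and $\Sigma$ are the mean and covariance matrix of class $C_k$ (the class index is omitted to keep the notation light).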

Thus we can rewrite Bayes' rule like this:

&emsp; $p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_i p(x|C_i)p(C_i)} = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$

Where we have:

&emsp; $a_k = \log p(x|C_k)p(C_k) = \log p(C_k) - \frac{1}{2} \log ((2 \pi)^D |\Sigma|) -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)$

&emsp; $a_k = \log p(C_k) - \frac{D}{2} \log 2 \pi - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x^T \Sigma^{-1} x + \mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} x)$
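
As a numerical sanity check, the sketch below computes $a_k = \log p(x|C_k)p(C_k)$ for two Gaussian classes and verifies that the softmax of the $a_k$ matches Bayes' formula. The priors, means, covariances and the test point are arbitrary values chosen for illustration, and the helper names are not part of the original notebook.

```python
import numpy as np

def gaussian_log_pdf(x, mu, Sigma):
    """Log-density of a multivariate Gaussian at point x."""
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)           # log |Sigma|
    maha = diff @ np.linalg.solve(Sigma, diff)     # (x - mu)^T Sigma^-1 (x - mu)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

def softmax(a):
    """Softmax, shifted by max(a) for numerical stability."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Hypothetical two-class problem in 2 dimensions
priors = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
x = np.array([1.0, 0.5])

# a_k = log p(C_k) + log p(x | C_k)
a = np.array([np.log(priors[k]) + gaussian_log_pdf(x, mus[k], Sigmas[k])
              for k in range(2)])
posterior_softmax = softmax(a)

# Direct application of Bayes' formula on the joint probabilities
joint = np.exp(a)
posterior_bayes = joint / joint.sum()

print(posterior_softmax, posterior_bayes)
```

Working with $a_k$ in log space and subtracting the maximum before exponentiating avoids underflow when the densities are very small, which is one practical reason to prefer the softmax form.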

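We can simplify by dropping every term that does not depend on $k$: such terms are common factors of the numerator and the denominator of the softmax, so they cancel out. The term $-\frac{D}{2} \log 2 \pi$ always cancels. If we further assume that all classes share the same covariance matrix $\Sigma$, then $-\frac{1}{2} \log |\Sigma|$ and the quadratic term $-\frac{1}{2} x^T \Sigma^{-1} x$ cancel as well, which leaves: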

&emsp; $a_k = \log p(C_k) - \frac{1}{2}(\mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} x)$

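We therefore get a form that is linear in $x$: $a_k = w_k^T x + b_k$, with $w_k = \Sigma^{-1} \mu$ and $b_k = \log p(C_k) - \frac{1}{2} \mu^T \Sigma^{-1} \mu$ ($\mu$ being the mean of class $C_k$). Plugging it back into the softmax function, and absorbing the bias $b_k$ into $w_k$ by appending a constant 1 to $x$, we finally get:

&emsp; $p(C_k|x) = \frac{\exp(w_k^T x)}{\sum_i \exp(w_i^T x)}$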
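
The reduction to a linear model can be checked numerically. In the sketch below (random, purely illustrative means, priors and shared covariance), the weights are $w_k = \Sigma^{-1} \mu_k$ and the constant part of $a_k$ is kept as an explicit bias $b_k = \log p(C_k) - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k$, which can equivalently be absorbed into $w_k$ by appending a constant 1 to $x$. Variable names are assumptions made for this example.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
D, K = 3, 4
mus = rng.normal(size=(K, D))            # one mean per class (rows)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)          # shared, symmetric positive definite
priors = rng.dirichlet(np.ones(K))       # random class priors summing to 1
x = rng.normal(size=D)

Sigma_inv = np.linalg.inv(Sigma)

# Linear parameters: row k of W is w_k^T = mu_k^T Sigma^-1
W = mus @ Sigma_inv
b = np.log(priors) - 0.5 * np.einsum("kd,kd->k", W, mus)
linear_posterior = softmax(W @ x + b)

# Reference: full quadratic form (terms common to all classes are omitted,
# since they cancel in the softmax)
diffs = x - mus
log_joint = np.log(priors) - 0.5 * np.einsum("kd,de,ke->k", diffs, Sigma_inv, diffs)
full_posterior = softmax(log_joint)

print(np.allclose(linear_posterior, full_posterior))
```

The two posteriors agree only because the covariance is shared; with one covariance matrix per class the quadratic term in $x$ no longer cancels and the decision boundaries become quadratic.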

# Discriminative
---

Instead of modeling $p(x|C_k)$, fit the parameters $w$ of the softmax formula directly by maximum likelihood on the labeled data: this gives logistic regression.
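
A minimal sketch of this idea, assuming plain gradient ascent on the mean log-likelihood and a small synthetic data set: the function names, learning rate, number of steps and data are all illustrative choices, not a reference implementation.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax, shifted for numerical stability."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, K, lr=0.1, n_steps=500):
    """Maximum likelihood fit of p(C_k|x) = softmax(w_k^T x) by gradient ascent."""
    N, D = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])   # constant 1 absorbs the bias
    T = np.eye(K)[y]                       # one-hot targets, shape (N, K)
    W = np.zeros((K, D + 1))
    for _ in range(n_steps):
        P = softmax_rows(Xb @ W.T)         # current p(C_k | x_n)
        grad = (T - P).T @ Xb / N          # gradient of the mean log-likelihood
        W += lr * grad
    return W

# Synthetic data: two well-separated Gaussian blobs (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], size=(100, 2)),
               rng.normal(loc=[2.0, 0.0], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

W = fit_softmax_regression(X, y, K=2)
Xb = np.hstack([X, np.ones((len(X), 1))])
pred = softmax_rows(Xb @ W.T).argmax(axis=1)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Note that nothing here estimates means or covariances: only the weights of the linear form are fitted, which is what distinguishes the discriminative approach from the generative one.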