In [None]:
import numpy as np

### 1. Logistic and probit regression. Bayesian logistic regression. Laplace approximation

#### 1.1 Classification problem
The natural starting point for discussing approaches to classification is the joint probability $p(y, \mathbf{x})$, where $y$ denotes the class label. Using Bayes' theorem this joint probability can be decomposed either as $p(y) p(\mathbf{x}|y)$ or as $p(\mathbf{x}) p(y|\mathbf{x})$. This gives rise to two different approaches to classification problems. The first, which we call the generative approach, models the class-conditional distributions $p(\mathbf{x}|y)$ for $y=\mathcal{C}_1, \ldots, \mathcal{C}_C$ and also the prior probabilities of each class, and then computes the posterior probability for each class using

$$
p(y|\mathbf{x})=\frac{p(y) p(\mathbf{x}|y)}{\sum_{c=1}^C p\left(\mathcal{C}_c\right) p\left(\mathbf{x}|\mathcal{C}_c\right)}
$$
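
As a minimal sketch of the generative route, the snippet below evaluates this posterior for a toy problem. The univariate Gaussian class-conditionals, the equal priors, and the names `gaussian_pdf` and `class_posterior` are assumptions made for illustration only; they do not come from the text above.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def class_posterior(x, priors, means, stds):
    """Return p(C_c | x) for every class c via Bayes' theorem."""
    priors = np.asarray(priors, dtype=float)
    # Class-conditional densities p(x | C_c), one per class (assumed Gaussian).
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = priors * likelihoods   # numerator: p(C_c) p(x | C_c)
    return joint / joint.sum()     # denominator: sum over all classes

# Hypothetical two-class example with equal priors and symmetric class means.
posterior = class_posterior(x=0.0, priors=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])
print(posterior)
```

Because the query point lies midway between two otherwise identical classes, both posterior probabilities should come out equal.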


The alternative approach, which we call the discriminative approach, focusses on modelling $p(y|\mathbf{x})$ directly.  
To turn both the generative and discriminative approaches into practical methods we will need to create models for either $p(\mathbf{x}|y)$, or $p(y|\mathbf{x})$ respectively.
#### 1.2 Logistic and probit regression
##### 1.2.1 Modeling
For the binary discriminative case one simple idea is to turn the output of a regression model into a class probability using a response function (the inverse of a link function), which "squashes" its argument, which can lie in the domain $(-\infty, \infty)$, into the range $[0,1]$, guaranteeing a valid probabilistic interpretation.

One example is the **linear logistic regression** model

$$
p\left(y=1|\mathbf{x}, \boldsymbol{\theta}\right)=\lambda\left(\mathbf{x}^{\top} \boldsymbol{\theta}\right), \quad \text { where } \lambda(z)=\frac{1}{1+\exp (-z)}, \quad \mathbf{x},\boldsymbol{\theta}\in \mathbb{R}^D,\, y\in\{0,1\}
$$

which combines the linear model with the logistic response function. Another common choice of response function is the cumulative distribution function of a standard normal distribution, $\Phi(z)=\int_{-\infty}^z \mathcal{N}(x|0,1) d x$. This approach is known as **probit regression**.
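
The two response functions can be compared numerically with a short sketch. The function names `logistic` and `probit` are chosen here for illustration, and $\Phi$ is evaluated through the standard identity $\Phi(z)=\tfrac{1}{2}\left(1+\operatorname{erf}(z/\sqrt{2})\right)$ rather than by numerical integration.

```python
import math
import numpy as np

def logistic(z):
    """Logistic response: lambda(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# math.erf works on scalars only, so wrap it to accept arrays.
_erf = np.vectorize(math.erf)

def probit(z):
    """Probit response: standard normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

z = np.linspace(-4.0, 4.0, 9)
print(np.round(logistic(z), 4))
print(np.round(probit(z), 4))
```

Both functions are monotone and equal $0.5$ at $z=0$; the probit curve approaches 0 and 1 faster because the Gaussian tails decay more quickly than the logistic ones.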

We are given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ of i.i.d. samples $(\mathbf{x}_i, y_i)\sim p(\mathbf{x}, y)$.
Assuming that $\mathbf{x}$ is a random variable uniformly distributed on a finite support, so that its density does not depend on $\boldsymbol{\theta}$, we can write down the probability of the observed data given the parameters $\boldsymbol{\theta}$, i.e. the *likelihood*:

$$
\mathcal{L}(\boldsymbol{\theta})=p(\mathcal{D}|\boldsymbol{\theta}) = p((\mathbf{x}_1, y_1),\dots,(\mathbf{x}_n, y_n)|\boldsymbol{\theta}) \underset{\text{iid}}{=} \prod_{i=1}^n p(\mathbf{x}_i,y_i|\boldsymbol{\theta}) = \prod_{i=1}^n p(y_i|\mathbf{x}_i, \boldsymbol{\theta})p(\mathbf{x}_i) = C\prod_{i=1}^n p(y_i|\mathbf{x}_i, \boldsymbol{\theta}), \quad \text{where } C = c^n \text{ and } c \text{ is the constant density of } \mathbf{x}_i
$$

In the linear logistic regression approach, we model the class probabilities as follows:  
$p(y=1|\mathbf{x}_i, \boldsymbol{\theta}) = \lambda\left(\mathbf{x}_i^{\top} \boldsymbol{\theta}\right)\equiv p_i$,  
$p(y=0|\mathbf{x}_i,\boldsymbol{\theta}) = 1-\lambda\left(\mathbf{x}_i^{\top} \boldsymbol{\theta}\right)\equiv 1 - p_i$

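With this model and notation, the two cases combine into a single factor $p_i^{y_i}(1-p_i)^{1-y_i}$, which equals $p_i$ when $y_i=1$ and $1-p_i$ when $y_i=0$, so the likelihood becomes: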
$$
\mathcal{L}(\boldsymbol{\theta}) = C\prod_{i=1}^n p^{y_i}_i(1-p_i)^{1-y_i}
$$ 
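
The compact product form can be checked numerically. The sketch below uses a small made-up design matrix, labels, and parameter vector (all hypothetical) and drops the constant $C$, since it does not depend on $\boldsymbol{\theta}$.

```python
import numpy as np

def likelihood(theta, X, y):
    """Bernoulli likelihood prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}, with the constant C omitted."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # p_i = lambda(x_i^T theta)
    return np.prod(p ** y * (1.0 - p) ** (1 - y))

# Hypothetical toy data: three samples, two features (the first acts as a bias term).
X = np.array([[1.0, 0.5],
              [1.0, -1.0],
              [1.0, 2.0]])
y = np.array([1, 0, 1])
theta = np.array([0.0, 1.0])

# Picking p_i or 1 - p_i by hand according to the label gives the same value.
p = 1.0 / (1.0 + np.exp(-X @ theta))
selected = np.where(y == 1, p, 1.0 - p)

lik = likelihood(theta, X, y)
print(lik, np.prod(selected))
```

For longer datasets the raw product underflows quickly, which is one practical reason to work with the log-likelihood instead.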

##### 1.2.2 MLE approach
We can assume that the parameters $\boldsymbol{\theta}$ are deterministic but unknown, and we want to find the best-suited value $\boldsymbol{\theta}^*$. Under the assumptions above, a reasonable choice is the Maximum Likelihood Estimate (MLE):

$$
\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\text{argmax }}\mathcal{L}(\boldsymbol{\theta})
$$
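which yields the parameter value under which the observed data are most probable. Because the logarithm is strictly increasing and the constant $C$ does not depend on $\boldsymbol{\theta}$, maximizing the likelihood is equivalent to maximizing the sum of per-sample log-probabilities: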

$$
\underset{\boldsymbol{\theta}}{\text{argmax }}\mathcal{L}(\boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\text{argmax }}\log p(\mathcal{D}|\boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\text{argmax }} \sum^n_{i=1} \left[ y_i\log p_i + (1-y_i)\log (1-p_i) \right]
$$
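
This objective has no closed-form maximizer, so in practice it is optimized numerically. The sketch below uses plain gradient ascent, which is only one possible optimizer and is not prescribed by the text; it relies on the standard gradient $\nabla_{\boldsymbol{\theta}} \log \mathcal{L} = \sum_{i=1}^n (y_i - p_i)\,\mathbf{x}_i$. The synthetic data, the step size `lr`, the iteration count `n_steps`, and the function names are all illustrative assumptions.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ], computed stably."""
    z = X @ theta
    # log p_i = -log(1 + exp(-z_i)) and log(1 - p_i) = -log(1 + exp(z_i)).
    return np.sum(-y * np.logaddexp(0.0, -z) - (1.0 - y) * np.logaddexp(0.0, z))

def fit_mle(X, y, lr=0.5, n_steps=2000):
    """Gradient ascent on the log-likelihood (assumed optimizer, fixed step size)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        theta += lr * grad / len(y)     # average over samples to keep the step scale stable
    return theta

# Synthetic data: a bias column plus one Gaussian feature, labels drawn from the model.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-0.5, 2.0])      # hypothetical ground-truth parameters
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

theta_hat = fit_mle(X, y)
print("estimate:", theta_hat)
print("log-likelihood at estimate:", log_likelihood(theta_hat, X, y))
```

With noisy labels like these the maximizer is finite; on linearly separable data the log-likelihood has no finite maximum and the iterates would keep growing, so this sketch should not be used there without a stopping rule or regularization.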

### 2. Relevance vector machine (RVM)