# <span style='font-family:Marker Felt;font-weight:bold'>LINEAR MODELS FOR CLASSIFICATION</span>

---

In this chapter we start to consider the problem of classification, in which, unlike in the previous chapter, the target variable is discrete. For the moment only the basic algorithms are presented, leaving more advanced methods for the next chapters.

## <span style='font-family:Marker Felt;font-weight:bold'>Linear classification</span>

---

The problem is similar to the one in the previous notebook, but in this case the outputs are classes rather than continuous variables. In classification, given an input $x$ we want to assign it to one of $K$ classes $\mathcal C_k$, where $k=1,...,K$. In this setting we assume there is no noise in the output: an apple is an apple no matter what uncertainty we have in the experiment. For example, given the pixels of an X-ray image as input, we want to say whether there is cancer or not. In principle we can use the same techniques used for regression but, as we will see, they lead to poor solutions.

In linear classification, the input space is divided into decision regions whose boundaries are called **decision boundaries**. In this chapter we focus only on linear models for classification, which work with datasets whose classes can be separated by linear decision surfaces. Let's consider the linear regression model:

$$
\begin{align}
y(\textbf x, \textbf w) = w_0 + \sum_{j=1}^{D} w_j x_j = \textbf x^T \textbf w + w_0 
\end{align}
$$

In classification we work with categorical variables such as *male* or *female*. We can convert these classes into numbers: for example, the classes *male* and *female* can be encoded as $0$ and $1$. The problem is that the output of the linear model is not restricted to $\{0, 1\}$ but ranges over $(-\infty, +\infty)$, which is incompatible with our class labels. In this case it is better to use a **generalized linear model**, in which the output is not directly the weighted sum of the inputs but the result of a nonlinear activation function applied to this sum.

$$
\begin{align}
y(\textbf x, \textbf w) = f(\textbf x^T \textbf w + w_0)
\end{align}
$$

The model $y$ is no longer linear in the parameters, since they pass through a nonlinear function. Why do we still call this model linear? We can answer this question by looking at the shape of the boundaries that separate the classes in the input space. The **decision surfaces** are obtained by setting:

$$
\begin{align}
y(\textbf x, \textbf w) = const 
\end{align}
$$

For an invertible $f$ this is equivalent to $\textbf x^T \textbf w + w_0 = $ const, so these surfaces are linear functions of the inputs: they correspond to hyperplanes. As we did for linear regression, here too we can transform the input space with basis functions, but we'll see this later.

In classification, there are different ways of representing the target values. In the two-class problem we have a binary target $t \in \{0, 1 \}$ and we can think of the output as a probability, where $t=1$ means that the class is $\mathcal C_1$ and $t=0$ means that the class is $\mathcal C_2$. This is the point of view of probabilistic models. If there are $K > 2$ classes, we can use the $1$-of-$K$ coding scheme, in which $\textbf t$ is a vector of length $K$ that contains $1$ for the correct class and $0$ for the others. For example, if we have $5$ classes and the correct class is $\mathcal C_5$, then the vector $\textbf t$ will be $(0, 0, 0, 0, 1)^T$. In this case, we can interpret the vector $\textbf t$ as a probability distribution over classes.
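As a minimal sketch of the $1$-of-$K$ coding scheme, the helper below builds the target matrix from integer labels. The function name `one_hot` and the convention that class $\mathcal C_k$ is stored as the zero-based index $k-1$ are assumptions made for this illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an (N, K) matrix whose n-th row is the 1-of-K code of labels[n].

    Assumption: labels are integers in 0..K-1, so class C_k has index k-1.
    """
    labels = np.asarray(labels)
    T = np.zeros((labels.size, num_classes))
    T[np.arange(labels.size), labels] = 1.0   # one entry per row set to 1
    return T

# Class C_5 out of 5 classes corresponds to index 4.
T = one_hot([4, 0, 2], num_classes=5)
print(T[0])  # [0. 0. 0. 0. 1.]
```

Each row sums to one, which is what allows it to be read as a (degenerate) probability distribution over classes.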

As already said in a previous chapter, there are three different approaches to classification. The simplest approach makes use of a **discriminant function**, that is a function that directly maps each input to a specific class. In a **probabilistic** approach we model the conditional probability distribution $p(\mathcal C_k | \textbf x)$ and use it to make optimal decisions. There are two alternatives in the probabilistic view.

* The *discriminative* approach models the conditional probability directly, for example with a parametric model whose parameters are optimized on a training set.

* The *generative* approach first models the class-conditional densities $p(\textbf x | \mathcal C_k)$ together with the class priors $p(\mathcal C_k)$ and then applies Bayes' rule:

$$
\begin{align}
p(\mathcal C_k | \textbf x) = \frac{p(\textbf x | \mathcal C_k)p(\mathcal C_k)}{p(\textbf x)} 
\end{align}
$$
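To make the generative route concrete, here is a small sketch that applies Bayes' rule to one-dimensional inputs. The Gaussian form of the class-conditional densities, the function names and all numeric values are assumptions chosen for illustration; the text above does not prescribe them.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as an assumed p(x | C_k)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior(x, means, stds, priors):
    """Return p(C_k | x) for every class k via Bayes' rule."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * np.asarray(priors)   # p(x | C_k) p(C_k)
    return joint / joint.sum()                 # normalise by p(x)

# Hypothetical two-class problem with equal priors and unit variances.
post = posterior(0.5, means=[0.0, 1.0], stds=[1.0, 1.0], priors=[0.5, 0.5])
print(post)
```

The evidence $p(\textbf x)$ is obtained by summing the joint terms over the classes, so the posteriors always sum to one. In this symmetric example the point $x = 0.5$ is equidistant from the two means, so both posteriors are equal.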

## <span style='font-family:Marker Felt;font-weight:bold'>Discriminant functions</span>

---

Let's start considering the simplest approach in which we build a function that directly takes an input vector $\textbf x$ and assigns it to one of the $K$ classes.

### <span style='font-family:Marker Felt;font-weight:bold'>Two classes</span>

The simplest discriminant function is obtained by taking a linear combination of the inputs as follows:

$$
\begin{align}
y(\textbf x) = \textbf w^T \textbf x + w_0 
\end{align}
$$

This model assigns an input $\textbf x$ to the class $\mathcal C_1$ if $y(\textbf x) \geq 0$ and to the class $\mathcal C_2$ otherwise. The decision boundary is therefore defined by $y(\textbf x) = 0$, which corresponds to a $(D-1)$-dimensional hyperplane in the $D$-dimensional input space.

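It is possible to show that the weight vector $\textbf w$ is orthogonal to the decision surface. Taking two different points $\textbf x_A$ and $\textbf x_B$ on the decision boundary, we have $y(\textbf x_A) = y(\textbf x_B) = 0$, and subtracting the two expressions gives $\textbf w^T(\textbf x_A - \textbf x_B) = 0$: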

$$
\begin{align}
y(\textbf x_A) = y(\textbf x_B) = 0 
\end{align}
$$

$$
\begin{align}
\textbf w^T(\textbf x_A - \textbf x_B) = 0
\end{align}
$$

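The last equation is satisfied only if the weight vector is orthogonal to the line connecting the two points, and hence to every vector lying in the decision surface. We can also show that $w_0$ determines the location of the decision surface. For a point $\textbf x$ on the decision surface we have $y(\textbf x) = 0$, so the normal distance from the origin to the surface is: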

$$
\begin{align}
\frac{\textbf w^T \textbf x}{\left\lVert \textbf w \right\rVert_2} = - \frac{w_0}{\left\lVert \textbf w \right\rVert_2} 
\end{align}
$$

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_1.png?raw=1" width="300">
</p>
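The two geometric properties above can be checked numerically. The sketch below uses hypothetical values for $\textbf w$ and $w_0$; the helper `signed_distance` relies on the standard result that $y(\textbf x) / \lVert \textbf w \rVert_2$ is the signed distance of $\textbf x$ from the decision surface, which is not derived in these notes.

```python
import numpy as np

def discriminant(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

def signed_distance(x, w, w0):
    """Signed perpendicular distance of x from the surface y(x) = 0."""
    return discriminant(x, w, w0) / np.linalg.norm(w)

# Hypothetical parameters: the boundary is the line 3*x1 + 4*x2 - 5 = 0.
w = np.array([3.0, 4.0])
w0 = -5.0

# Two points chosen to lie on the boundary.
x_a = np.array([3.0, -1.0])
x_b = np.array([-1.0, 2.0])

print(discriminant(x_a, w, w0), discriminant(x_b, w, w0))  # both are 0
print(w @ (x_a - x_b))                 # 0: w is orthogonal to the boundary
print(-w0 / np.linalg.norm(w))         # 1.0: distance of the boundary from the origin
```

The sign of `signed_distance` tells on which side of the boundary a point lies, which is exactly the rule used to assign it to $\mathcal C_1$ or $\mathcal C_2$.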

### <span style='font-family:Marker Felt;font-weight:bold'>Multiple classes</span>

To extend our model to the case of $K>2$ classes we have different possibilities.

Consider the case of *one-versus-the-rest* classifiers in which we build $K-1$ classifiers, each of which solves a two-class problem separating the class $\mathcal C_k$ from the others. An alternative is to use *one-versus-one* classifiers in which the $K(K-1)/2$ binary discriminant functions separate pairs of classes. The resulting decision boundaries of the two methods are reported in the pictures below.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_2.png?raw=1" width="400">
</p>

In both cases there are regions of ambiguity. We can solve this problem by considering a single $K$-class discriminant comprising $K$ linear functions of the form:

$$
\begin{align}
y_k(\textbf x) = \textbf w_k^T \textbf x + w_{k0} 
\end{align}
$$

Then, we assign a point to the class $ \mathcal C_k $ if $y_k(\textbf x) > y_j(\textbf x) $ for all $j \neq k$. The decision boundary between class $k$ and $j$ is therefore given by $y_k(\textbf x) = y_j(\textbf x)$ and hence corresponds to a $(D-1)$-dimensional hyperplane. It can easily be shown that the decision regions are always singly connected and convex. Let's take two points $\textbf x_A$ and $\textbf x_B$ inside the region $\mathcal R_k$. Any point that lies on the line segment connecting these two points can be expressed, for $0 \leq \lambda \leq 1$, as:

$$
\begin{align}
\hat{\textbf x}= \lambda \textbf x_A + (1- \lambda) \textbf x_B 
\end{align}
$$

Thanks to the linearity of the discriminant functions:

$$
\begin{align}
y_k(\hat{\textbf x})= \lambda y_k(\textbf x_A) + (1- \lambda) y_k(\textbf x_B)
\end{align}
$$

Since both $\textbf x_A$ and $\textbf x_B$ lie inside $\mathcal R_k$, we have $y_k(\textbf x_A) > y_j(\textbf x_A)$ and $y_k(\textbf x_B) > y_j(\textbf x_B)$ for all $j \neq k$. It follows that $y_k(\hat{\textbf x}) > y_j(\hat{\textbf x})$ for all $j \neq k$, and so $\hat{\textbf x}$ also belongs to $\mathcal R_k$.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_3.png?raw=1" width="300">
</p>
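The convexity argument can also be verified empirically. The sketch below draws random (hypothetical) parameters for a $K$-class linear discriminant, picks pairs of points that fall in the same region and checks that random points on the segment between them receive the same label. The names `classify`, `W` and `w0` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(K, D))    # row k holds w_k (hypothetical values)
w0 = rng.normal(size=K)        # biases w_k0

def classify(X):
    """Assign each row of X to the class with the largest y_k(x)."""
    return np.argmax(X @ W.T + w0, axis=1)

# Pairs of points (x_A, x_B) and a random lambda in [0, 1] for each pair.
X_a = rng.normal(size=(2000, D))
X_b = rng.normal(size=(2000, D))
lam = rng.uniform(size=(2000, 1))
X_hat = lam * X_a + (1.0 - lam) * X_b

# Keep only the pairs whose endpoints lie in the same decision region.
same = classify(X_a) == classify(X_b)
convex_ok = np.all(classify(X_hat)[same] == classify(X_a)[same])
print(same.sum(), convex_ok)
```

A numerical check like this is of course not a proof, but it is a quick sanity test of the `argmax` decision rule.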

### <span style='font-family:Marker Felt;font-weight:bold'>Least squares for classification</span>

Considering a general classification problem with $K$ classes and the $1$-of-$K$ encoding scheme, we can use least squares to approximate the conditional expectation $ \mathbb E[\textbf t| \textbf x] $. Each class is described by its own linear model and we can conveniently write the model in matrix notation:

$$
\begin{align}
\textbf y(\textbf x) = \widetilde{\textbf W}^T \widetilde{\textbf x} 
\end{align}
$$

where $\widetilde{\textbf W}$ is a $(D+1) \times K$ matrix whose $k$-th column is $\widetilde{\textbf w}_k = (w_{k0}, \textbf w_k^T)^T $ and $\widetilde{\textbf x} = (1, \textbf x^T)^T$. We can determine the parameter matrix by minimizing the sum-of-squares error function. Given a dataset $\mathcal D=\{\textbf x_n, \textbf t_n\}$ where $n=1,...,N$, the least-squares solution is:

$$
\begin{align}
\widetilde{\textbf W} = (\widetilde{\textbf X}^T \widetilde{\textbf X})^{-1} \widetilde{\textbf X}^T \textbf T
\end{align}
$$

where $\textbf{T}$ is an $N \times K$ matrix whose $n$-th row is the vector $\textbf t_n^T$ and $ \widetilde{\textbf X}$ is an $N \times (D+1)$ matrix whose $n$-th row is $\widetilde{\textbf x}_n^T$. A new input is assigned to the class for which the output $y_k = \widetilde{\textbf w}_k^T \widetilde{\textbf x}$ is largest.
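A minimal sketch of least squares for classification follows. It solves the same problem as the closed-form expression above but calls `np.linalg.lstsq` instead of forming the inverse explicitly, which is a numerical design choice of this sketch. The toy dataset (two well-separated Gaussian blobs) and the function names are assumptions.

```python
import numpy as np

def add_bias(X):
    """Build X_tilde by prepending a column of ones to X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X, T):
    """Minimise ||X_tilde W - T||^2; X is (N, D), T is (N, K) in 1-of-K coding."""
    W_tilde, *_ = np.linalg.lstsq(add_bias(X), T, rcond=None)
    return W_tilde                      # shape (D + 1, K)

def predict(X, W_tilde):
    """Pick the class whose linear output y_k is largest."""
    return np.argmax(add_bias(X) @ W_tilde, axis=1)

# Hypothetical toy data: two blobs centred at (-4, 0) and (4, 0).
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0], [4.0, 0.0]])
labels = np.repeat([0, 1], 50)
X = centers[labels] + rng.normal(scale=0.5, size=(100, 2))
T = np.eye(2)[labels]                   # 1-of-K targets

W_tilde = fit_least_squares(X, T)
accuracy = np.mean(predict(X, W_tilde) == labels)
print(W_tilde.shape, accuracy)
```

On clean, well-separated data such as this the method works well; the weaknesses discussed next appear with outliers or with more than two classes.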

Although the least-squares method provides a closed-form solution for the parameters, it lacks robustness to outliers, as is clearly visible in the pictures below. The green line is obtained with logistic regression (presented later) and the purple one is obtained with least squares.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_4.png?raw=1" width="400">
</p>

Another problem of the least-squares method comes from its implicit distributional assumption. Least squares corresponds to maximum likelihood under a Gaussian conditional distribution (in linear regression models the Gaussian enters through the noise model), whereas the binary target vectors clearly have a distribution that is far from Gaussian. In the pictures below we have on the left the classification boundaries provided by least squares and on the right the ones provided by logistic regression. In this case, although the three classes can easily be separated by linear boundaries, least squares completely fails to do so.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_5.png?raw=1" width="400">
</p>

### <span style='font-family:Marker Felt;font-weight:bold'>Fixed basis functions</span>

As in the case of regression, we can work in the original input space or we can apply a nonlinear transformation to it using a vector of *basis functions* $\phi(\textbf x)$. The resulting decision boundaries will be linear in the feature space $\phi$ but nonlinear in the original input space. In this way, we can obtain a linear separation in the new space even when it would be impossible in the original one. An example of this approach is reported in the pictures below.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_6.png?raw=1" width="450">
</p>

The red and blue dots correspond to two different classes. In the figure on the left it is clearly impossible to draw a linear boundary that separates blue and red dots. If we take two *Gaussian basis functions*, $\phi_1(\textbf x)$ and $\phi_2(\textbf x)$, we can easily separate the two classes in the figure on the right. Another approach could be to transform the original features into polar coordinates. In the new feature space the horizontal axis will be the radius $\rho$ and the vertical one the angle $\theta$. In linear classification too, the big challenge is to find the best way of representing the input variables.
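The effect of such a transformation can be sketched on synthetic data. The dataset below (an inner disc surrounded by a ring) is hypothetical and is not the one shown in the figure; for brevity a single Gaussian basis function centred at the origin is used rather than two, and the thresholds are chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Hypothetical data: class 0 inside the unit disc, class 1 on a ring of radius 2..3.
radius = np.concatenate([rng.uniform(0.0, 1.0, N), rng.uniform(2.0, 3.0, N)])
angle = rng.uniform(0.0, 2.0 * np.pi, 2 * N)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
labels = np.repeat([0, 1], N)

def gaussian_basis(X, centre, s=1.0):
    """phi(x) = exp(-||x - centre||^2 / (2 s^2))."""
    return np.exp(-np.sum((X - centre) ** 2, axis=1) / (2.0 * s ** 2))

# Feature 1: Gaussian basis function centred at the origin.
phi_1 = gaussian_basis(X, centre=np.array([0.0, 0.0]))
pred_gauss = (phi_1 < 0.4).astype(int)      # a linear threshold in feature space

# Feature 2: the radius rho of the polar representation.
rho = np.hypot(X[:, 0], X[:, 1])
pred_polar = (rho > 1.5).astype(int)        # again a linear threshold

print(np.mean(pred_gauss == labels), np.mean(pred_polar == labels))
```

No straight line separates the two classes in the original coordinates, yet a single threshold on either feature does, which is the sense in which the boundary is linear in the feature space and nonlinear in the input space.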

### <span style='font-family:Marker Felt;font-weight:bold'>The perceptron</span>

Another discriminant (non-probabilistic) model for linear classification is the *perceptron* of Rosenblatt (1962). It is an online algorithm, so it can process one sample at a time, and it handles two-class problems. The first step of the perceptron is to transform the original input space with fixed nonlinear basis functions; it then uses these new features to construct a generalized linear model.

$$
\begin{align}
y(\textbf x) = f(\textbf w^T \phi(\textbf x)) 
\end{align}
$$

The nonlinear activation function $f(\cdot)$ is given by a sign function:

$$
\begin{align}
f(a) = 
\begin{cases}
+1, \quad a \geq 0 \\
-1, \quad a < 0
\end{cases}
\end{align}
$$

We assign the value $+1$ to the class $\mathcal C_1$ and the value $-1$ to the class $\mathcal C_2$. One way to learn the parameters of the model would be to minimize the number of misclassified points. This does not lead to a practical algorithm, since the error function would be a piecewise constant function of $\textbf w$. To see this, imagine moving the decision boundary without increasing or decreasing the number of misclassified points: the error function remains constant. Optimizing such a function through gradient descent would run into plateaux, where the gradient is zero. A better choice of error function is the so-called **perceptron criterion**. We are looking for a weight vector such that $\textbf w^T \phi(\textbf x) > 0$ for inputs in class $\mathcal C_1$ and $\textbf w^T \phi(\textbf x) < 0$ for inputs in class $\mathcal C_2$. With targets $t \in \{-1, +1 \}$, we would like all the inputs to satisfy $\textbf w^T \phi(\textbf x_n) t_n > 0$. The perceptron criterion assigns zero error to correctly classified inputs, while for misclassified inputs it tries to minimize $-\textbf w^T \phi(\textbf x_n) t_n$. The resulting loss function is:

$$
\begin{align}
L_P(\textbf w) = -\sum_{n \in \mathcal M} \textbf w^T \phi_n t_n
\end{align}
$$

where $\phi_n = \phi(\textbf x_n)$ and $\mathcal M$ represents the set of all misclassified points. In this way, the error function is piecewise linear in $\textbf w$. The parameters are optimized with Stochastic Gradient Descent (SGD) as follows:

$$
\begin{align}
\textbf w^{(k + 1)} = \textbf w^{(k)} - \alpha \nabla L_P(\textbf w) = \textbf w^{(k)} + \alpha \phi(\textbf x_n) t_n
\end{align}
$$

where $\alpha$ is the learning rate. Since multiplying $\textbf w$ by a positive constant does not change the perceptron output $y(\textbf x)$, we can set the learning rate equal to one without loss of generality. The pseudo code of the perceptron is:

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_7.png?raw=1" width="200">
</p>
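A runnable sketch of the update rule described above is given below. It may differ in details from the pseudo code in the figure: the zero initialisation, the `max_epochs` cap, treating points exactly on the boundary as misclassified and the toy dataset are all assumptions of this sketch.

```python
import numpy as np

def perceptron(Phi, t, max_epochs=100):
    """Perceptron learning with learning rate 1.

    Phi: (N, M) feature matrix, including a constant column for the bias.
    t:   (N,) targets in {-1, +1}.
    Returns the weights and a flag telling whether an error-free pass occurred.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:     # misclassified, or exactly on the boundary
                w = w + phi_n * t_n        # w <- w + phi(x_n) t_n
                errors += 1
        if errors == 0:                    # a full pass without updates: converged
            return w, True
    return w, False

# Hypothetical linearly separable data: blobs around (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
t = np.repeat([-1, 1], 50)
X = 2.0 * t[:, None] + rng.normal(scale=0.5, size=(100, 2))
Phi = np.hstack([np.ones((100, 1)), X])    # identity features plus bias

w, converged = perceptron(Phi, t)
print(w, converged)
```

The cap on the number of epochs is needed in practice because, on data that are not linearly separable, the loop would otherwise never terminate.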

An update example of the perceptron algorithm is illustrated in the figures below.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_14.png?raw=1" width="500">
</p>

The black arrow represents the current parameter vector $\textbf w$, while the point circled in green is the misclassified one. The feature vector of the misclassified point is added to the current weight vector, giving the new decision boundary. Since the change in the weight vector may cause some previously correctly classified points to become misclassified, the perceptron rule is not guaranteed to reduce the total error at each step. The effect of a single update is just to reduce the error due to the misclassified pattern:

$$
\begin{align}
- \textbf w^{(k+1)T} \phi_n t_n = - \textbf w^{(k)T} \phi_n t_n - (\phi_n t_n)^T\phi_n t_n < - \textbf w^{(k)T} \phi_n t_n 
\end{align}
$$

The *perceptron convergence theorem* states that if there exists an exact solution, that is if the training set is linearly separable in the feature space $\phi$, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. However, convergence can be really slow, and until it is reached we cannot distinguish a non-separable problem from a slowly converging one. If multiple solutions exist, the one found at convergence depends on the initialization of the parameters and on the order of presentation of the data points. For datasets that are not linearly separable the perceptron will never converge.

## <span style='font-family:Marker Felt;font-weight:bold'>Probabilistic discriminative approach</span>

---

### <span style='font-family:Marker Felt;font-weight:bold'>Logistic regression</span>

We will now take a step forward considering one of the most used algorithms for classification. Let's start considering the case of two classes and the posterior probability for the first class.

$$
\begin{align}
p(\mathcal C_1 | \textbf x) = \frac{p(\textbf x | \mathcal C_1) p(\mathcal C_1)}{p(\textbf x | \mathcal C_1) p(\mathcal C_1) + p(\textbf x | \mathcal C_2) p(\mathcal C_2)} =
\frac{1}{1 + \exp(-a)} = \sigma(a) 
\end{align}
$$

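where $a = \ln \frac{p(\textbf x | \mathcal C_1) p(\mathcal C_1)}{p(\textbf x | \mathcal C_2) p(\mathcal C_2)}$ and $\sigma(a)$ is called the *logistic sigmoid function*. Taking $a$ to be a linear function of the feature vector, we can write the posterior of class $\mathcal C_1$ as a logistic sigmoid acting on it: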

$$
\begin{align}
p(\mathcal C_1 | \phi) = y(\phi) = \sigma(\textbf w^T \phi)
\end{align}
$$

Consequently $p(\mathcal C_2 | \phi) = 1 - p(\mathcal C_1 | \phi)$. This is a generalized linear model since, as in the previous cases, the boundary is defined by $\sigma(\textbf w^T \phi) =$ const. Although it is used for classification, this model is called *logistic regression*. The shape of the logistic sigmoid is depicted in the figure below.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_12.png?raw=1" width="300">
</p>

The parameters of the logistic regression model are determined through maximum likelihood. Given a dataset $\mathcal D = \{\textbf x_n, t_n\}$ for $n=1,...,N$ and $t \in \{0, 1\}$, the likelihood function is given by:

$$
\begin{align}
p(\textbf t | \textbf X, \textbf w) = \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1-t_n} 
\end{align}
$$

where $y_n = \sigma(\textbf w^T \phi_n)$. We can define the error function by taking the negative log of the likelihood, obtaining the **cross entropy error function**:

$$
\begin{align}
L(\textbf w) = - \ln p(\textbf t | \textbf X, \textbf w) = - \sum_{n=1}^N(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)) = \sum_{n=1}^N L_n
\end{align}
$$

In this case there is no closed-form solution, due to the nonlinearity of the sigmoid, but we can compute the gradient of the loss function w.r.t. the parameters, obtaining

$$
\begin{align}
\nabla L(\textbf w) = \sum_{n=1}^N(y_n - t_n) \phi_n 
\end{align}
$$

Notice that it takes exactly the same form as the gradient of the sum-of-squares error function for the linear regression model. The error function is convex, so any minimum reached by gradient-based optimization is a global one, and online methods can be adopted as well.
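The gradient expression above is enough to train the model with plain batch gradient descent, as in the following sketch. The learning rate, the number of iterations, the averaging of the gradient over the dataset and the synthetic two-blob data are assumptions of this example, not choices taken from the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    """Negative log-likelihood; clipping guards against log(0)."""
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient(w, Phi, t):
    """Gradient of the cross entropy: sum_n (y_n - t_n) phi_n."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

def fit_logistic(Phi, t, lr=0.1, n_iters=2000):
    """Batch gradient descent on the averaged loss, starting from w = 0."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        w -= lr * gradient(w, Phi, t) / len(t)
    return w

# Hypothetical data: two overlapping blobs centred at (-1.5, -1.5) and (1.5, 1.5).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 2)) + np.where(labels[:, None] == 1, 1.5, -1.5)
Phi = np.hstack([np.ones((200, 1)), X])    # identity features plus bias
t = labels.astype(float)

w = fit_logistic(Phi, t)
accuracy = np.mean((sigmoid(Phi @ w) >= 0.5) == labels)
print(cross_entropy(np.zeros(3), Phi, t), cross_entropy(w, Phi, t), accuracy)
```

Replacing the full sum in `gradient` with the term of a single sample gives the online (stochastic) variant mentioned in the text.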

<div class="alert alert-block alert-warning">
<b>NOTE (Deriving logistic regression):</b>
The idea is that we want to use the same linear equation to compute the probability of the two classes. First of all, in binary classification the output can be $0$ or $1$, but in a linear model it can range from $-\infty$ to $+\infty$. Let's bound the linear equation between $0$ and $1$ in order to have a valid probability function.
    
$$ y(\textbf w) = \textbf w^T \textbf x $$

$$ p = \frac{e^y}{e^y + 1} $$

Let's now take the odds, that is the ratio between the probability of the positive event and the probability of the negative one.

$$ \text{odds}(p) = \frac{p}{1 - p} = e^y $$

By taking the log of the odds we have:

$$ \log\Big(\frac{p}{1-p}\Big) = y $$
</div>
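The relation derived in the note, namely that the log-odds invert the logistic function, can be checked numerically with a short sketch. The function names are chosen here for illustration.

```python
import numpy as np

def logistic(y):
    """p = e^y / (e^y + 1), which maps any real y into (0, 1)."""
    return np.exp(y) / (np.exp(y) + 1.0)

def log_odds(p):
    """log(p / (1 - p)), also known as the logit of p."""
    return np.log(p / (1.0 - p))

y = np.linspace(-5.0, 5.0, 11)   # values of the linear model
p = logistic(y)                  # squashed into valid probabilities
recovered = log_odds(p)          # applying the log-odds gives back y

print(np.allclose(recovered, y))  # True
```

This is why logistic regression is often described as a linear model for the log-odds of the positive class.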

### <span style='font-family:Marker Felt;font-weight:bold'>Multiclass logistic regression</span>

In the multiclass case the posterior probabilities are represented through the generalization of the logistic sigmoid called **softmax function**. This name comes from the fact that the function is a smoothed version of the 'max' function. The posterior can be written as:

$$
\begin{align}
p(\mathcal C_k | \phi) = y_k(\phi) = \frac{\exp(\textbf w_k^T \phi)}{\sum_j \exp(\textbf w_j^T \phi)} 
\end{align}
$$

The denominator guarantees that each probability is bounded between $0$ and $1$ and that the probabilities sum to one over the classes. At this point we compute the value of the parameters directly with maximum likelihood. It is easy to write down the likelihood using the $1$-of-$K$ coding scheme, in which each target component $t_{nk}$ is binary:

$$
\begin{align}
p(\textbf T | \textbf w_1, ..., \textbf w_K) = \prod_{n=1}^N \prod_{k=1}^Kp(\mathcal C_k | \phi_n )^{t_{nk} } = \prod_{n=1}^N \prod_{k=1}^K y_{nk}^{t_{nk}} 
\end{align}
$$

where $\textbf T$ is a $N \times K$ matrix of target variables and:

$$
\begin{align}
y_{nk} = p(\mathcal C_k | \phi_n) = \frac{\exp(\textbf w_k^T \phi_n)}{\sum_j \exp(\textbf w_j^T \phi_n)} 
\end{align}
$$

Taking the negative logarithm gives the **cross-entropy function**:

$$
\begin{align}
L(\textbf w_1, ..., \textbf w_K) = - \ln p(\textbf T | \textbf w_1, ..., \textbf w_K) = - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk} 
\end{align}
$$

Taking the gradient w.r.t. a generic parameter vector, for example $\textbf w_j$, we obtain:

$$
\begin{align}
\nabla _{\textbf w_j} L(\textbf w_1, ..., \textbf w_K) = \sum_{n=1}^N(y_{nj} - t_{nj}) \phi_n 
\end{align}
$$
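The softmax posterior and the gradient above can be sketched in a few lines. Subtracting the row maximum before exponentiating is a standard numerical safeguard added here, not something stated in the notes; the random data and the names `Phi`, `T` and `W` (one column per class) are illustrative assumptions.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T):
    """- sum_n sum_k t_nk ln y_nk for 1-of-K targets T."""
    return -np.sum(T * np.log(softmax(Phi @ W)))

def gradient(W, Phi, T):
    """Column j holds sum_n (y_nj - t_nj) phi_n."""
    return Phi.T @ (softmax(Phi @ W) - T)

# Hypothetical random problem: N samples, M features, K classes.
rng = np.random.default_rng(0)
N, M, K = 50, 3, 4
Phi = rng.normal(size=(N, M))
T = np.eye(K)[rng.integers(0, K, size=N)]
W = rng.normal(size=(M, K))

Y = softmax(Phi @ W)
print(Y.sum(axis=1)[:3])            # each row sums to 1
print(gradient(W, Phi, T).shape)    # (3, 4), same shape as W
```

Stacking the $K$ weight vectors as columns of `W` lets the whole gradient be computed with a single matrix product.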

Once again, we find the same form that was found for the sum-of-squares error function with the linear model and for the cross-entropy with logistic regression. In this case too we can apply a sequential learning algorithm. As one can observe from the plot of the logistic function, it is very similar to the step function:

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict3/figure3_13.png?raw=1" width="600">
</p>

The main difference is that the logistic function is smooth, which removes the instability that prevents the perceptron algorithm from converging when the classes are not linearly separable. Logistic regression converges even when the classes are not linearly separable, reaching a solution in which some points are misclassified.

> With this notebook we have finished all the *basic* tools of machine learning

<span style='font-family:Marker Felt;font-size:20pt;color:DarkCyan;font-weight:bold'>References:</span>

1. *Restelli M., Machine Learning - course slides*

2. *Bishop C.M., "Pattern Recognition and Machine Learning", chapter: 4*