# 1) Linear models

* Supervised learning (briefly)
* Perceptron (briefly)
* Linear Regression
    * Hypothesis set
    * Error function
    * Formula for $w^\ast$
* Logistic Regression (classification)
    * Hypothesis set
    * Sigmoid function
    * Error function
    * Gradient descent
    * Softmax 
* Non-linear transforms (briefly)

### Supervised learning

In the Supervised Learning setting we are given a dataset $D = \{(x_1,y_1),...,(x_n,y_n)\}$ to learn from. The way we "learn" is by using D to search a hypothesis space H for some "best" hypothesis $g\in H$. By "best", we mean that we want g to be as close to the __unknown target function f__ as possible, i.e.:

$$g \approx f$$

We most often find g by minimizing an in-sample error function $E_{in}(h)$. The process of finding g is what we call "training" and when we have found it, we can use it to predict on new inputs.

Within supervised learning we distinguish between __two subcategories__. If the output y is a real number we call it __regression__. If the output y is a member of a discrete set - e.g. $y \in \{red,blue,green\}$ - then we call it __classification__.

In Supervised Learning we try to minimize $E_{in}(g)$, but what we _really_ care about is a low out-of-sample error $E_{out}(g)$, i.e. how g performs on _new_ data. Since we cannot measure $E_{out}(g)$ during training, we hope that the model generalizes, so that $E_{in}(g)$ - which we _can_ measure - is close to $E_{out}(g)$.


### Perceptron
The perceptron is the simplest of the linear models. It is a binary __classification__ model ({-1,+1}), and the hypothesis set looks as follows:

$$ H=\{h(x)=\text{sign}(w^T x) \mid w\in \mathbb{R}^{d+1}\}$$

The __Perceptron Learning Algorithm (PLA)__ searches for a hyperplane which separates all datapoints correctly. While a misclassified point x exists, the algorithm updates the hypothesis, effectively moving the hyperplane in the direction of correctly classifying x. The algorithm never terminates if the given data is not linearly separable, but if the data is linearly separable, it is guaranteed to find a $g\in H$ with $E_{in} = 0$.

The error function we wish to minimize is:

$$E_{in}(h)=\frac{1}{N}\sum_{i=1}^N \mathbb{1}_{h(x_i)\neq y_i} $$

Where $\mathbb{1}_{h(x_i)\neq y_i} = 1$ if h misclassifies $x_i$, and 0 if not.  

With a simple augmentation, the PLA can be used on datasets that are not linearly separable. The idea is to keep the currently best h "in a pocket", while running for i iterations (instead of until $E_{in} = 0$). After the i iterations, the algorithm stops and returns the stored hypothesis.
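The pocket idea can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `pocket_pla`, the random choice of which misclassified point to update on, and the toy data are all assumptions made for the example.

```python
import numpy as np

def pocket_pla(X, y, max_iter=1000, seed=0):
    """Pocket variant of the PLA (illustrative sketch).

    X: (N, d+1) array whose first column is all ones.
    y: (N,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_iter):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break                      # E_in = 0, nothing left to fix
        i = rng.choice(misclassified)  # assumption: pick one at random
        w = w + y[i] * X[i]            # standard PLA update
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:             # keep the best hypothesis "in the pocket"
            best_w, best_err = w.copy(), err
    return best_w, best_err

# Toy, linearly separable data (made up for illustration)
X = np.array([[1, 2, 1], [1, 1, 3], [1, -1, -2], [1, -2, -1]], dtype=float)
y = np.array([1, 1, -1, -1])
best_w, best_err = pocket_pla(X, y)
```

On data that is not linearly separable the loop simply runs for `max_iter` iterations and returns the pocketed weights with the lowest in-sample error seen.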

### Linear Regression

In Linear Regression, we are in the __regression__ setting. We wish to find the hyperplane which best explains the observed datapoints in D. This hyperplane can then be used to output real numbers based on new input. The hypothesis set looks as follows:

$$ H=\{h(x)=w^T x \mid w\in \mathbb{R}^{d+1}\}$$

Notice that it is similar to the hypothesis set of the Perceptron; the only difference is that here we don't apply the sign() function to the signal $w^T x$.

#### Training
In most learning situations we have to settle for a near-optimal solution. But in Linear Regression the in-sample error function $E_{in}(h) = \frac{1}{N}\sum_{i=1}^N (h(x_i)-y_i)^2$, the mean squared error used in __least squares__, is "simple" enough to allow us to find a __closed-form solution__. Thus we can - in a sort of one-step learning process - pin down a weight vector $w^\ast$, which defines the __optimal hyperplane__ for the given dataset. The equation looks as follows:

$$w^\ast = (X^TX)^{-1}X^Ty$$

Here y is a column vector of all the $y_i$'s of the training data, and X is the $n \times (d+1)$ matrix where each row contains the entries of a single datapoint (and the first column is all ones).
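As a small sketch of the closed-form solution in NumPy: the synthetic data, the "true" weights `w_true` and the noise level below are made up for illustration. Instead of forming the inverse explicitly, the code solves the linear system $X^TXw = X^Ty$, which gives the same $w^\ast$ when $X^TX$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2

# Design matrix: first column all ones, then d random features
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
w_true = np.array([1.0, -2.0, 0.5])           # hypothetical target weights
y = X @ w_true + 0.01 * rng.normal(size=N)    # targets with a little noise

# w* = (X^T X)^{-1} X^T y, computed by solving X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# In-sample (mean squared) error of the fitted hyperplane
E_in = np.mean((X @ w_star - y) ** 2)
```

With so little noise, `w_star` should land very close to `w_true`.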

#### Deriving $w^\ast$

Below are the steps for deriving the equation for $w^\ast$. We start from the error function in the form above:

$$E_{in}(h) = \frac{1}{N} \sum_{i=1}^N (h(x_i)-y_i)^2$$
 
As any $ h\in H $ is completely defined by some vector of weights w, we can replace h(x):
    
$$E_{in}(w) = \frac{1}{N} \sum_{i=1}^N (w^Tx_i-y_i)^2$$
    
This can be rewritten in matrix/vector form. Here $Xw -y =  [x_1^Tw-y_1,..,x_N^Tw-y_N]^T$.
    
$$E_{in}(w) = \frac{1}{N} ||Xw-y||^2$$
    
We wish to minimize $E_{in}$  with respect to w, so we find the gradient consisting of all partial derivatives and set it equal to the 0-vector.

For __an example with 1 variable and 2 datapoints__ (so N = 2), $E_{in}(w)$ looks as follows when expanded ($x_{i,j}$ is the i'th feature of the j'th datapoint):

$$ E_{in}(w) = \frac{1}{N}\big((x_{0,1} w_0 + x_{1,1} w_1 -y_1)^2 +(x_{0,2} w_0 + x_{1,2} w_1 -y_2)^2\big)$$

So when using the chain rule to compute the gradient for the expanded $E_{in}(w)$ above we get:


\begin{equation} 
       \begin{split}
        \nabla E_{in}(w) &= {\begin{bmatrix}
           \frac{\partial E_{in}}{\partial w_0} \\
           \frac{\partial E_{in}}{\partial w_1} \\
         \end{bmatrix}}\\\\
          &= \frac{1}{N} \begin{bmatrix}
           2(x_{0,1} w_0 + x_{1,1} w_1 -y_1)\cdot (x_{0,1} \cdot 1 + 0 - 0) + 2(x_{0,2} w_0 + x_{1,2} w_1 -y_2) \cdot (x_{0,2} \cdot 1 + 0 - 0)  \\
           2(x_{0,1} w_0 + x_{1,1} w_1 -y_1)\cdot (0 + x_{1,1} \cdot 1 - 0) + 2(x_{0,2} w_0 + x_{1,2} w_1 -y_2) \cdot (0 + x_{1,2} \cdot 1 - 0)  \\
         \end{bmatrix}\\\\
          &= \frac{2}{N} \begin{bmatrix}
           (x_{0,1} w_0 + x_{1,1} w_1 -y_1)\cdot x_{0,1} + (x_{0,2} w_0 + x_{1,2} w_1 -y_2) \cdot x_{0,2}  \\
           (x_{0,1} w_0 + x_{1,1} w_1 -y_1)\cdot x_{1,1} + (x_{0,2} w_0 + x_{1,2} w_1 -y_2) \cdot x_{1,2} \\
         \end{bmatrix}\\\\
         &= \frac{2}{N} \begin{bmatrix} 
         x_{0,1} & x_{0,2}\\
         x_{1,1} & x_{1,2}\\
         \end{bmatrix}
         \begin{bmatrix}
           (x_{0,1} w_0 + x_{1,1} w_1 -y_1) \\
           (x_{0,2} w_0 + x_{1,2} w_1 -y_2) \\
         \end{bmatrix}\\\\
          &= \frac{2}{N} X^T \begin{bmatrix}
          (x_{0,1} w_0 + x_{1,1} w_1 -y_1) \\
          (x_{0,2} w_0 + x_{1,2} w_1 -y_2) \\
         \end{bmatrix}\\\\
         &= \frac{2}{N}X^T(Xw-y)
    \end{split}
    \end{equation}
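The result $\nabla E_{in}(w) = \frac{2}{N}X^T(Xw-y)$ can be sanity-checked numerically. The sketch below compares it with a central finite-difference approximation on a made-up 2-datapoint example; the helper names and the numbers are illustrative only.

```python
import numpy as np

def E_in(w, X, y):
    # Mean squared error, (1/N) * ||Xw - y||^2
    return np.mean((X @ w - y) ** 2)

def grad_E_in(w, X, y):
    # Analytic gradient from the derivation: (2/N) X^T (Xw - y)
    return 2.0 / len(y) * X.T @ (X @ w - y)

def numerical_grad(f, w, eps=1e-6):
    # Central differences, one coordinate of w at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

# Two datapoints, one variable (plus the constant first column)
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([3.0, 0.5])
w = np.array([0.2, -0.7])

analytic = grad_E_in(w, X, y)
numeric = numerical_grad(lambda v: E_in(v, X, y), w)
```

The two vectors should agree up to rounding error.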
    

    
Now if we set it equal to the 0-vector and rewrite a bit, we get (ignoring the constant factor $\frac{2}{N}$):
    
$$X^TXw =X^Ty$$
    
Now we can solve for w by left-multiplying both sides with the inverse of $X^TX$ (assuming it is invertible):
    
$$w^\ast =(X^TX)^{-1}X^Ty$$

And we have a formula for THE best w for the given data.

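$X^\dagger = (X^TX)^{-1}X^T$ is called the pseudo-inverse of X, because it acts as a left inverse: $X^\dagger X = I$, the $(d+1)\times(d+1)$ identity matrix, provided $X^TX$ is invertible. Note that $XX^\dagger$ is in general not the identity.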
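A quick numerical illustration of the pseudo-inverse, using a small random matrix made up for the example. It checks the product in both orders and compares against NumPy's `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(1)
# 6 datapoints, 2 features, plus the constant first column
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])

X_dagger = np.linalg.inv(X.T @ X) @ X.T   # (X^T X)^{-1} X^T, shape (3, 6)

left = X_dagger @ X     # 3 x 3, equals the identity
right = X @ X_dagger    # 6 x 6, a projection matrix, not the identity
```

For a matrix with full column rank, `X_dagger` should match `np.linalg.pinv(X)` up to rounding.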

TODO write more about inverse matrices

#### Linear Regression for classification

TODO


### Logistic Regression (classification)
In Logistic Regression we model the target function as a probability distribution $f(x)=p(y\mid x)$. The output is again a real number, but we choose to interpret it as the probability of belonging to class 1. Thus if the output is greater than 0.5 we output "class 1", and we output "class 0" otherwise. This means we can use it for classification, with the nice by-product of knowing the probability of each class.

The hypothesis set looks as follows:

$$H = \{h(x) = \sigma(w^Tx) | w \in \mathbb{R}^{d+1}\}$$

Here, $\sigma()$ refers to the sigmoid "squashing" function:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$ 

Which compresses the input x into the range $]0,1[$, such that $\sigma(x)$ is close to 1 if x is a large positive number, and close to 0 if x is a large negative number. Thus we have:
$$
p(y \mid x, w) = \left \{
\begin{array}{l l}
  \sigma(w^\intercal x)
  & \text{ if } y=1  \\
  1 - \sigma(w^\intercal x)
  & \text { if } y=0
\end{array}
\right.
$$

That is, the probability that $y = 1$ given $x$ and $w$
is $\sigma(w^\intercal x)$. Similarly, the probability that $y=0$ given $x$ and $w$ is $1-\sigma(w^\intercal x)$.
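The hypothesis and the 0.5 decision rule can be sketched in a few lines of NumPy. The function names and the example weights are illustrative assumptions, and the plain formula is used without any guard against overflow for very negative inputs.

```python
import numpy as np

def sigmoid(s):
    # 1 / (1 + e^{-s}), applied elementwise
    return 1.0 / (1.0 + np.exp(-s))

def predict_proba(X, w):
    # p(y = 1 | x, w) for every row x of X (first column of X is all ones)
    return sigmoid(X @ w)

def predict(X, w):
    # Output class 1 if the probability exceeds 0.5, else class 0
    return (predict_proba(X, w) > 0.5).astype(int)

X = np.array([[1.0, 3.0], [1.0, -3.0]])   # two made-up datapoints
w = np.array([0.0, 2.0])                  # hypothetical weights
probs = predict_proba(X, w)
classes = predict(X, w)
```

Here the first point has signal $w^\intercal x = 6$ and the second $-6$, so the predicted probabilities are close to 1 and 0 respectively.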
#### Training

To train a Logistic Regression model we want to maximize the following probability:

$$\begin{align}
p(D \mid w)
&= \prod_{(x,y)\in D}
  p(y \mid x,w)\\
&= \prod_{(x,y)\in D}
  \sigma(w^\intercal x )^{y}
  (1-\sigma(w^\intercal x))^{1-y}
  \end{align}$$

That is, we wish to find the w that maximizes the probability of seeing D given w. We call this a __maximum likelihood estimate (MLE) of w__. The first equality comes from the assumption that all datapoints of D are sampled independently according to some unknown input distribution $P(X)$, and the second from the fact that either $\sigma(w^\intercal x )^{y}$ or $(1-\sigma(w^\intercal x))^{1-y}$ must be 1 depending on the value of y (y is 1 or 0). $p(D \mid w)$ is called the _likelihood_ of D given w.

Instead of maximizing the likelihood above directly, we equivalently _minimize_ the __negative log likelihood (NLL)__:

$$\begin{align}
\mathrm{NLL}(D\mid w)
&= -\ln{p(D\mid w)}\\
&=- \sum_{i=1}^n \Big[
y_i \ln(\sigma(w^\intercal x_i)) +
(1-y_i) \ln(1-\sigma(w^\intercal x_i))
\Big]
\end{align}$$

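This is done for the following reasons: 1) Multiplying many small probabilities can cause numerical issues (underflow). 2) The logarithm is monotonically increasing, so maximizing the likelihood is equivalent to minimizing the negative log likelihood. 3) Products turn into sums by the logarithm rule $\ln(ab) = \ln a + \ln b$. Notice that for each datapoint one of the two terms in the sum is always 0, since $y_i$ is either 0 or 1.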
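A small sketch of the NLL and of the underflow problem it avoids. The data, weights and the 2000 probabilities of 0.5 are made up for illustration, and the function assumes labels in {0, 1}.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def nll(w, X, y):
    # -sum_i [ y_i ln(sigma(w^T x_i)) + (1 - y_i) ln(1 - sigma(w^T x_i)) ]
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])
w = np.array([0.1, 0.8])

# The likelihood as a product, and the NLL as a sum of logs
p = sigmoid(X @ w)
likelihood = np.prod(np.where(y == 1, p, 1 - p))
nll_value = nll(w, X, y)

# Underflow: a product of 2000 probabilities of 0.5 is 0.0 in float64,
# while the corresponding sum of logs is perfectly representable.
many = np.full(2000, 0.5)
product = np.prod(many)
log_sum = np.sum(np.log(many))
```

On the small dataset `nll_value` equals `-np.log(likelihood)`, while for the long product only the log form carries any information.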

The gradient of the NLL is:
$$
\nabla \mathrm{NLL}(D \mid w)
= \frac{\partial \mathrm{NLL}}{\partial w}
= -X^\intercal(y-\sigma(Xw))
$$
where $\sigma$ is applied elementwise to the vector $Xw$.

TODO show how the gradient is computed http://rasbt.github.io/mlxtend/user_guide/classifier/LogisticRegression/

And the final in-sample error that we use for minimization is the normalized NLL:

$$E_{in}(w) = \frac{1}{N} \mathrm{NLL}(D\mid w)$$

This error measure does not allow an easy analytical solution as in the case of Linear Regression, so we minimize it instead using __gradient descent__.
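A minimal sketch of batch gradient descent on $E_{in}(w)$, using the NLL gradient divided by N. The fixed learning rate `lr`, the number of steps, the zero initialization and the toy data are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def E_in(w, X, y):
    # Normalized negative log likelihood
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_E_in(w, X, y):
    # (1/N) * ( -X^T (y - sigma(Xw)) )
    return -X.T @ (y - sigmoid(X @ w)) / len(y)

def fit_logistic(X, y, lr=0.1, n_steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - lr * grad_E_in(w, X, y)   # step against the gradient
    return w

# Toy 1-D data with overlapping classes (made up for illustration)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 0.5, -0.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

w_fit = fit_logistic(X, y)
err_start = E_in(np.zeros(2), X, y)   # equals ln(2) at w = 0
err_end = E_in(w_fit, X, y)
```

Each iteration moves w a small step in the direction of steepest descent of $E_{in}$; after training, the in-sample error should be below its starting value of $\ln 2$.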

TODO GRADIENT DESCENT

#### One vs all

Logistic Regression as described above is a binary classifier. If we wish to allow more classes we can use the one vs all technique, where the problem is converted into K binary classification problems. K classifiers are then trained, one per class, where all datapoints except those belonging to the current class are marked as negative samples. Predicting is then a matter of evaluating all K models and returning the class whose model outputs the highest probability. An alternative is to use __softmax__.
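The one vs all scheme could be sketched as below. The helper names, the plain gradient descent trainer with fixed step size, and the three toy clusters are assumptions made for the example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_binary(X, y, lr=0.1, n_steps=2000):
    # Plain gradient descent on the normalized NLL (labels in {0, 1})
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w

def fit_one_vs_all(X, labels, K):
    # Column k holds the weights of the "class k vs the rest" classifier
    return np.column_stack(
        [fit_binary(X, (labels == k).astype(float)) for k in range(K)]
    )

def predict_one_vs_all(X, W):
    # Evaluate all K models and return the class with the highest probability
    return np.argmax(sigmoid(X @ W), axis=1)

# Three small, well separated clusters in 2-D (made up for illustration)
points = np.array([[0, 4], [1, 5], [-1, 4.5],
                   [-4, -3], [-5, -2], [-3.5, -4],
                   [4, -3], [5, -2], [3.5, -4]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X = np.column_stack([np.ones(len(points)), points])

W = fit_one_vs_all(X, labels, K=3)
predictions = predict_one_vs_all(X, W)
```

Note that the K probabilities produced this way come from independent models and need not sum to one.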

#### Softmax

Softmax is a generalization of logistic regression where instead of only 2 classes we now have K. This means y is no longer a column vector, but an $n \times K$ matrix containing all 0's except for one "1" per row, indicating which class the datapoint belongs to.

Instead of the sigmoid function we now use the Softmax function, which takes a vector of length K and outputs a vector of length K. The output vector sums to one, and can be interpreted as probabilities of each of the K classes.

$$
\textrm{softmax}(x)_j =
\frac{e^{x_j}}
{\sum_{i=1}^K e^{x_i}}\quad
\textrm{ for }\quad j = 1, \dots, K.
$$

w now becomes a $(d+1) \times K$ matrix, so that we have d + 1 parameters (+1 for bias) for each of the K classes (the parameters for class c are column c), i.e. $w = [w_1,...,w_K]$ with one column $w_c$ per class. We can then form the signal as a linear combination as before:

$$
p(y \mid x,w) =
\textrm{softmax}(w^\intercal x) =
 \left \{
\begin{array}{l l}
 \textrm{softmax}(w^\intercal x)_1 & \text{ if } y = e_1,  \\
 \vdots & \\
 \textrm{softmax}(w^\intercal x)_K & \text { if } y = e_K.
\end{array}
\right.
$$
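A small sketch of the softmax function and the resulting class probabilities. Subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result; the data and the weight matrix `W` are made up for illustration.

```python
import numpy as np

def softmax(Z):
    # Shift by the max along the last axis; softmax is invariant to this
    Z = Z - Z.max(axis=-1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=-1, keepdims=True)

# n = 2 datapoints, d + 1 = 3 inputs (first column all ones), K = 4 classes
X = np.array([[1.0, 0.2, -1.0],
              [1.0, 1.5, 0.3]])
W = np.array([[0.1, -0.2, 0.0, 0.3],
              [1.0, 0.5, -0.5, 0.0],
              [0.0, 0.7, 0.2, -1.0]])   # shape (d+1, K), column c = class c

P = softmax(X @ W)                  # shape (n, K), each row sums to 1
predicted = np.argmax(P, axis=1)    # most probable class per datapoint
```

For K = 2 and signals $(s, 0)$, the first softmax output equals $\sigma(s)$, which is one way to see that softmax generalizes the sigmoid.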



### Non-linear transforms

The models above are called "linear" because of the linearity of the signal:

$$w^T x = \sum_{i=0}^d w_i x_i$$

The above sum is linear in both $w_i$ and $x_i$, but since the $x_i$'s are just constants in this situation, modifying them does not break the linearity in $w_i$. The linearity in the variables $w_i$ is what makes these models _linear_. We can exploit this to transform the data nonlinearly. A classic example is shown below:

![title](imgs/nonlintransforms.png)

The data is transformed from X space into Z space by the transformation $\phi$ which just computes the squares. In Z space, we can see that the data becomes linearly separable. So we can use this "new" training data and train a linear model, and when we get a new datapoint we transform it similarly before prediction. 
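A sketch of this idea on made-up data: points on a small circle (class +1) and on a large circle (class -1) are not linearly separable in X space, but become so after squaring the features. The transform `phi`, the data and the use of the plain PLA in Z space are assumptions chosen for illustration.

```python
import numpy as np

def phi(X):
    # Map (x1, x2) to Z space: (1, x1^2, x2^2), including the constant 1
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

# 8 points on a circle of radius 0.5 (+1) and 8 on a circle of radius 2 (-1)
angles = np.arange(8) * np.pi / 4
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([0.5 * circle, 2.0 * circle])
y = np.concatenate([np.ones(8), -np.ones(8)])

# Train a linear model (here: the PLA) on the transformed data
Z = phi(X)
w = np.zeros(3)
for _ in range(1000):
    wrong = np.flatnonzero(np.sign(Z @ w) != y)
    if wrong.size == 0:
        break                        # all points correctly classified
    w += y[wrong[0]] * Z[wrong[0]]   # PLA update on the first mistake

def predict(X_new):
    # New points must be transformed the same way before prediction
    return np.sign(phi(X_new) @ w)

new_labels = predict(np.array([[0.1, 0.1], [3.0, 3.0]]))
```

The learned boundary is a hyperplane in Z space, which corresponds to an ellipse-like curve around the origin in the original X space.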