# Notebook 4: Logistic Regression

### Machine Learning Basic Module
Florian Walter, Tobias Jülg, Pierre Krack

Please obey the following implementation and submission guidelines.

## General Information About Implementation Assignments
We will use Jupyter Notebook for our implementation exercises. The task description is provided in the notebook, and the code is also run in the notebook. However, the implementation itself goes into additional files that are imported by the notebook. Do not put any implementation that you want to be graded in this notebook; put it only in the Python files at the marked positions. The content of such a Python file could, for example, look like this:
```python
def f():
    ########################################################################
    # YOUR CODE
    # TODO: Implement this function
    ########################################################################
    pass
    ########################################################################
    # END OF YOUR CODE
    ########################################################################
```
To complete the exercise, remove the `pass` statement and use only the space inside the `YOUR CODE` block for your solution. Do not change any other lines in the file; otherwise the submission is not valid.


### Imports

In [1]:
%reload_ext autoreload
%autoreload 2

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

width = 9.5
plt.rcParams['figure.figsize'] = [width, width / 1.618] 
plt.rcParams['figure.dpi'] = 100
UTNRED = "#f5735f"
UTNBLUE = "#0087dc"
mpl.rcParams['path.simplify'] = True

## Logistic Regression

The discriminant function of logistic regression looks very similar to that of the Perceptron, which we looked at in our first exercise:

$$\sigma(w^Tx + b)$$

with the features $x\in\mathbb{R}^d$ with dimensionality $d\in\mathbb{N}$, the weights $w\in\mathbb{R}^d$, the bias $b\in\mathbb{R}$ and

$$\sigma(a) = \frac{1}{1+e^{-a}}$$

The *Sigmoid* function $\sigma: \mathbb{R}\rightarrow (0, 1)$ is shown in the plot in the cell below. The difference from the Perceptron is the activation function: instead of a step function from -1 to 1, the sigmoid is a smooth transition from 0 to 1, so its output can also be interpreted as a probability.

In [None]:
%matplotlib inline
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

values = np.arange(-10, 10, 0.1)
plt.plot(values, sigmoid(values), color=UTNBLUE)
plt.xlabel('a')
plt.ylabel(r'$\sigma(a)$')  # raw string so that "\s" is not parsed as an escape sequence
plt.title('Sigmoid Function')
plt.show()

As you will see later, the sigmoid function is a special case of the softmax function, which is usually used as the last layer in deep neural networks for classification problems.
Thus, from a deep learning perspective, this model can already be seen as a neural network with a single layer and no hidden layers.
However, in this exercise we want to take the probabilistic perspective and figure out why the sigmoid function (also called the logistic function) arises naturally from probability theory.
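
The relation between sigmoid and softmax can be checked numerically. The following is a minimal sketch, assuming a two-class softmax in which class 1 receives the logit $a$ and class 0 receives the logit $0$; the helper `softmax` and the variable names are illustrative and not part of the exercise files.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    # Subtracting the row-wise maximum does not change the result
    # but avoids overflow in np.exp.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

a = np.linspace(-5.0, 5.0, 11)

# Assumed two-class setup: logit a for class 1, logit 0 for class 0.
logits = np.stack([a, np.zeros_like(a)], axis=-1)
p_class1 = softmax(logits)[:, 0]

# Both columns of numbers should agree up to floating point error.
print(np.allclose(p_class1, sigmoid(a)))
```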

Let's assume that we have two classes $K = \{0, 1\}$. $y$ denotes the class and $x$ is the feature variable.

> **Task 1** Show that the posterior probability for class 1 given features $x$, ($p(y=1|x)$) is equal to $\sigma(a)$ with
> 
> $$a = \ln \frac{p(x|y=1)p(y=1)}{p(x|y=0)p(y=0)}$$
> 
> Hint: Bishop 4.2 might help you to get started.
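
Before writing the proof, it can help to check the claim numerically. The sketch below uses made-up likelihood and prior values (they are assumptions for illustration only) and compares the posterior from Bayes' theorem with $\sigma(a)$. It is a sanity check, not a substitute for the derivation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical values of p(x|y=c) at one fixed x, and priors p(y=c).
likelihood = {0: 0.05, 1: 0.20}
prior = {0: 0.7, 1: 0.3}

# Posterior p(y=1|x) from Bayes' theorem.
joint1 = likelihood[1] * prior[1]
joint0 = likelihood[0] * prior[0]
posterior1 = joint1 / (joint1 + joint0)

# Log-odds a as defined in Task 1.
a = np.log(joint1 / joint0)

print(posterior1, sigmoid(a))
```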

Now, given that $p(y=1|x) = \sigma(a)$ with $a$ from above, we would like to show that $a$ can be written as $w^Tx + b$.
We assume Gaussian likelihoods $\mathcal{N}(x|\mu_c, \Sigma)$ with class means $\mu_c$ and the same covariance matrix $\Sigma$ for all classes. We do not fix our priors and denote them by $p(y=c)$. As above, assume that we only have two classes $c\in\{0, 1\}$.

$$p(x|y=c) = \mathcal{N}(x|\mu_c, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_c)^T \Sigma^{-1} (x - \mu_c) \right\}
$$


> **Task 2** Given that assumption, show that the following equality holds: 
>
> $$p(y=1|x) = \sigma(w^Tx + b)$$
>
> with
> 
> $$
> \begin{aligned}
> w &= \Sigma^{-1} (\mu_1 - \mu_0)\\
> b &= -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \ln \frac{p(y=1)}{p(y=0)}
> \end{aligned}
> $$



$$a = \ln \frac{p(x|y=1)p(y=1)}{p(x|y=0)p(y=0)} = ...$$
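
The formulas for $w$ and $b$ from Task 2 can be sanity-checked numerically before or after deriving them. The sketch below assumes arbitrary example values for $\mu_0$, $\mu_1$, $\Sigma$ and the priors, and evaluates the Gaussian density with a small hand-written helper `gaussian_pdf` (an illustrative name, not part of the exercise files).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_pdf(x, mu, cov):
    # Multivariate normal density N(x | mu, cov) for a single point x.
    d = x.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

# Assumed example parameters (shared covariance for both classes).
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
prior0, prior1 = 0.6, 0.4

# w and b exactly as stated in Task 2.
cov_inv = np.linalg.inv(cov)
w = cov_inv @ (mu1 - mu0)
b = (-0.5 * mu1 @ cov_inv @ mu1
     + 0.5 * mu0 @ cov_inv @ mu0
     + np.log(prior1 / prior0))

# Compare the Bayes posterior with sigma(w^T x + b) at a test point.
x = np.array([0.5, -1.0])
joint1 = gaussian_pdf(x, mu1, cov) * prior1
joint0 = gaussian_pdf(x, mu0, cov) * prior0
posterior_bayes = joint1 / (joint1 + joint0)
posterior_linear = sigmoid(w @ x + b)

print(posterior_bayes, posterior_linear)
```

The two printed values should agree up to floating point error. The agreement relies on the shared covariance, because only then do the quadratic terms in $x$ cancel.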





Since both $w$ and $b$ depend only on parameters that are determined by the data, we can also learn $w$ and $b$ directly.
With the bias rewriting trick, we can absorb the bias into the weight vector and simplify the model to $\sigma(w^Tx)$, where we only have to learn $w$.
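
The bias rewriting trick can be sketched as follows: append a constant feature $1$ to every $x$ and append $b$ to $w$. The arrays below are made-up example values, and the names `X_aug` and `w_aug` are illustrative.

```python
import numpy as np

# Example data: 3 samples with d = 2 features each (assumed values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -0.25])
b = 0.1

# Append a column of ones to X and the bias to w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

# w^T x + b for every sample equals the augmented product.
print(np.allclose(X @ w + b, X_aug @ w_aug))
```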

To train the model on a two-class dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ ($x_i\in \mathbb{R}^d$, $y_i\in \{0, 1\}$), we maximize the likelihood of the labels over the whole dataset:

$$
p(y|w) = \prod_{i=1}^{N} \sigma(w^Tx_i)^{y_i} (1- \sigma(w^Tx_i))^{1-y_i}
$$

Taking the negative log-likelihood gives an error function that is easy to differentiate and that we can minimize. This function is called the binary cross-entropy because it is the cross-entropy between the labels $y_i$ and the predicted probabilities $\sigma(w^Tx_i)$, summed over the dataset:

$$E(w) = -\ln p(y|w) = -\sum_{i=1}^{N}\left( y_i\ln(\sigma(w^Tx_i)) + ({1-y_i})\ln(1- \sigma(w^Tx_i))\right)$$
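
The error function $E(w)$ can be written in a few lines of NumPy. This is a minimal sketch, not the reference implementation of the exercise: the function name `binary_cross_entropy`, the clipping constant `eps`, and the toy data are assumptions. The last column of `X` is the constant feature from the bias trick.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(w, X, y, eps=1e-12):
    # Predicted probabilities sigma(w^T x_i); clipped to avoid log(0).
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy dataset with N = 4 samples (assumed values).
X = np.array([[0.5, 1.0],
              [-1.5, 1.0],
              [2.0, 1.0],
              [-0.3, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# With w = 0 every prediction is 0.5, so E(w) = N * ln(2).
loss_at_zero = binary_cross_entropy(np.zeros(2), X, y)
print(loss_at_zero, 4 * np.log(2))
```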


> **Task 3** Assume that $\mathcal{D}$ is a linearly separable dataset for 2-class classification, i.e. there exists
> a vector $w$ such that $\sigma(w^Tx)$ separates the classes. Show that the magnitude of the maximum likelihood parameter $w$
> of a logistic regression model approaches infinity ($||w||\rightarrow \infty$). Assume that $w$ contains the bias term.
>
> Hint: You can use the fact that $\sigma(a) \in (0, 1)$ for all $a\in\mathbb{R}$.
> 
> How can we modify the training process to prefer a $w$ of finite magnitude?
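
To build intuition for Task 3, the following sketch evaluates the negative log-likelihood on a small separable toy dataset while scaling a separating vector. The data, the vector `w_sep`, and the helper `nll` are assumptions for illustration. The observation does not replace the proof.

```python
import numpy as np

def nll(w, X, y):
    # Negative log-likelihood in a numerically stable form:
    # -ln sigma(z) = ln(1 + e^{-z}) and -ln(1 - sigma(z)) = ln(1 + e^{z}).
    z = X @ w
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

# Linearly separable 1D data; the second column is the bias feature.
X = np.array([[-2.0, 1.0],
              [-1.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# A vector that separates the classes: w^T x > 0 exactly for y = 1.
w_sep = np.array([1.0, 0.0])

scales = [1.0, 10.0, 100.0]
losses = [nll(s * w_sep, X, y) for s in scales]
for s, loss in zip(scales, losses):
    print(f"scale {s:6.1f}  ->  E(w) = {loss:.3e}")
```

In this toy setting the error keeps shrinking as the scale grows but never reaches zero, which is the behaviour the task asks you to explain.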

## Properties of the Sigmoid Function

### Derivative of the Sigmoid Function

> **Task 4** Show that the derivative of the Sigmoid function $\sigma(a) = \frac{1}{1+e^{-a}}$ can be written as
>
> $$\frac{\partial\sigma(a)}{\partial a} = \sigma(a)(1-\sigma(a))$$

Give your solution as TeX code in the cell below:



$$\frac{\partial \sigma(a)}{\partial a} = ... $$
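
A derived expression for the derivative can be checked against a numerical approximation. The sketch below compares a central finite difference with $\sigma(a)(1-\sigma(a))$; the step size `h` and the evaluation grid are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4.0, 4.0, 9)
h = 1e-5

# Central finite difference approximation of the derivative.
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2.0 * h)

# Closed form stated in Task 4.
analytic = sigmoid(a) * (1.0 - sigmoid(a))

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```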






### Deepening: Symmetrical Properties of the Sigmoid Function (optional)

Show that the following equality holds for the Sigmoid function:

$$\sigma(-a) = 1 - \sigma(a)$$

Why does this equation hold? What is the intuition behind it?
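
As a quick numerical check of the symmetry (not a proof), one can evaluate both sides on a grid of values. The grid below is an arbitrary choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-6.0, 6.0, 25)
lhs = sigmoid(-a)
rhs = 1.0 - sigmoid(a)

print(np.allclose(lhs, rhs))
```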