# Lecture 6

## Discussion of Student Feedback

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/88263ccb" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

## Gradient Descent

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/4cad6014" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

In the following cell we define the function `gradient_descent` as discussed in
the video. We will make use of the function `auto_diff` from the `ADtools`
library to compute all partial derivatives with respect to (`wrt`) the
parameters.

In [None]:
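library(ADtools)
gradient_descent <- function(f, params, fix = list(),
                             learning_rate = 0.01,
                             tol = 1e-6,
                             maxsteps = 10^3,
                             show = FALSE) {
  history <- rep(0, maxsteps)  # loss value at every step
  for (i in 1:maxsteps) {
    # evaluate f and its partial derivatives with respect to the parameters
    df <- auto_diff(f, at = append(params, fix), wrt = names(params))
    if (show) print(df@x)
    history[i] <- df@x
    delta <- learning_rate * as.numeric(df@dx)
    # gradient step; relist restores the original shape of the parameters
    params <- relist(unlist(params) - delta, params)
    if (max(abs(delta)) < tol) break  # stop when the update becomes tiny
  }
  # keep only the steps that were actually performed
  append(params, list(history = history[1:i]))
}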
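
To see what `auto_diff` contributes, the following sketch performs the same update rule in base R, but approximates the partial derivatives with central finite differences instead of automatic differentiation. The helpers `numeric_gradient` and `gradient_descent_fd` and the test function `g` are hypothetical names that are not part of `ADtools`; the sketch assumes that the parameters are a plain numeric vector.

```r
# Hypothetical helper: central finite-difference approximation of the gradient.
numeric_gradient <- function(f, params, h = 1e-6) {
  sapply(seq_along(params), function(j) {
    step <- replace(numeric(length(params)), j, h)  # h in coordinate j, 0 elsewhere
    (f(params + step) - f(params - step)) / (2 * h)
  })
}

# Same update rule as above, for a numeric parameter vector.
gradient_descent_fd <- function(f, params, learning_rate = 0.01,
                                tol = 1e-6, maxsteps = 10^3) {
  for (i in seq_len(maxsteps)) {
    delta <- learning_rate * numeric_gradient(f, params)
    params <- params - delta
    if (max(abs(delta)) < tol) break
  }
  params
}

# Quadratic test function with its minimum at (3, -1).
g <- function(params) sum((params - c(3, -1))^2)
fd_result <- gradient_descent_fd(g, params = c(0, 0), learning_rate = 0.1)
fd_result
```

Finite differences need two function evaluations per parameter and are only approximate, which is one reason to prefer automatic differentiation for larger models.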

### A Very Simple 2-Dimensional Example

Let us test our gradient descent function with the following simple example of a
function $f(\beta) = (\beta_1 - 1)^2 + (\beta_2 - 2)^2$ which has its minimum at
$\hat\beta_1 = 1$ and $\hat\beta_2 = 2$ and can be written in
code as `sum((params - c(1, 2))^2)`.

In [None]:
f <- function(params) sum((params - c(1, 2))^2)
result <- gradient_descent(f, params = list(params = c(1, 1)))
result$params

We see that gradient descent found the minimum approximately.
The result is not exact, because the loop stops as soon as every update is
smaller than `tol`; moreover, gradient descent can fluctuate around the minimum
if the learning rate remains constant.
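
How far from the minimum the iteration stops can be checked by hand for this example, since the gradient of $f$ is $2(\beta - (1, 2))$. The sketch below repeats the update with this hand-written gradient (an assumption made for illustration, the notebook itself uses `auto_diff`) and the default values `learning_rate = 0.01` and `tol = 1e-6`.

```r
target <- c(1, 2)
beta <- c(1, 1)                      # same starting point as above
learning_rate <- 0.01
tol <- 1e-6
steps <- 0
for (i in 1:10^3) {
  delta <- learning_rate * 2 * (beta - target)   # hand-written gradient
  beta <- beta - delta
  steps <- i
  if (max(abs(delta)) < tol) break               # same stopping rule as above
}
steps                  # number of steps that were performed
abs(beta - target)     # remaining distance to the minimum
```

The loop stops before the maximal number of steps, once the distance to the minimum is below `tol / (2 * learning_rate)`, i.e. below 5e-5 for these settings.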

### Linear Regression
Let us now prepare a linear regression example. We will generate the data and
perform the fit with 3 predictors but without an intercept.

In [None]:
n <- 10^3
p <- 3
X <- matrix(rnorm(n * p), nrow = n)
params <- rnorm(p)
Y <- X %*% params + 0.1*rnorm(n)

Our training set consists of `X` and `Y`.

In the following cell we define our loss function for linear regression.
We do not include the factor $\frac1{2\sigma^2}$ that contains the variance,
because it would only change the scaling of the loss function, but not the point
where it has its minimum. Note how we implement the sum over training examples
with vectors and matrices, i.e. $\frac1n\sum_{i=1}^n(y_i - x_i^T\beta)^2 =
\mathrm{mean}((Y - X \beta)^2)$, where the square is taken elementwise, $Y$ is
the vector of responses and $X$ is a matrix with rows containing the input
points.
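
A quick numerical check that the mean of the squared residuals in matrix form agrees with the explicit sum over training examples. The small data set and the names `X_small`, `Y_small` and `beta` are made up for this illustration.

```r
set.seed(1)
X_small <- matrix(rnorm(5 * 3), nrow = 5)   # 5 input points with 3 predictors
Y_small <- rnorm(5)
beta <- c(0.5, -1, 2)

# explicit sum over training examples
loss_loop <- 0
for (i in 1:nrow(X_small)) {
  loss_loop <- loss_loop + (Y_small[i] - sum(X_small[i, ] * beta))^2
}
loss_loop <- loss_loop / nrow(X_small)

# vectorized version with a matrix product
loss_vec <- mean((Y_small - X_small %*% beta)^2)

all.equal(loss_loop, loss_vec)
```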

In [None]:
lm_loss <- function(params, X, Y) mean((Y - X %*% params)^2)
result <- gradient_descent(f = lm_loss,
                           params = list(params = rnorm(p)),
                           fix = list(X = X, Y = Y),
                           show = T)
result$params

Because we used the `show = T` argument, our `gradient_descent` function shows
how the loss function is decreasing as gradient descent progresses. You may need
to scroll down in the output to see the result.

Let us compare the result we obtained with the standard function of R.
We use `-1` in the formula to exclude the intercept.

In [None]:
coef(lm(Y ~ X - 1))

Pretty close to our gradient descent result...
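
For linear regression the minimum of the mean squared error is also available in closed form through the normal equations $X^TX\beta = X^TY$. The sketch below uses freshly simulated data with hypothetical names (`X_demo`, `Y_demo`, `beta_true`), so that it runs on its own, and compares the closed-form solution to the result of `lm`.

```r
set.seed(42)
n_demo <- 200
X_demo <- matrix(rnorm(n_demo * 3), nrow = n_demo)
beta_true <- c(1, -2, 0.5)
Y_demo <- as.numeric(X_demo %*% beta_true) + 0.1 * rnorm(n_demo)

# solve the normal equations t(X) X beta = t(X) Y
beta_closed <- as.numeric(solve(crossprod(X_demo), crossprod(X_demo, Y_demo)))
beta_lm <- as.numeric(coef(lm(Y_demo ~ X_demo - 1)))

cbind(closed_form = beta_closed, lm = beta_lm)
```

Gradient descent is not needed for this particular loss, but it remains applicable when no closed-form solution exists, as for logistic regression.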

### Logistic Regression

We do the same now for logistic regression. We fit again without intercept.

First we prepare the data.

In [None]:
logistic <- function(x) 1/(1 + exp(-x))
Y <- logistic(-X %*% params) > runif(n)

The response `Y` above uses 0/1 coding, which we transform with `(2*Y - 1)`
to -1/1 coding to use the simple formula from the slides (see also Exercise
Q1).
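
That both codings lead to the same loss value can be checked numerically. The sketch below compares the loss in -1/1 coding with the usual cross-entropy form for 0/1 coding on a small simulated data set; the names `X_demo`, `beta`, `Y01` and `Ypm` are hypothetical and chosen for this illustration only.

```r
logistic <- function(x) 1 / (1 + exp(-x))
set.seed(3)
X_demo <- matrix(rnorm(20 * 2), nrow = 20)
beta <- c(1, -0.5)
z <- as.numeric(X_demo %*% beta)

Y01 <- as.numeric(logistic(z) > runif(20))   # 0/1 coding
Ypm <- 2 * Y01 - 1                           # -1/1 coding

loss_pm <- -sum(log(logistic(Ypm * z)))
loss_01 <- -sum(Y01 * log(logistic(z)) + (1 - Y01) * log(1 - logistic(z)))

all.equal(loss_pm, loss_01)
```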

In [None]:
lr_loss <- function(params, X, Y) -sum(log(logistic((2*Y - 1) * X %*% params)))
result <- gradient_descent(f = lr_loss, learning_rate = 1e-3,
                           params = list(params = rnorm(p, 1)),
                           fix = list(Y = Y, X = X))
result$params

Again we can compare to the result obtained with the standard method to see that
we got pretty close.

In [None]:
coef(glm(Y ~ X - 1, family = "binomial"))

### Multinomial Logistic Regression
In the following cell we again use the function `mvrnorm` from the `MASS`
library to sample input points for three classes from multivariate normal
distributions with different means.

In [None]:
set.seed(2)
library(MASS)
x1 <- mvrnorm(30, c(.5, 1), .3*diag(2))
x2 <- mvrnorm(30, c(.5, -1), .3*diag(2))
x3 <- mvrnorm(30, c(-1, 0), .3*diag(2))
x <- rbind(x1, x2, x3)
y <- c(rep(1, 30), rep(2, 30), rep(3, 30))
plot(x, col = y, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))

In the following cell we first define the conditional probability distribution
of multinomial logistic regression. Both `X` and `params` can be matrices. The
matrix `X` is our standard matrix with different input points as rows.
The matrix `params` is constructed from the vectors $\beta_1, \ldots, \beta_K$ as
columns. Thus, the matrix product `X %*% params` returns a matrix from which we can
compute the conditional probabilities of the different classes (in the different
columns) for every input vector (in the different rows).

In the second line we define the Bayes classifier function for input `X` and
parameters `params`, which computes the conditional probabilities for all inputs
and then iterates over all rows to pick the index with the maximal value.

Finally, we define the loss function for multinomial logistic regression.
We compute the first term $\sum_{i=1}^n x_i^T\beta_{y_i}$ of the loss function in a
somewhat funny way, as $\sum_{i=1}^n \sum_{k=1}^Kx_i^T\beta_kI(y_i = k)$, where
$I(y_i = k) = 1$ if the $i$-th training sample is in class $k$ and $I(y_i = k) =
0$ otherwise, because we can write this easily in matrix notation.
The function `model.matrix(~ factor(y) - 1)` computes all the necessary zeros and ones.
You may want to look at the output of `model.matrix(~ factor(c(1, 2, 3, 2)) - 1)`
to understand this better.
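
The indicator trick can be illustrated on a tiny example. The sketch below uses made-up inputs (`X_toy`, `params_toy` and `y_toy` are hypothetical names) and checks that the elementwise product with the label matrix yields the same value as picking $x_i^T\beta_{y_i}$ row by row.

```r
y_toy <- c(1, 2, 3, 2)
label_matrix <- model.matrix(~ factor(y_toy) - 1)
label_matrix          # one row per sample, one column per class

set.seed(4)
X_toy <- matrix(rnorm(4 * 2), nrow = 4)          # 4 samples, 2 predictors
params_toy <- matrix(rnorm(2 * 3), nrow = 2)     # columns are beta_1, beta_2, beta_3
scores <- X_toy %*% params_toy                   # entry [i, k] is x_i^T beta_k

# sum over i of x_i^T beta_{y_i}, computed in two ways
term_matrix <- sum(scores * label_matrix)
term_loop <- sum(sapply(1:4, function(i) scores[i, y_toy[i]]))
all.equal(term_matrix, term_loop)
```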

In [None]:
mr_conditional <- function(X, params) exp(X %*% params)/rowSums(exp(X %*% params))
mr_bayes_classifier <- function(X, params) apply(mr_conditional(X, params), 1, which.max)
mr_loss <- function(params, X, Y) {
    x_times_params <- X %*% params
    label_matrix <- model.matrix(~ factor(Y) - 1)
    1/length(Y) * (-sum(x_times_params * label_matrix) + sum(log(rowSums(exp(x_times_params)))))
}
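
A small usage example on made-up inputs may help to see what the first two functions return. The definitions are repeated so that the sketch runs on its own; `X_toy` and `params_toy` are hypothetical values chosen such that each input point clearly belongs to a different class.

```r
# copies of the definitions above, so that this cell runs on its own
mr_conditional <- function(X, params) exp(X %*% params)/rowSums(exp(X %*% params))
mr_bayes_classifier <- function(X, params) apply(mr_conditional(X, params), 1, which.max)

X_toy <- rbind(c(1, 0), c(0, 1), c(-1, -1))        # 3 input points as rows
params_toy <- cbind(c(2, 0), c(0, 2), c(-2, -2))   # beta_1, beta_2, beta_3 as columns

probs <- mr_conditional(X_toy, params_toy)
probs                                    # rows: input points, columns: classes
rowSums(probs)                           # every row sums to 1
mr_bayes_classifier(X_toy, params_toy)   # class with the largest probability per row
```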

Now we define an initial parameter value for the 3 beta vectors, run 5 steps of
gradient descent 10 times in a row, and plot after each run the intermediate
values of the parameter vectors and the classification results to see how
gradient descent progresses.

In [None]:
params <- matrix(rnorm(3*2), nrow = 2)
for (i in 1:10) {
    params <- gradient_descent(f = mr_loss, learning_rate = 1e-1,
                               maxsteps = 5,
                               params = list(params = params),
                               fix = list(X = x, Y = y))$params
    pred <- mr_bayes_classifier(x, params)
    plot(x, col = pred, xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5))
    points(x[y != pred,], pch = 4)
    arrows(rep(0, 3), rep(0, 3), params[1,], params[2,], col = 'red')
}

After only 50 steps, gradient descent will not have converged, as we can see
when we plot the learning curve of 500 iterations of gradient descent.

In [None]:
params <- matrix(rnorm(3*2), nrow = 2)
result <- gradient_descent(f = mr_loss, learning_rate = 1e-1,
                            maxsteps = 500,
                            params = list(params = params),
                            fix = list(X = x, Y = y))
plot(result$history, ylab = "training loss", xlab = "iteration")

The library `nnet` computes multinomial logistic regression in a different way
and finds different values for the coefficients, but it finds the same
predictions as the ones we found with gradient descent.
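
One reason why different coefficient values can give identical predictions is that the conditional probabilities do not change when the same vector is added to all $\beta_k$. The sketch below verifies this for random inputs; `softmax_prob`, `shift` and the other names are hypothetical and only used here.

```r
softmax_prob <- function(X, params) exp(X %*% params) / rowSums(exp(X %*% params))

set.seed(5)
X_demo <- matrix(rnorm(6 * 2), nrow = 6)
params_a <- matrix(rnorm(2 * 3), nrow = 2)   # columns are beta_1, beta_2, beta_3
shift <- c(0.7, -1.2)
params_b <- params_a + shift                 # adds the same vector to every column

all.equal(softmax_prob(X_demo, params_a), softmax_prob(X_demo, params_b))
```

The parameters of multinomial logistic regression are therefore not uniquely determined by the conditional probabilities, so two fitting procedures can agree on all predictions without agreeing on the coefficients.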

In [None]:
library(nnet)
pred2 <- predict(multinom(as.factor(y) ~ x - 1), type = "class")
mr_bayes_classifier(x, result$params) == pred2

### Regularization

In the following cell we define a function that takes any loss function and
returns a regularized loss function. Note that this function is somewhat
special. In contrast to other functions that return a value or a data frame,
this function returns a new function.

In [None]:
regularized_loss <- function(loss, lambda0 = 0, lambda1 = 0) {
    function(params, X, Y) {
        loss(params, X, Y) + lambda0 * sum(sqrt(params^2)) + lambda1 * sum(params^2)
    }
}
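
The returned function remembers the arguments `loss`, `lambda0` and `lambda1` of the call that created it. A minimal sketch with a hypothetical toy loss and toy data (`toy_loss`, `X_toy`, `Y_toy`, `l1_loss` and `l2_loss` are not part of the lecture code) shows that the regularized loss is the original loss plus the penalty terms. The definition is repeated so that the sketch runs on its own.

```r
# copy of the definition above, so that this cell runs on its own
regularized_loss <- function(loss, lambda0 = 0, lambda1 = 0) {
    function(params, X, Y) {
        loss(params, X, Y) + lambda0 * sum(sqrt(params^2)) + lambda1 * sum(params^2)
    }
}

toy_loss <- function(params, X, Y) mean((Y - X %*% params)^2)
X_toy <- diag(2)
Y_toy <- c(1, -2)
beta <- c(1, -2)                       # fits the toy data perfectly

l1_loss <- regularized_loss(toy_loss, lambda0 = 0.5)   # penalty on absolute values
l2_loss <- regularized_loss(toy_loss, lambda1 = 0.5)   # penalty on squares

toy_loss(beta, X_toy, Y_toy)   # 0
l1_loss(beta, X_toy, Y_toy)    # 0.5 * (|1| + |-2|) = 1.5
l2_loss(beta, X_toy, Y_toy)    # 0.5 * (1^2 + (-2)^2) = 2.5
```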

We will look at regularization again in a linear regression setting.

In [None]:
n <- 10^3
p <- 3
X <- matrix(rnorm(n * p), nrow = n)
params <- rnorm(p)
Y <- X %*% params + 0.1*rnorm(n)

Play in the following cell with different values for the regularization
hyper-parameters `lambda0` and `lambda1` and observe how the results are
changing.

In [None]:
result <- gradient_descent(regularized_loss(lm_loss, lambda1 = 5),
                           learning_rate = 1e-4,
                           params = list(params = rnorm(p)),
                           fix = list(X = X, Y = Y))
result$params
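
With only the quadratic penalty (`lambda1`), the minimum of the regularized mean squared error can be written in closed form as $(X^TX/n + \lambda_1 I)^{-1}X^TY/n$, which gives a reference value for the gradient descent result. The sketch below uses freshly simulated data and hypothetical names (`X_demo`, `Y_demo`, `ridge_solution`) and shows how the coefficients shrink towards zero as `lambda1` grows.

```r
set.seed(6)
n_demo <- 500
X_demo <- matrix(rnorm(n_demo * 3), nrow = n_demo)
Y_demo <- as.numeric(X_demo %*% c(1, -2, 0.5)) + 0.1 * rnorm(n_demo)

# minimizer of mean((Y - X b)^2) + lambda1 * sum(b^2)
ridge_solution <- function(X, Y, lambda1) {
  n <- nrow(X)
  as.numeric(solve(crossprod(X) / n + lambda1 * diag(ncol(X)),
                   crossprod(X, Y) / n))
}

# one column of coefficients per value of lambda1
sapply(c(0, 0.1, 1, 5), function(l) ridge_solution(X_demo, Y_demo, l))
```

If gradient descent returns values that differ from this reference, it may not have converged yet for the chosen learning rate and number of steps.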

You are now ready to do the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1104947).

## Exercises

### Conceptual

**Q1.** The formulas of logistic regression can be written in different forms
depending on how the two response classes are encoded. We will use the notation
$\sigma(x) = \frac1{1 + e^{-x}}$ for the logistic function.
First, we look at the case where $y \in \{0, 1\}$.

(a) Prove that $\mathrm{Pr}(Y = 0 | X = x, \beta) = \frac1{1 + e^{x^T\beta}} =
\sigma(-x^T\beta)$ if $\mathrm{Pr}(Y = 1 | X = x, \beta) = \sigma(x^T\beta)$.

(b) Prove that we can write the negative log-likelihood loss as $\mathcal
L(\beta) = -\sum_{i = 1}^n\left[y_i\log(\sigma(x_i^T\beta)) + (1 - y_i)\log(1 -
\sigma(x_i^T\beta))\right]$.

(c) Now we study the case where $y\in\{-1, 1\}$. Prove that we can write in this
case $\mathrm{Pr}(Y = y | X = x, \beta)  = \sigma(yx^T\beta)$.

(d) Prove that we can write in this case the negative log-likelihood loss as
$\mathcal L(\beta) = -\sum_{i = 1}^n\log(\sigma(y_ix_i^T\beta))$.

### Applied

**Q2.** Assume the noise in a linear regression setting comes from a Laplace
distribution, i.e. the conditional probability of the response is given by
$\mathrm{Pr}(Y = y | X = x, \beta) = \frac1{2s}\exp(-\frac{|y - x^T\beta|}{s})$.
For simplicity we assume throughout this exercise that the intercept is 0 and
does not need to be fitted.

(a) Generate a training set with 30 points, 3 predictors with values uniformly
distributed in $[0, 1]$ and responses with Laplace distributed noise with scale
parameter $s = 0.3$.
You can sample `n` noise values with mean zero and scale parameter $s$ of the
Laplace distribution with the following function:

In [None]:
rlapl <- function(n, s = 1) {
    p <- runif(n) - .5
    s*log(1 - 2*sqrt(p^2))*sign(p)
}
error <- rlapl(30, s = 0.3)

(b) Compute with paper and pencil the negative log-likelihood loss.

(c) Write the R-code to compute the loss on the training set for a given
parameter vector. You can complete the function in the cell below.
Use `sqrt(x^2)` to compute $|x|$.

In [None]:
laplace_lr_loss <- function(params, X, Y) {

}

(d) Perform gradient descent on the training set. Plot the learning curve to see
whether gradient descent has converged. If you see large fluctuations at the end
of training, decrease the learning rate. If the learning curve is not flat at
the end, increase the maximal number of steps.

(e) Estimate the coefficients with the `lm` function. Hint: do not forget that
we fit without intercept.

(f) Compare the test error of the results in (d) and (e) for a very large test
set, e.g. $10^6$ samples.