# Lecture 7

## Solving the XOR Problem without Feature Engineering

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/cc9cf01e" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

Let us create some XOR-problem data again.

In [None]:
library(MASS)
x1 <- mvrnorm(30, c(1.5, 1), .2*diag(2))
x2 <- mvrnorm(30, c(-1, -1), .2*diag(2))
x3 <- mvrnorm(30, c(-1, .8), .2*diag(2))
x4 <- mvrnorm(30, c(.9, -1), .2*diag(2))
data <- data.frame(X = rbind(x1, x2, x3, x4), Y = c(rep(1, 60), rep(0, 60)))
plot(data$X.1, data$X.2, col = data$Y + 2)

This is how we solved it two weeks ago.

In [None]:
relu <- function(x) ifelse(x > 0, x, 0)
features <- t(matrix(c(1, -1, 1, -1,
                       1, 1, -1, -1), ncol = 2))
newdata <- data.frame(H = relu(as.matrix(data[, c("X.1", "X.2")]) %*% features))
newdata$Y <- data$Y
fit <- glm(Y ~ ., newdata, family = "binomial")
pred <- predict(fit, type = "response") > .5
plot(data$X.1, data$X.2, col = pred + 2)
misclassified <- pred != data$Y
points(data[misclassified, 1:2], pch = 4)
arrows(rep(0, 4), rep(0, 4), features[1,], features[2,], col = 'red')
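
To see why these hand-crafted features work, it can help to apply them to four idealized cluster centers instead of the noisy samples. The following is a minimal sketch; the `centers` matrix is an assumption made for illustration and is not part of the data above.

```r
relu <- function(x) ifelse(x > 0, x, 0)
# same feature matrix as in the cell above: one column per feature vector
features <- t(matrix(c(1, -1, 1, -1,
                       1, 1, -1, -1), ncol = 2))
# idealized cluster centers (illustrative assumption, one row per cluster)
centers <- rbind(c(1, 1), c(-1, -1), c(-1, 1), c(1, -1))
# feature representation of each center: one row per center, one column per feature
H <- relu(centers %*% features)
print(H)
```

Each center activates exactly one feature: the two centers of class 1 activate features 1 and 4, the two centers of class 0 activate features 2 and 3. In this feature space the classes are therefore linearly separable, which is what logistic regression needs.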

And last week we defined gradient descent in the following way.

In [None]:
library(ADtools)
gradient_descent <- function(f, params, fix = list(),
                             learning_rate = 0.01,
                             tol = 1e-6,
                             maxsteps = 10^3,
                             show = F) {
  history <- rep(0, maxsteps)
  for (i in 1:maxsteps) {
    df <- auto_diff(f, at = append(params, fix), wrt = names(params))
    if (show) print(df@x)
    history[i] <- df@x
    delta <- learning_rate * as.numeric(df@dx)
    params <- relist(unlist(params) - delta, params)
    if (max(abs(delta)) < tol) break
  }
  append(params, list(history = history))
}

Although [the relu activation
function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) is more
commonly used in practice, we will use the soft-relu function here (also known
as the softplus or smooth-relu). The reason is simply that ADtools cannot handle
the relu function, but it can handle its soft version. Later in this notebook we
will also look at other libraries for fitting neural networks that can handle the
relu function. In the following plot you can see that the relu and the soft-relu
are close to each other, except near zero, where the soft-relu is smooth.

In [None]:
soft_relu <- function(x) log(1 + exp(x))
plot(c(), ylim = c(-.1, 5), xlim = c(-5, 5))
curve(relu, from = -5, to = 5, add = T)
curve(soft_relu, from = -5, to = 5, col = 'blue', add = T)
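
To put a number on "similar", one can compute the gap between the two functions on a grid. The sketch below redefines both functions so that it runs on its own; the grid range and step size are arbitrary choices.

```r
relu <- function(x) ifelse(x > 0, x, 0)
soft_relu <- function(x) log(1 + exp(x))
x <- seq(-10, 10, by = 0.01)
gap <- soft_relu(x) - relu(x)          # soft_relu lies above relu everywhere
max_gap <- max(gap)
x_at_max <- x[which.max(gap)]
print(c(max_gap = max_gap, x_at_max = x_at_max, log_2 = log(2)))
print(gap[c(1, length(gap))])          # gap at the two ends of the grid
```

The gap is largest at `x = 0`, where it equals `log(2)`, and it shrinks quickly as `x` moves away from zero in either direction.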

Let us define now the neural network `nn` that takes `X` as input, computes the
feature representation `soft_relu(X %*% w)` and produces the output by
multiplying the feature representation with the coefficients beta.

In [None]:
nn <- function(w, beta, X) soft_relu(X %*% w) %*% beta
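
A quick shape check can make the matrix notation more concrete. The following sketch uses small random matrices with hypothetical names (`X_demo`, `w_demo`, `beta_demo`) that are not used elsewhere in this notebook.

```r
soft_relu <- function(x) log(1 + exp(x))
nn <- function(w, beta, X) soft_relu(X %*% w) %*% beta
set.seed(1)
X_demo <- matrix(rnorm(5 * 2), nrow = 5)   # 5 samples, 2 input dimensions
w_demo <- matrix(rnorm(2 * 4), nrow = 2)   # 2 inputs -> 4 features
beta_demo <- matrix(rnorm(4), nrow = 4)    # 4 features -> 1 output
hidden <- soft_relu(X_demo %*% w_demo)     # feature representation
out <- nn(w_demo, beta_demo, X_demo)       # network output
print(dim(hidden))
print(dim(out))
```

Every row of `X` yields one row of features and one output value, so the whole data set is processed with two matrix multiplications.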

To run logistic regression with this neural network we define very similar
functions as last week, e.g. the loss function `nn_loss` differs from last
week's logistic regression loss only by the fact that the input to the logistic
function is `(2*Y - 1) * nn(w, beta, X)` instead of `(2*Y - 1) * X %*% params`.
Similarly we can define the Bayes classifier for the neural network.

In [None]:
logistic <- function(x) 1/(1 + exp(-x))
nn_loss <- function(w, beta, X, Y) -mean(log(logistic((2*Y - 1) * nn(w, beta, X))))
nn_bayes_classifier <- function(w, beta, X) logistic(nn(w, beta, X)) > .5
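
The `(2*Y - 1)` trick may look unfamiliar. The sketch below checks numerically, on made-up values of the network output `z`, that this form of the loss agrees with the binary cross-entropy written with probabilities; it relies only on the identity `1 - logistic(z) = logistic(-z)`.

```r
logistic <- function(x) 1/(1 + exp(-x))
set.seed(1)
z <- rnorm(10)               # stand-in for the network output nn(w, beta, X)
Y <- rbinom(10, 1, 0.5)      # made-up binary labels
# form used in nn_loss: the sign of z is flipped for samples with Y = 0
loss_margin <- -mean(log(logistic((2*Y - 1) * z)))
# binary cross-entropy with p = P(Y = 1)
p <- logistic(z)
loss_xent <- -mean(Y * log(p) + (1 - Y) * log(1 - p))
print(c(loss_margin, loss_xent))
```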

In the video it was mentioned that the initialization matters when fitting
neural networks with gradient descent. A common choice for initial guesses is
the "Glorot normal initialization" (named after Xavier Glorot, who as a student
published an [article](http://proceedings.mlr.press/v9/glorot10a.html) on this
topic with his supervisor). The
recipe is as simple as constructing the initial guesses from matrices with
normally distributed entries with mean zero and standard deviation
$\sqrt{\frac2{n_\mathrm{in} + n_\mathrm{out}}}$, where $n_\mathrm{in}$ is the
number of rows of the matrix and $n_\mathrm{out}$ is the number of columns of
the matrix.

In [None]:
glorot_normal <- function(n_in, n_out) matrix(rnorm(n_in * n_out)*sqrt(2/(n_in + n_out)), ncol = n_out)
glorot_normal(2, 4)
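
With a 2 x 4 matrix it is hard to see the standard deviation of the recipe. As a sanity check, the sketch below draws a larger matrix and compares the empirical standard deviation to the target; the size 100 x 100 is an arbitrary choice for illustration.

```r
glorot_normal <- function(n_in, n_out) matrix(rnorm(n_in * n_out)*sqrt(2/(n_in + n_out)), ncol = n_out)
set.seed(1)
W <- glorot_normal(100, 100)
target_sd <- sqrt(2 / (100 + 100))     # = 0.1 for this shape
print(dim(W))                          # n_in rows, n_out columns
print(c(empirical_sd = sd(W), target_sd = target_sd, empirical_mean = mean(W)))
```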

With this we are ready to apply our first neural network to the XOR problem.
It learns four 2-dimensional feature vectors and simultaneously performs
logistic regression on the resulting features.

In [None]:
X <- as.matrix(data[,1:2])
params <- list(w = glorot_normal(2, 4), beta = glorot_normal(4, 1))
result <- gradient_descent(nn_loss, params, learning_rate = 0.1,
                           fix = list(X = X, Y = data$Y),
                           show = T)

Let us plot the results.

In [None]:
pred <- nn_bayes_classifier(result$w, result$beta, X)
plot(X, col = pred + 2)
misclassified <- pred != data$Y
points(X[misclassified,], pch = 4)
arrows(rep(0, 4), rep(0, 4), result$w[1,], result$w[2,], col = 'red')

We see that the results are just as good as with our hand-crafted features.
The learned features are not exactly the same as our hand-crafted ones, but they
point in similar directions.

## From Artificial Neurons to Neural Networks

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/bd635e5c" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

There is no code in this section. Before you move on to the quiz, I recommend
that you think about the neural network we used to solve the XOR problem in terms
of number of neurons, number of layers, number of weights and number of biases.
You can now solve the first page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127).

## Regression with Neural Networks

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/291cf020" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

To include the bias in the matrix notation, we define in the following cell the
function `bias_input` that prepends a column of 1s to any matrix of activities.

In [None]:
bias_input <- function(X) cbind(rep(1, ifelse(is.null(nrow(X)), length(X), nrow(X))), X)
bias_input(matrix(rnorm(20), nrow = 5))
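
The sketch below illustrates why the column of 1s is equivalent to an explicit bias term: multiplying `bias_input(X)` with a weight vector whose first entry is the bias gives the same result as adding the bias separately. The numbers in `X_demo` and `w` are made up for illustration.

```r
bias_input <- function(X) cbind(rep(1, ifelse(is.null(nrow(X)), length(X), nrow(X))), X)
X_demo <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)  # 3 samples, 2 inputs
w <- c(0.5, -1, 2)                               # bias first, then one weight per input
with_bias_column <- bias_input(X_demo) %*% w
explicit <- w[1] + X_demo %*% w[-1]
print(cbind(with_bias_column, explicit))
# a plain vector is treated as a single input column
print(dim(bias_input(c(1, 2, 3))))
```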

We load again the life expectancy data.
This time we shift and scale the data, because this tends to give better results
with the `glorot_normal` initialization.
You can see the effect of the `scale` function by looking at the mean of the
resulting vector - which is up to numerical errors equal to 0 - and the standard
deviation, which is 1.

In [None]:
data <- na.omit(read.csv(file.path("..", "data", "life_expectancy.csv")))
X <- scale(data$GDP)
Y <- scale(data$LifeExpectancy)
print(mean(X))
print(sd(X))
head(X)
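
For reference, here is a small sketch of what `scale` computes and how to map scaled values back to the original units, for example to report predictions in years instead of standard deviations. The vector `gdp` contains made-up numbers, not the real data set.

```r
gdp <- c(1.2, 3.4, 0.8, 10.5, 5.1)            # made-up values for illustration
scaled <- scale(gdp)                          # one-column matrix with attributes
manual <- (gdp - mean(gdp)) / sd(gdp)         # the same computation by hand
center <- attr(scaled, "scaled:center")       # mean that was subtracted
spread <- attr(scaled, "scaled:scale")        # standard deviation that was divided by
recovered <- as.vector(scaled) * spread + center
print(cbind(gdp, manual, recovered))
```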

We now define a neural network with 1 input neuron, 8 soft-relu hidden neurons
and 1 output neuron. All neural activity matrices get a column of 1s for the
biases with the `bias_input` function. Because we are in a regression setting,
this time we use the mean squared error as the loss function.

In [None]:
nn_predict <- function(w1, w2, X) bias_input(soft_relu(bias_input(X) %*% w1)) %*% w2
nn_loss <- function(w1, w2, X, Y) mean((Y - nn_predict(w1, w2, X))^2)
params <- list(w1 = glorot_normal(2, 8), w2 = glorot_normal(9, 1))
result <- gradient_descent(nn_loss, params, fix = list(X = X, Y = Y),
                           learning_rate = 0.1, show = T)

Let us plot everything.

In [None]:
grid <- seq(-1, 6, length = 100)
pred <- nn_predict(result$w1, result$w2, grid)
plot(X, Y, xlab = "Scaled GDP", ylab = "Scaled LifeExpectancy")
lines(grid, pred, col = 'red', lwd = 2)

You can see that the result looks a bit like the one from a spline regression.
The difference to spline regression is that we did not need to select the knots,
but the useful features `soft_relu(bias_input(X) %*% w1)` were found by
adjusting the weights and biases in the matrix `w1`.

You can now solve the second page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127).

## Stochastic Gradient Descent

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/158f48f6" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

Stochastic gradient descent has the `batchsize` as an additional hyper-parameter,
which determines how many training samples are processed for every estimate of
the gradient. The indices of the training samples in the cell below are computed
by sliding through the data set of size `n`. After the last index `n`, the first
index of the training set is used again. For example, in a training set with 100
data points and batchsize 32, you can see below the indices of the first,
second and fourth batch.

In [None]:
indices <- function(i, batchsize, n) (seq((i - 1) * batchsize, i * batchsize - 1) %% n) + 1
print(indices(1, 32, 100))
print(indices(2, 32, 100))
print(indices(4, 32, 100))
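
Because the batches slide through the data in a fixed order without shuffling, every sample is used equally often in the long run. The sketch below counts the visits over 25 batches, which amounts to 25 * 32 = 800 = 8 * 100 samples, i.e. eight full passes through the data.

```r
indices <- function(i, batchsize, n) (seq((i - 1) * batchsize, i * batchsize - 1) %% n) + 1
# the fourth batch wraps around the end of the data set
print(indices(4, 32, 100))
# collect the indices of the first 25 batches and count how often each sample appears
visited <- unlist(lapply(1:25, function(i) indices(i, 32, 100)))
counts <- table(visited)
print(range(counts))
```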

Here is the full code for stochastic gradient descent.

In [None]:
stochastic_gradient_descent <- function(f, params, X, Y,
                                        learning_rate = 0.01,
                                        tol = 1e-6,
                                        batchsize = 32,
                                        maxsteps = 10^3,
                                        show = F) {
  history <- rep(0, maxsteps)
  for (i in 1:maxsteps) {
    idxs <- indices(i, batchsize, length(Y))
    at <- append(params, list(X = X[idxs,], Y = Y[idxs])) # evaluate only at idxs
    df <- auto_diff(f, at = at, wrt = names(params))
    if (show) print(df@x)
    history[i] <- df@x
    delta <- learning_rate * as.numeric(df@dx)
    params <- relist(unlist(params) - delta, params)
    if (max(abs(delta)) < tol) break
  }
  append(params, list(history = history))
}
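
To see the mini-batch update in isolation, here is a minimal sketch that does not depend on ADtools. It fits a straight line to synthetic data and uses a hand-written gradient of the mean squared error instead of `auto_diff`; the data, the learning rate and the number of steps are assumptions chosen for illustration.

```r
indices <- function(i, batchsize, n) (seq((i - 1) * batchsize, i * batchsize - 1) %% n) + 1
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + 0.1 * rnorm(n)       # synthetic data with true coefficients c(1, 2)
A <- cbind(1, x)                      # design matrix with a column of 1s for the bias
theta <- c(0, 0)                      # initial guess
learning_rate <- 0.05
for (i in 1:2000) {
  idxs <- indices(i, 32, n)
  residual <- as.vector(A[idxs, ] %*% theta) - y[idxs]
  # gradient of mean(residual^2) with respect to theta, on this mini-batch only
  grad <- 2 * as.vector(t(A[idxs, ]) %*% residual) / length(idxs)
  theta <- theta - learning_rate * grad
}
ols <- unname(coef(lm(y ~ x)))        # least-squares solution on the full data
print(rbind(sgd = theta, ols = ols))
```

The SGD estimate keeps fluctuating slightly around the least-squares solution, because every step sees a different subset of the data.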

Let us now fit the life expectancy data with stochastic gradient descent.
We run for more steps and with a smaller learning rate. Because every step now
uses only a fraction of the full data set and thus takes less time than in full
gradient descent, we can afford to run it for more steps. A smaller learning
rate may be useful to compensate for the larger fluctuations in the estimate of
the gradient.

In [None]:
result2 <- stochastic_gradient_descent(nn_loss, params, X = X, Y = Y,
                                       learning_rate = 0.02, show = T,
                                       maxsteps = 3000)

To compare gradient descent and stochastic gradient descent, we plot the
histories as a function of number of training samples processed. Because
gradient descent processes the full data set in every step we multiply the step
number with the length of the data set `length(X) * seq(10^3)`. In SGD every
step processes only 32 samples.

In [None]:
plot(length(X) * seq(10^3), result$history, log = "y", type = "l", lwd = 2, xlab = "training data processed", ylab = "training loss")
lines(seq(3000)*32, result2$history, col = 'blue')
moving_average <- function(x, n = 5) stats::filter(x, rep(1 / n, n)) # stats:: avoids masking, e.g. by dplyr::filter
lines(seq(3000)*32, moving_average(result2$history, 50), col = 'red')
legend("topright", c("gradient descent", "stochastic gradient descent", "moving average"), lty = 1, col = c("black", "blue", "red"))
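
The moving average is computed with the linear filter from the `stats` package, which by default uses a window centered on each point. The sketch below applies it to a short sequence to show the behaviour at the borders; the input `1:10` and the window size 3 are arbitrary.

```r
moving_average <- function(x, n = 5) stats::filter(x, rep(1 / n, n))
smoothed <- as.vector(moving_average(1:10, 3))
print(smoothed)
print(sum(is.na(smoothed)))   # points without a complete window
```

Points without a complete window are returned as `NA`, here `n - 1 = 2` of them, which is why the red curve in the plot above does not reach the borders of the x-range.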

You can now solve the third page of the
[quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1107127).

## Keras
There are nowadays great libraries to solve machine learning problems with
artificial neural networks. Our approach with ADtools is great for small scale
problems, but these libraries work also very well for large scale problems.
Very popular are
[pytorch](https://cran.r-project.org/web/packages/rTorch/index.html),
[tensorflow](https://tensorflow.rstudio.com/) and
[keras](https://keras.rstudio.com/).
In the following you will see how to use keras to solve the XOR problem.
We define once again some training data.

In [None]:
library(MASS)
x1 <- mvrnorm(30, c(1.5, 1), .2*diag(2))
x2 <- mvrnorm(30, c(-1, -1), .2*diag(2))
x3 <- mvrnorm(30, c(-1, .8), .2*diag(2))
x4 <- mvrnorm(30, c(.9, -1), .2*diag(2))
data <- data.frame(X = rbind(x1, x2, x3, x4), Y = c(rep(1, 60), rep(0, 60)))
plot(data$X.1, data$X.2, col = data$Y + 2)

Now load the library, define the network model and append some layers.

In [None]:
library(keras)

nn <- keras_model_sequential() %>%
      layer_dense(units = 4, activation = 'relu', input_shape = c(2)) %>%
      layer_dense(units = 1, activation = 'sigmoid') # append layers
summary(nn)
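
The parameter counts printed by `summary` can be checked by hand. The sketch below assumes that each dense layer has one weight per input-output pair plus one bias per unit, which is the default behaviour of `layer_dense`; the helper `dense_params` is a hypothetical name introduced here.

```r
# number of trainable parameters of a dense layer: weights plus biases
dense_params <- function(n_in, n_out) n_in * n_out + n_out
hidden <- dense_params(2, 4)   # 2 inputs -> 4 relu units
output <- dense_params(4, 1)   # 4 hidden units -> 1 sigmoid unit
total <- hidden + output
print(c(hidden = hidden, output = output, total = total))
```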

We made use again of the pipe operator `%>%` that you encountered already in
week 5.

Now we use the function `compile` that produces efficient code that can be used
for fitting afterwards with the loss function `binary_crossentropy` and the
stochastic gradient descent optimizer `optimizer_sgd` with learning rate `lr =
0.1`.

In [None]:
nn %>% compile(
    loss = 'binary_crossentropy',      # the same as our standard loss for logistic regression
    optimizer = optimizer_sgd(lr = .1) # stochastic gradient descent
) # specify loss function and optimization method

Now we fit the neural network to the data.
To compare with our results above, we use SGD in an unusual way here: we set the
batch size equal to the size of the training set (= 120, i.e. 4 clusters of 30
points); hence every step uses all the data and we effectively perform full
gradient descent.

In [None]:
history <- nn %>% fit(
    as.matrix(data[,1:2]),
    data$Y,
    batch_size = nrow(data), # 120 = full training set, so one epoch is one gradient descent step
    epochs = 1000,
    validation_split = 0 # use all data for training, none for validation.
)
plot(history)

pred <- as.vector(predict(nn, as.matrix(data[,1:2])) > 0.5)
plot(data[,1:2], col = pred+2)
misclassified <- pred != data$Y
points(data[misclassified, 1:2], pch = 4) # index the XOR data; X now holds the scaled GDP

You see that we got comparable results with `keras`.
Because it has so much useful built-in functionality, you will use `keras` again
in the upcoming lectures.

## Exercises

## Conceptual
**Q1**
To get a feeling for the kind of functions that can be fitted with neural networks, we will draw $y$ as a function of $x$ for some values of the weights. It may be helpful to sketch this neural network with the input neuron, the hidden neurons and the output neuron and label the connections with the weights.

(a) Draw in the same figure $a_1^{(1)} = g(w_{10}^{(1)} + w_{11}^{(1)} x)$, $a_2^{(1)}=g(w_{20}^{(1)} + w_{21}^{(1)} x)$ and $\bar y = w_0^{(2)} + w_1^{(2)}a_1^{(1)} +  w_2^{(2)}a_2^{(1)}$ as a function of $x$. Use $w_{10}^{(1)} = 0$, $w_{11}^{(1)} = 1$, $w_{20}^{(1)} = - 2$, $w_{21}^{(1)} = 2$,  $w_0^{(2)} = 1$, $w_1^{(2)} = 2$, $w_2^{(2)} = 1$ and use the rectified linear activation function $g = \mbox{relu}$. At which $x$-values does the slope change? Give the answer in terms of the weights.


(b) Draw a similar graph for $w_{11}^{(1)} < 0$ and $w_{10}^{(1)}/w_{11}^{(1)}<w_{20}^{(1)}/w_{21}^{(1)}$, e.g. with $w_{10}^{(1)} = 0$, $w_{11}^{(1)} = -1$, $w_{20}^{(1)} = 2$, $w_{21}^{(1)} = 2$,  $w_0^{(2)} = 1$, $w_1^{(2)} = 2$, $w_2^{(2)} = 1$.

(c) How does the graph in 1b change, if $w_1^{(2)}$ and $w_2^{(2)}$ become negative?

(d) Let us assume we add more neurons to the same hidden layer, i.e. we have $a_1^{(1)}, \ldots, a_{d^{(1)}}^{(1)}$ activations in the first layer. How would the graph look differently in this case? Draw a sketch and describe in one sentence the qualitative difference.

(e) Let us assume we add instead more hidden layers with relu-activations. Would the graph of this neural network look qualitatively different from the one in 1d?

(f) Show that a neural network with one hidden layer of 3 relu-neurons can perfectly fit any continuous piece-wise linear function of the form
                $$y = \left\{\begin{array}{ll}
                            a_1 + b_1 x & x < c_1 \\
                            a_2 + b_2 x & c_1 \leq x < c_2 \\
                            a_3 + b_3 x & c_2 \leq x
                        \end{array}\right.
                    $$
                    with $c_1 =  \frac{a_1 - a_2}{b_2 - b_1} < c_2 = \frac{a_2 - a_3}{b_3 - b_2}$. Express $a_1, a_2, a_3$ and $b_1, b_2, b_3$ in terms of the network weights. There are multiple solutions; find one of them.

**Q2**
Piecewise linear functions can also be fit with spline regression. Compare neural networks to spline regression. What are the advantages and disadvantages of the two approaches?

**Q3**

![Mean (red) plus/minus standard deviation (blue) of the response wage.scaled as
a function of the predictor age.scaled.](img/mean_plus_std_wage.png)

We can also use neural networks to find prediction intervals. For example one could fit the mean and the standard deviation of the response $y$ as a function of the predictor $x$ with a neural network as in the figure above.

To illustrate this approach, we will use a neural network to parametrize a
normal distribution $p(y) = \frac1{\sqrt{2\pi\sigma_w(x)^2}}\exp\left(-\frac{(y -
m_w(x))^2}{2\sigma_w(x)^2}\right)$.
Our network will have a one-dimensional input, two hidden neurons and two output
neurons, one for the mean $m_w(x)$ and one for the variance $\sigma_w(x)^2$,
where the subscript $w$ indicates that these functions depend on the weights of
the neural network.

(a) Write explicitly how the mean depends on the input and the weights.
$$m_w(x) = $$

(b) Write explicitly how the variance depends on the input and the weights.

(c) For fitting we would like to use the maximum-likelihood method under the assumption that the response follows a normal distribution. Give the formula for the negative log-likelihood loss function.


## Applied

**Q4** In this exercise you apply the theory you developed in exercise Q3.

(a) Write the R code for the loss function you found in Q3.(c)

(b) Fit a neural network with this loss function to the life expectancy data
set.

(c) Plot the data, the mean and the mean plus-minus one standard deviation as
predicted from the neural network.

**Q5** At the beginning of this lecture, you saw how to solve the XOR problem
with ADtools and later you saw how to solve the exact same task with keras.
In the subsection "Regression with Neural Networks" you saw an example where we
used ADtools to fit the life expectancy data set.
In this exercise you use `keras` to fit a neural network with
1 input neuron, 8 hidden neurons with relu activation function and 1 output
neuron to the life expectancy data set.