# Deep Learning for Computer Vision

---

**Goethe University Frankfurt am Main**

Winter Semester 2022/23

<br>

## *Assignment 2 (Regularization)*

---

**Points:** 10<br>
**Due:** 10.11.2022, 10 am<br>
**Contact:** Matthias Fulde ([fulde@cs.uni-frankfurt.de](mailto:fulde@cs.uni-frankfurt.de))<br>

---

**Your Name:** Tilo-Lars Flasche

<br>

<br>

## Table of Contents

---

- [1 L1 Regularization](#1-L1-Regularization-(5-Points))
  - [1.1 Implementation](#1.1-Implementation-(3-Points))
  - [1.2 Explanation](#1.2-Explanation-(2-Points))
- [2 L2 Regularization](#2-L2-Regularization-(5-Points))
  - [2.1 Implementation](#2.1-Implementation-(3-Points))
  - [2.2 Explanation](#2.2-Explanation-(2-Points))


<br>

## Setup

---

In this notebook we use only the **NumPy** library.

We import definitions of regularizers from the `regularization.py` module and enable autoreload, so that the imported functions are automatically updated whenever the code is changed.

In [1]:
import numpy as np

from regularization import L1_reg, L2_reg

%load_ext autoreload
%autoreload 2

<br>

## Exercises

---

### 1 L1 Regularization (5 Points)

---

In this exercise we want to implement **L1 regularization**. Here, the regularizer is the sum of the absolute values of the model's weights, defined as

$$
    R(W) = \sum_{i=1}^D \sum_{j=1}^K \vert W_{i,j} \vert.
$$

In order to control the effects of the regularization term, we introduce the regularization strength $\lambda$ as a hyperparameter. The complete loss for our model is then the sum of the data loss $\mathcal{L}$ and the regularization loss $R$, that is

$$
    J(W) = \mathcal{L}(W) + \lambda R(W).
$$


<br>

### 1.1 Implementation (3 Points)

---

Complete the definition of the `L1_reg` function in the `regularization.py` file.

The function takes a parameter matrix $W$ of shape $(D+1, K)$, where $K$ is the number of categories and $D$ is the dimension of the inputs. The last row is assumed to be the bias. The second parameter is the regularization strength.

The function should return a tuple $(R, dW)$ with the regularization loss $R$, computed only for the weights and not the bias, and the gradient of the loss $dW$ with respect to the parameters. So the loss $R$ is a scalar and $dW$ has the same shape as $W$.

Use only vectorized NumPy operations for the implementation. No loops are allowed.

<br>

#### 1.1.1 Test

To test your implementation, you can run the following code.

In [15]:
# Define dummy parameters.
W = np.array([
    [ 1.2,  3.6,  8.1],
    [ 4.0, -1.0,  3.6],
    [-9.6,  2.5, -6.3],
    [ 3.5, -7.2, -2.0]
])

# Compute regularization loss.
R, dW = L1_reg(W, 0.5)

In [16]:
# Compare loss.
loss_equal = abs(R - 19.95) < 1e-5

# Compare derivatives.
grad_equal = np.array_equal(dW, np.array([
    [ 0.5,  0.5,  0.5],
    [ 0.5, -0.5,  0.5],
    [-0.5,  0.5, -0.5],
    [ 0.0,  0.0,  0.0]
]))

# Show results.
print(loss_equal and grad_equal)

True


##### Answer

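First, let's look at how $R$ is calculated. It is the sum of the absolute values of all components of the weight matrix except the last row, which holds the bias.

$$ R = \vert w_{11} \vert + \dots + \vert w_{1k} \vert + \vert w_{21} \vert + \dots + \vert w_{2k} \vert + \dots + \vert w_{d1} \vert + \dots + \vert w_{dk} \vert $$

If we differentiate this sum with respect to $w_{ij}$, only the term $\vert w_{ij} \vert$ contributes, and its derivative is the sign of $w_{ij}$. At $w_{ij} = 0$ the absolute value is not differentiable; by convention we use the value $1$ there.

$$ \frac{\partial R}{\partial w_{ij}} = \begin{cases} 1 & w_{ij} \ge 0 \\ -1 & w_{ij} < 0 \end{cases} $$

The regularization term does not depend on the bias values, so the bias row of the gradient is zero. For the weights, the derivative of $R$ with respect to $W$ is a matrix where each component is one (if $w_{ij} \ge 0$) or minus one (if $w_{ij} < 0$).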

$$ \frac{\partial R}{\partial W} = \begin{pmatrix}
\frac{\partial R}{\partial w_{11}} & \dots & \frac{\partial R}{\partial w_{1k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial R}{\partial w_{d1}} & \dots & \frac{\partial R}{\partial w_{dk}}
\end{pmatrix} $$

If we now multiply $\frac{\partial R}{\partial w_{ij}}$ by the regularization strength $r$ (called $\lambda$ above), the result is $r$ or $-r$, depending on the sign of $w_{ij}$.

$$ r \cdot \frac{\partial R}{\partial w_{ij}} = \begin{cases} r & w_{ij} \ge 0 \\ -r & w_{ij} < 0 \end{cases} $$

If we multiply $\frac{\partial R}{\partial W}$ by $r$, we get a matrix where each component is $r$ or $-r$.

$$ r \cdot \frac{\partial R}{\partial W} = r \cdot \begin{pmatrix}
\frac{\partial R}{\partial w_{11}} & \dots & \frac{\partial R}{\partial w_{1k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial R}{\partial w_{d1}} & \dots & \frac{\partial R}{\partial w_{dk}}
\end{pmatrix}
= \begin{pmatrix}
r \cdot \frac{\partial R}{\partial w_{11}} & \dots & r \cdot \frac{\partial R}{\partial w_{1k}} \\
\vdots & \ddots & \vdots \\
r \cdot \frac{\partial R}{\partial w_{d1}} & \dots & r \cdot \frac{\partial R}{\partial w_{dk}}
\end{pmatrix}$$
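
The derivation above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the graded `L1_reg` from `regularization.py`: the name `l1_reg_sketch` is hypothetical, and the returned loss is assumed to already include the factor $r$, which is what the expected value in the test suggests. Note that `np.sign` returns $0$ for a weight that is exactly zero, which differs from the case distinction above only at that single point and is also a valid subgradient.

```python
import numpy as np

def l1_reg_sketch(W, r):
    """Hypothetical sketch of an L1 regularizer (not the graded L1_reg)."""
    # All rows except the last one are weights; the last row is the bias.
    weights = W[:-1]

    # Loss: r times the sum of the absolute weight values.
    R = r * np.sum(np.abs(weights))

    # Gradient: r * sign(w) for the weights, zero for the bias row.
    dW = np.zeros_like(W, dtype=float)
    dW[:-1] = r * np.sign(weights)

    return R, dW

# Same dummy parameters as in the test above.
W = np.array([
    [ 1.2,  3.6,  8.1],
    [ 4.0, -1.0,  3.6],
    [-9.6,  2.5, -6.3],
    [ 3.5, -7.2, -2.0]
])
R, dW = l1_reg_sketch(W, 0.5)
```

Slicing with `W[:-1]` keeps the computation fully vectorized while leaving the bias row out of both the loss and the gradient.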

<br>

### 2 L2 Regularization (5 Points)

---

In this exercise we want to implement **L2 regularization**. Here, the regularizer is the squared Euclidean norm of the model's weights, defined as

$$
    R(W) = \sum_{i=1}^D \sum_{j=1}^K W_{i,j}^2.
$$

Again, we have the regularization strength $\lambda$ as an additional hyperparameter, controlling by how much we restrict the model's parameters. The complete loss for our model is the sum of the data loss $\mathcal{L}$ and the regularization loss $R$, that is

$$
    J(W) = \mathcal{L}(W) + \lambda R(W).
$$


<br>

### 2.1 Implementation (3 Points)

---

Complete the definition of the `L2_reg` function in the `regularization.py` file.

The function takes a parameter matrix $W$ of shape $(D+1, K)$, where $K$ is the number of categories and $D$ is the dimension of the inputs. The last row is assumed to be the bias. The second parameter is the regularization strength.

The function should return a tuple $(R, dW)$ with the regularization loss $R$, computed only for the weights and not the bias, and the gradient of the loss $dW$ with respect to the parameters. So the loss $R$ is a scalar and $dW$ has the same shape as $W$.

Use only vectorized NumPy operations for the implementation. No loops are allowed.

**Solution:**

First, let's look at how $R$ is calculated. It is the sum of the squares of all components of the weight matrix except the last row, which holds the bias.

$$ R = w_{11}^2 + \dots + w_{1k}^2 + w_{21}^2 + \dots + w_{2k}^2 + \dots + w_{d1}^2 + \dots + w_{dk}^2 $$

If we differentiate this sum with respect to $w_{ij}$, only the term $w_{ij}^2$ contributes, so the derivative is two times $w_{ij}$.

$$ \frac{\partial R}{\partial w_{ij}} = 2 w_{ij} $$

The regularization term does not depend on the bias values, so the derivative of $R$ with respect to the bias is zero.

$$ \frac{\partial R}{\partial w_{D+1,j}} = 0 $$

For the weights, the derivative of $R$ with respect to $W$ is a matrix where each component is two times the corresponding weight.

$$ \frac{\partial R}{\partial W} = \begin{pmatrix}
\frac{\partial R}{\partial w_{11}} & \dots & \frac{\partial R}{\partial w_{1k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial R}{\partial w_{d1}} & \dots & \frac{\partial R}{\partial w_{dk}}
\end{pmatrix}
=
\begin{pmatrix}
2 w_{11} & \dots & 2 w_{1k} \\
\vdots & \ddots & \vdots \\
2 w_{d1} & \dots & 2 w_{dk}
\end{pmatrix} $$

If we now multiply $\frac{\partial R}{\partial w_{ij}}$ by the regularization strength $r$ (called $\lambda$ above), the result is $2rw_{ij}$.

$$ r \cdot \frac{\partial R}{\partial w_{ij}} = 2rw_{ij} $$

If we multiply $\frac{\partial R}{\partial W}$ by $r$, we get a matrix where each component is $2r w_{ij}$.

$$ r \cdot \frac{\partial R}{\partial W} = r \cdot \begin{pmatrix}
\frac{\partial R}{\partial w_{11}} & \dots & \frac{\partial R}{\partial w_{1k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial R}{\partial w_{d1}} & \dots & \frac{\partial R}{\partial w_{dk}}
\end{pmatrix}
=
r \cdot \begin{pmatrix}
2 w_{11} & \dots & 2 w_{1k} \\
\vdots & \ddots & \vdots \\
2 w_{d1} & \dots & 2 w_{dk}
\end{pmatrix}
=
\begin{pmatrix}
2r w_{11} & \dots & 2r w_{1k} \\
\vdots & \ddots & \vdots \\
2r w_{d1} & \dots & 2r w_{dk}
\end{pmatrix}$$
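
The derivation above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the graded `L2_reg` from `regularization.py`: the name `l2_reg_sketch` is hypothetical, and the returned loss is assumed to already include the factor $r$, which is what the expected value in the test suggests.

```python
import numpy as np

def l2_reg_sketch(W, r):
    """Hypothetical sketch of an L2 regularizer (not the graded L2_reg)."""
    # All rows except the last one are weights; the last row is the bias.
    weights = W[:-1]

    # Loss: r times the sum of the squared weight values.
    R = r * np.sum(weights ** 2)

    # Gradient: 2 * r * w for the weights, zero for the bias row.
    dW = np.zeros_like(W, dtype=float)
    dW[:-1] = 2.0 * r * weights

    return R, dW

# Same dummy parameters as in the test for this exercise.
W = np.array([
    [ 1.2,  3.6,  8.1],
    [ 4.0, -1.0,  3.6],
    [-9.6,  2.5, -6.3],
    [ 3.5, -7.2, -2.0]
])
R, dW = l2_reg_sketch(W, 0.5)
```

With $r = 0.5$ the factor $2r$ equals one, so the weight rows of the gradient coincide with the weights themselves.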

<br>

#### 2.1.1 Test

To test your implementation, you can run the following code.

In [21]:
# Define dummy parameters.
W = np.array([
    [ 1.2,  3.6,  8.1],
    [ 4.0, -1.0,  3.6],
    [-9.6,  2.5, -6.3],
    [ 3.5, -7.2, -2.0]
])

# Compute regularization loss.
R, dW = L2_reg(W, 0.5)

In [22]:
# Compare loss.
loss_equal = abs(R - 124.035) < 1e-5

# Compare gradient.
grad_equal = np.array_equal(dW, [
    [ 1.2,  3.6,  8.1],
    [ 4.0, -1.0,  3.6],
    [-9.6,  2.5, -6.3],
    [ 0.0,  0.0,  0.0]
])

# Show results.
print(loss_equal and grad_equal)

True


<br>

### 2.2 Explanation (2 Points)

---

Briefly describe in your own words how the L2 regularization affects the parameters of the model.

<br>

##### Answer

*Write your answer here.*

L2 regularization penalizes large weights quadratically rather than linearly. If a weight doubles, it contributes four times as much to the regularization loss as before. Because the gradient of the penalty is proportional to the weight itself, training pushes all weights towards small values. This push becomes weaker as a weight approaches zero, so weights typically become small but not exactly zero.
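
The shrinking effect can be illustrated with a small sketch. It assumes plain gradient descent on the regularization term alone, ignoring the data loss, and all values (`w0`, `r`, `lr`, `steps`) are hypothetical choices made only for this illustration.

```python
import numpy as np

# Hypothetical starting weights, regularization strength and learning rate.
w0 = np.array([4.0, -2.0, 0.5])
r = 0.1
lr = 0.5
steps = 20

w = w0.copy()
for _ in range(steps):
    # The gradient of r * sum(w**2) is 2 * r * w, so every step
    # multiplies each weight by the same factor (1 - 2 * lr * r).
    w = w - lr * 2.0 * r * w

print(w)
```

Each weight is scaled by the same factor below one in every step, so all weights move towards zero without reaching it exactly in this setting.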

As a numerical example, consider an input vector $x=(1,1,1,1)$ and two weight vectors $w_1 = (1,0,0,0)$ and $w_2 = (0.25, 0.25, 0.25, 0.25)$. Then
$$ w_1^Tx = w_2^T x = 1 $$
but $w_2$ would be the preferred solution when L2 regularization is applied, since
$$ 1 = \sum\limits_{i=1}^4 w_{1,i}^2 > \sum\limits_{i=1}^4 w_{2,i}^2 = 0.25 $$
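
The numbers in this example can be checked with a few lines of NumPy. The variable names are chosen only for this illustration.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Both weight vectors produce the same score for this input.
score1, score2 = w1 @ x, w2 @ x

# The L2 penalty is smaller for the vector that spreads its weight evenly.
penalty1, penalty2 = np.sum(w1 ** 2), np.sum(w2 ** 2)

print(score1, score2)
print(penalty1, penalty2)
```

Both vectors give the same score, but the evenly spread vector $w_2$ has the smaller squared norm and is therefore preferred under L2 regularization.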