# DS-GA 1008: Deep Learning, Spring 2020

# Homework Assignment 1

>He who learns but does not think is lost.
>
>He who thinks but does not learn is in great danger.
>
>Confucius (551 - 479 BC)

## 1. Backprop

Backpropagation, or “backward propagation of errors”, is a method that calculates the
gradient of the loss function of a neural network with respect to its weights.


### 1.1. Affine Module

The chain rule is at the heart of backpropagation.

Suppose we are given $x \in \mathbb{R}^2$ and that we use an affine transformation with parameters $W \in \mathbb{R}^{2 \times 2}$ and $b \in \mathbb{R}^2$ defined by:

> $$\mathbf y = \mathbf {W x} + \mathbf b$$ 
> (1)

* (a) Suppose an arbitrary cost function $C(y)$ and that we are given the gradient $\frac {\partial C}{\partial y}$.
  
  Give an expression for $\frac { \partial C}{\partial W}$ and $\frac { \partial C}{\partial b}$ in terms of $\frac { \partial C}{\partial y}$ and $x$ using the chain rule.

#### Solution:

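Using the chain rule, we can write an expression for each term as:

> $$\frac {\partial C}{\partial \mathbf W} = \frac {\partial C}{\partial \mathbf y} \frac {\partial \mathbf y}{\partial \mathbf W}$$
>
> Elementwise, $y_i = \sum_j W_{ij} x_j + b_i$, so $\frac {\partial y_i}{\partial W_{ij}} = x_j$ and $\frac {\partial C}{\partial W_{ij}} = \frac {\partial C}{\partial y_i} x_j$.
>
> Treating $\frac {\partial C}{\partial \mathbf y}$ as a column vector with the same shape as $\mathbf y$, this is an outer product with the same shape as $\mathbf W$:
>
> |Solution|
> |---|
> |$$\frac {\partial C}{\partial \mathbf W} = \frac {\partial C}{\partial \mathbf y} \mathbf x^\top$$|


> $$\frac {\partial C}{\partial \mathbf b} = \frac {\partial C}{\partial \mathbf y} \frac {\partial \mathbf y}{\partial \mathbf b}$$
>
> where $\frac {\partial \mathbf y}{\partial \mathbf b} = \mathbf I$ (the identity matrix), since $\frac {\partial y_i}{\partial b_j}$ is 1 when $i = j$ and 0 otherwise
>
> |Solution|
> |---|
> |$$\frac {\partial C}{\partial \mathbf b} = \frac {\partial C}{\partial \mathbf y}$$|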

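The result in (a) can be checked numerically. The sketch below is only an illustration: the quadratic cost `cost`, the random values, and the helper names `affine_backward` and `numerical_grad` are assumptions made for this example and are not part of the assignment. It treats $\frac {\partial C}{\partial \mathbf y}$ as a vector with the same shape as $\mathbf y$, so the gradient with respect to $\mathbf W$ is the outer product of $\frac {\partial C}{\partial \mathbf y}$ and $\mathbf x$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))
b = rng.normal(size=2)
x = rng.normal(size=2)


def cost(y):
    # Hypothetical cost, chosen only for illustration: C(y) = 0.5 * ||y||^2
    return 0.5 * np.sum(y ** 2)


def affine_backward(dC_dy, x):
    """Gradients of C w.r.t. W and b for y = W x + b, given dC/dy."""
    dC_dW = np.outer(dC_dy, x)  # entry (i, j) is dC/dy_i * x_j
    dC_db = dC_dy.copy()        # dy/db is the identity
    return dC_dW, dC_db


def numerical_grad(f, p, eps=1e-6):
    """Central finite-difference gradient of scalar f at array p."""
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        step = np.zeros_like(p)
        step[idx] = eps
        grad[idx] = (f(p + step) - f(p - step)) / (2 * eps)
    return grad


y = W @ x + b
dC_dy = y  # gradient of the illustrative cost 0.5 * ||y||^2
dC_dW, dC_db = affine_backward(dC_dy, x)

num_dW = numerical_grad(lambda W_: cost(W_ @ x + b), W)
num_db = numerical_grad(lambda b_: cost(W @ x + b_), b)

print(np.allclose(dC_dW, num_dW, atol=1e-6))
print(np.allclose(dC_db, num_db, atol=1e-6))
```

Both comparisons should agree up to finite-difference error. Any other differentiable cost could be substituted, as long as `dC_dy` is replaced by the matching gradient.
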
* (b) If we define a new cost $C_2 (y) = 3 C(y)$, do we know how our gradients with respect to $W$, $b$ change without knowing the particular form of $C(y)$ or $\frac {\partial C}{\partial y}$?

#### Solution:

From (a), we had:
> $$\frac {\partial C}{\partial \mathbf W} = \frac {\partial C}{\partial \mathbf y} \mathbf x^\top$$
>
> $$\frac {\partial C}{\partial \mathbf b} = \frac {\partial C}{\partial \mathbf y}$$

If $C_2 (y) = 3 C(y)$, then $\frac {\partial C_2}{\partial C} = 3$, and the chain rule gives:

> $$\frac {\partial C_2}{\partial \mathbf W} = \frac {\partial C_2}{\partial C} \frac {\partial C}{\partial \mathbf y} \mathbf x^\top$$
>
> |Solution|
> |---|
> |$$\frac {\partial C_2}{\partial \mathbf W} = 3 \frac {\partial C}{\partial \mathbf y} \mathbf x^\top$$|

> $$\frac {\partial C_2}{\partial \mathbf b} = \frac {\partial C_2}{\partial C} \frac {\partial C}{\partial \mathbf y}$$
>
> |Solution|
> |---|
> |$$\frac {\partial C_2}{\partial \mathbf b} = 3 \frac {\partial C}{\partial \mathbf y}$$|
>
> So yes: both gradients are scaled by 3, whatever the form of $C(y)$.
>

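The scaling argument in (b) can be illustrated with a few lines of NumPy. This is a minimal sketch: the upstream gradient `dC_dy` and the input `x` are made-up numbers, and `affine_grads` is a hypothetical helper, since the point is that the result does not depend on what $C$ actually is.

```python
import numpy as np


def affine_grads(dC_dy, x):
    """Gradients w.r.t. W and b for y = W x + b, given the upstream dC/dy."""
    return np.outer(dC_dy, x), dC_dy


x = np.array([1.0, -2.0])        # made-up input
dC_dy = np.array([0.5, 0.25])    # made-up upstream gradient for C

dW, db = affine_grads(dC_dy, x)

# For C2 = 3 * C, the upstream gradient is dC2/dy = 3 * dC/dy.
dW2, db2 = affine_grads(3.0 * dC_dy, x)

print(np.allclose(dW2, 3.0 * dW))
print(np.allclose(db2, 3.0 * db))
```
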
### 1.2. Softmax Module

The **softmax expression** is at the crux of **multi-class classification**.

After receiving $K$ unconstrained values in the form of a vector $x \in \mathbb{R}^K$, the softmax function **normalizes** these values to $K$ positive values that all **sum to 1**.

The softmax is defined as
> $$\mathbf y = \mathrm{softmax}_\beta (\mathbf x)$$
>
> $$y_k = \frac{\exp(\beta x_k)}{\sum_n \exp(\beta x_n)}$$
>(2)
>
> where
> 
> $\sum_{k=1}^K y_k = 1$,
>
> $y_k > 0$ for all $k$,
>
> $\exp(z) = e^z$

We usually let the softmax input $x \in \mathbb{R}^K$ be the output of a preceding module (some feature representation) whose input is a **datapoint** $d$.

Then we interpret $y_k$ as the **probability** that **datapoint** $d$ belongs to **class** $k$.
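
The definition in (2) can be sketched in NumPy as follows. The function name `softmax`, the default `beta=1.0`, and the example inputs are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition above; it leaves the result unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np


def softmax(x, beta=1.0):
    """Softmax with inverse temperature beta, following equation (2)."""
    z = beta * np.asarray(x, dtype=float)
    z = z - np.max(z)  # stability shift; cancels in the ratio below
    e = np.exp(z)
    return e / e.sum()


y = softmax([1.0, 2.0, 3.0])
print(y, y.sum())

# Large inputs would overflow exp() without the stability shift.
y_large = softmax([1000.0, 1000.0])
print(y_large)
```

The outputs are strictly positive and sum to 1, so they can be read as class probabilities; the largest input receives the largest probability.
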

Since this module can be connected to others in backprop using the chain rule, we just need
to compute the softmax's gradient in isolation.

What is the expression for $\frac {\partial y_i}{\partial x_j}$ ?

*(Hint: Answer differs when $i = j$ and $i \neq j$).*

#### Solution:

Let's rewrite the softmax function with summation index $n$, so that $j$ stays free for the derivative:

> $$y_i = \frac{e^{\beta x_i}}{\sum_{n=1}^K e^{\beta x_n}}$$
>
> $$= \frac{e^{\beta x_i}}{e^{\beta x_1} + e^{\beta x_2}+ \dots + e^{\beta x_j} + \dots + e^{\beta x_{K-1}} + e^{\beta x_K}}$$

Written this way, it is clear that $\frac {\partial y_i}{\partial x_j}$ has two cases:

1. When $i = j$, the variable $x_j$ appears in **both** places:
   
   * in the **numerator**: $e^{\beta x_i}$
   
   * and in the **denominator**, through the term $e^{\beta x_j}$


2. When $i \neq j$, the variable $x_j$ appears **only** in the denominator, through the term $e^{\beta x_j}$.
   
   The numerator $e^{\beta x_i}$ is a constant here, because we are taking the partial derivative w.r.t. $x_j$, which is different from $x_i$.


|Solution|
|--|
|$$\frac {\partial y_i}{\partial x_j} = \begin{cases} \beta \, y_i (1 - y_i) & \text{if } i = j \\ -\beta \, y_i y_j & \text{if } i \neq j \end{cases}$$|
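
The two cases can be checked against finite differences. In the sketch below, the helper names `softmax`, `softmax_jacobian`, and `numerical_jacobian`, and the test point, are assumptions for illustration. For general $\beta$ the chain rule contributes a factor of $\beta$ to every entry; with $\beta = 1$ the entries reduce to $y_i (1 - y_i)$ on the diagonal and $-y_i y_j$ off the diagonal.

```python
import numpy as np


def softmax(x, beta=1.0):
    z = beta * np.asarray(x, dtype=float)
    z = z - np.max(z)  # stability shift; does not change the result
    e = np.exp(z)
    return e / e.sum()


def softmax_jacobian(x, beta=1.0):
    """Analytic Jacobian: J[i, j] = beta * y_i * (delta_ij - y_j)."""
    y = softmax(x, beta)
    return beta * (np.diag(y) - np.outer(y, y))


def numerical_jacobian(x, beta=1.0, eps=1e-6):
    """Central finite-difference estimate of d y_i / d x_j."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (softmax(x + step, beta) - softmax(x - step, beta)) / (2 * eps)
    return J


x = np.array([0.5, -1.0, 2.0])  # arbitrary test point
beta = 2.0

J_analytic = softmax_jacobian(x, beta)
J_numeric = numerical_jacobian(x, beta)

print(np.allclose(J_analytic, J_numeric, atol=1e-7))
```

Each column of the Jacobian sums to zero, which reflects the constraint that the outputs always sum to 1.
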

**Derivation:**

*TODO: write it cleaner, without errors*

![](./img/hw1-sigmoid-partial-deriv.jpg)
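
A typeset sketch of the same derivation, using the quotient rule. The shorthand $S$ for the denominator is introduced here only for readability:

> $$S = \sum_{n=1}^K e^{\beta x_n}, \qquad y_i = \frac{e^{\beta x_i}}{S}, \qquad \frac{\partial S}{\partial x_j} = \beta e^{\beta x_j}$$
>
> Case $i = j$ (both numerator and denominator depend on $x_j$):
>
> $$\frac{\partial y_i}{\partial x_i} = \frac{\beta e^{\beta x_i} S - e^{\beta x_i} \, \beta e^{\beta x_i}}{S^2} = \beta \frac{e^{\beta x_i}}{S} - \beta \left( \frac{e^{\beta x_i}}{S} \right)^2 = \beta \, y_i (1 - y_i)$$
>
> Case $i \neq j$ (only the denominator depends on $x_j$):
>
> $$\frac{\partial y_i}{\partial x_j} = \frac{0 \cdot S - e^{\beta x_i} \, \beta e^{\beta x_j}}{S^2} = -\beta \frac{e^{\beta x_i}}{S} \frac{e^{\beta x_j}}{S} = -\beta \, y_i y_j$$

With $\beta = 1$ these reduce to $y_i (1 - y_i)$ and $-y_i y_j$.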