# Refresher: Derivatives, Gradients, Jacobians


#### Scalar-valued functions
Let $f$ be a scalar-valued function of a single variable, $f:\mathbb{R} \to \mathbb{R}$. The derivative of $f$ is defined as

$$ \frac{d f(x)}{d x} = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.$$
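
The limit can be checked numerically by plugging in a small but finite $h$. The sketch below is only an illustration: the helper `forward_difference` and the test function $f(x)=x^2$ are assumptions chosen for this example and are not part of the original notebook.

```python
def forward_difference(f, x, h=1e-6):
    """Approximate f'(x) with the difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# The exact derivative of x**2 at x = 3 is 2 * 3 = 6.
approx = forward_difference(square, 3.0)
print(approx)  # close to 6; the truncation error is proportional to h
```

Making `h` smaller reduces the truncation error only up to a point, because floating-point rounding in `f(x + h) - f(x)` eventually dominates.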


Now let $f$ be a scalar-valued function of several variables, $f:\mathbb{R}^m \to \mathbb{R}$. Its derivative is the vector of partial derivatives, the **gradient** $\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_m}\right)$. As in the one-dimensional case,
$\frac{\partial f(x)}{\partial x_i}$ tells us how fast $f(x)$ changes as $x_i$ increases while the other coordinates are held fixed. Strictly speaking, **gradients** are defined only for scalar-valued functions. For vector-valued functions the partial derivatives are collected in a matrix, the Jacobian.
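
As a rough illustration, each partial derivative can be estimated by perturbing one coordinate at a time. The helper `numerical_gradient` and the example function `sum_of_squares` below are hypothetical names introduced only for this sketch; the exact gradient of $\sum_i x_i^2$ is $2x$.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h  # perturb only coordinate i
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


def sum_of_squares(x):
    return np.sum(x ** 2)


x = np.array([1.0, -2.0, 0.5])
grad = numerical_gradient(sum_of_squares, x)
print(grad)  # approximately 2 * x
```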

#### Vector-valued functions
Let $f$ be a vector-valued function, $f:\mathbb{R}^n \to \mathbb{R}^m$, that maps $\vec{x}$ to $\vec{y}=f(\vec{x})$:

$$ f:\begin{bmatrix}
x_1\\
x_2\\
\vdots\\
x_n
\end{bmatrix}
\mapsto
\begin{bmatrix}
y_1\\
y_2\\
\vdots\\
y_m
\end{bmatrix}$$


Then the derivative of $\vec{y}=f(\vec{x})$ with respect to $\vec{x}$ is the $m \times n$ Jacobian matrix, whose entry in row $i$ and column $j$ is $\frac{\partial y_i}{\partial x_j}$:

$$ J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right) $$
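
To make the row and column layout concrete, here is a minimal sketch that builds a Jacobian column by column with central differences. The map `example_map` from $\mathbb{R}^2$ to $\mathbb{R}^3$ and the helper `numerical_jacobian` are assumptions made up for this example.

```python
import numpy as np


def numerical_jacobian(f, x, h=1e-6):
    """Central-difference estimate of the (m, n) Jacobian of f: R^n -> R^m at x."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    jac = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        # Column j holds the partial derivatives of every output w.r.t. x_j.
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return jac


def example_map(x):
    # y_1 = x_1 * x_2, y_2 = x_1 + x_2, y_3 = x_1 ** 2
    return np.array([x[0] * x[1], x[0] + x[1], x[0] ** 2])


x = np.array([2.0, 3.0])
J = numerical_jacobian(example_map, x)

# Closed-form Jacobian at (2, 3): rows are [x_2, x_1], [1, 1], [2 * x_1, 0].
J_exact = np.array([[3.0, 2.0], [1.0, 1.0], [4.0, 0.0]])
print(J.shape)  # m rows (outputs) by n columns (inputs)
print(np.allclose(J, J_exact, atol=1e-6))
```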
   



# Jacobian of Softmax

In [1]:
import numpy as np

def softmax_func(z):
    # Shift by the maximum for numerical stability. The subtraction is done
    # out of place so that the caller's array is not modified.
    z = z - np.max(z)
    exp_z = np.exp(z)
    return (exp_z.T / np.sum(exp_z, axis=0)).T

def softmax_grad(s):
    # Jacobian of softmax with respect to the logits (usually W_i * x).
    # s is the softmax output for the original input x, with s.shape == (n,),
    # e.g. s is approximately [0.27, 0.73] for x = np.array([0, 1]).
    # Start from diag(s) to get an (n, n) float array, then fill every entry.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

def softmax_grad_vec(softmax):
    # Vectorized equivalent of softmax_grad: diag(s) - s s^T.
    # Reshape the 1-D softmax to a column so that np.dot forms the outer product.
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

x = np.array([-1.3, 0.5, 2.1])

sx = softmax_func(x)
print('X:\n{0} \t \nS:\n{1}'.format(x, sx))
print('J:\n', softmax_grad(sx))

X:
[-1.3  0.5  2.1] 	 
S:
[0.02701699 0.16344326 0.80953975]
J:
 [[ 0.02628707 -0.00441574 -0.02187133]
 [-0.00441574  0.13672956 -0.13231381]
 [-0.02187133 -0.13231381  0.15418514]]


The **diagonal** of the softmax Jacobian is always **positive**, whereas all off-diagonal entries are **negative**: $J_{ii}=s_i(1-s_i)$ and $J_{ij}=-s_i s_j$ for $i \neq j$, with $0 < s_i < 1$.
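
The sign pattern, together with two related properties (symmetry, and columns summing to zero because the softmax outputs sum to one), can be checked numerically. The sketch below is self-contained and uses its own helper names `softmax` and `softmax_jacobian`, which are assumptions for this example; it uses the closed form $\mathrm{diag}(s) - s s^{T}$.

```python
import numpy as np


def softmax(z):
    # Out-of-place shift by the maximum for numerical stability.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def softmax_jacobian(s):
    # J[i, j] = s_i * (delta_ij - s_j), i.e. diag(s) - s s^T.
    return np.diag(s) - np.outer(s, s)


s = softmax(np.array([-1.3, 0.5, 2.1]))
J = softmax_jacobian(s)

# Boolean mask selecting every entry that is not on the diagonal.
off_diagonal = J[~np.eye(len(s), dtype=bool)]

print(np.all(np.diag(J) > 0))          # diagonal entries s_i * (1 - s_i)
print(np.all(off_diagonal < 0))        # off-diagonal entries -s_i * s_j
print(np.allclose(J, J.T))             # the Jacobian is symmetric
print(np.allclose(J.sum(axis=0), 0))   # each column sums to zero
```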



## Let's combine it all

Let $\vec{x}\in \mathbb{R}^n$, and let $f:\mathbb{R}^n \to \mathbb{R}^m$ and $g:\mathbb{R}^m \to \mathbb{R}$ be defined as $f(\vec{x})=\vec{x} \circ \vec{2}$ and $g(\vec{y})= \sum_i y_i$, where $\circ$ denotes the Hadamard (element-wise) product and $\vec{2}$ is the vector whose entries are all 2. In other words, $f$ doubles every component, so here $m=n$. The derivatives are computed as shown below.



$$ \frac{\partial f(\vec{x})}{\partial \vec{x}}=J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right) $$


$$ \frac{\partial g(\vec{y})}{\partial \vec{y} }=
   \left(\begin{array}{c}
   \frac{\partial l}{\partial y_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial y_{m}}
   \end{array}\right)$$

where $l=g(\vec{y})$ and $\vec{y}=f(\vec{x})$. Writing $v$ for this gradient vector, the chain rule gives the gradient of $l$ with respect to $\vec{x}$ as the vector-Jacobian product:

$$
J^{T}\cdot v=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\left(\begin{array}{c}
   \frac{\partial l}{\partial y_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial y_{m}}
   \end{array}\right)=\left(\begin{array}{c}
   \frac{\partial l}{\partial x_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial x_{n}}
   \end{array}\right)$$
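
The identity can be checked in PyTorch by building the Jacobian explicitly and comparing $J^{T} v$ with what `backward()` stores in `x.grad`. This is a minimal sketch under stated assumptions: the element-wise square used for `f` and the input values are made up for illustration, and `torch.autograd.functional.jacobian` is used only to materialize $J$ (autograd itself never forms the full matrix).

```python
import torch


def f(x):
    # Hypothetical element-wise map R^3 -> R^3.
    return x ** 2


def g(y):
    # Scalar-valued map R^3 -> R.
    return y.sum()


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Full Jacobian of f at x, shape (m, n) = (3, 3).
J = torch.autograd.functional.jacobian(f, x.detach())

# v = dl/dy for l = g(y); for a plain sum this is a vector of ones.
y = f(x)
v = torch.autograd.grad(g(y), y, retain_graph=True)[0]

# Explicit vector-Jacobian product versus the gradient accumulated by backward().
explicit = J.T @ v
g(y).backward()
print(explicit)
print(x.grad)
```

Both printed tensors should agree, since `backward()` computes the same product without ever storing $J$.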



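For the $f$ and $g$ defined above, $y_j = 2x_j$ and $l=\sum_j y_j$, so

$$\frac{\partial l}{\partial y_j}=1, \qquad \frac{\partial y_j}{\partial x_i}=\begin{cases}2 & \text{if } i=j\\ 0 & \text{if } i\neq j\end{cases}$$

$$\frac{\partial l}{\partial x_i}= \sum_j \frac{\partial y_j}{\partial x_i} \cdot \frac{\partial l}{\partial y_j} = 2$$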


In [2]:
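import torch
from torch import nn

x = torch.ones(2, 2, requires_grad=True)
print(x)

def f(x):
    # Element-wise scaling: y = 2 * x.
    return x * 2

def g(x):
    # Reduce to a scalar so that backward() needs no extra argument.
    return x.sum()

fx = f(x)
print(fx)
gfx = g(fx)
print(gfx)
gfx.backward()  # populates x.grad with the gradient of g(f(x)) w.r.t. x
print(x.grad)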

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
tensor([[2., 2.],
        [2., 2.]], grad_fn=<MulBackward0>)
tensor(8., grad_fn=<SumBackward0>)
tensor([[2., 2.],
        [2., 2.]])


In [3]:
y= x+2
y # AddBackward0 object indicating previous computation.

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)

In [4]:
z = y * y * 3
print(z) # MulBackward0 indicating previous computation.
o = z.mean() # to obtain a scalar value
o

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)


tensor(27., grad_fn=<MeanBackward0>)

In [5]:
x = torch.randn(3, requires_grad=True)

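y = x * 2
# Keep doubling until the norm reaches 1000, so that y = x * 2**k for some k.
# detach() reads the value without tracking the norm in the autograd graph.
while y.detach().norm() < 1000:
    y = y * 2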

print(y)

tensor([-334.0398, -726.5619,  751.9890], grad_fn=<MulBackward0>)


In [6]:
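# y is not a scalar, so backward() needs a vector v; it computes the
# vector-Jacobian product J^T v and stores the result in x.grad.
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)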

print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
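
In the run above the result is $1024\,v = 2^{10} v$: `y` is `x` doubled ten times, so the Jacobian is $2^{10}$ times the identity and $J^{T} v$ is just a rescaled `v`. The sketch below repeats the experiment with a fixed input instead of `torch.randn` so that the result is reproducible; the input values and the counter `doublings` are assumptions added for this example.

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

# y = x * 2**doublings, so the Jacobian dy/dx is 2**doublings times the identity.
y = x * 2
doublings = 1
while y.detach().norm() < 1000:
    y = y * 2
    doublings += 1

v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)  # accumulates J^T v into x.grad

# For a scaled identity Jacobian, J^T v is simply the scale factor times v.
expected = (2.0 ** doublings) * v
print(doublings)
print(x.grad)
print(torch.allclose(x.grad, expected))
```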


In [7]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


In [8]:
x = torch.randn((1,3), requires_grad=True)
x

tensor([[ 1.4336,  0.6889, -0.8934]], requires_grad=True)

In [9]:
softmax=nn.Softmax(dim=1)
l=softmax(x)

In [10]:
k=l*2
k=k.sum()

In [11]:
k.backward()
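
The cells above do not print the resulting gradient. Because each softmax row sums to 1, `k` is always 2 regardless of `x`, so the gradient should be zero up to floating-point error. The sketch below checks this with fixed example logits (an assumption, since the original uses `torch.randn`) and also compares the autograd Jacobian of softmax with the closed form $\mathrm{diag}(s) - s s^{T}$; the helper `softmax_1d` is a hypothetical name.

```python
import torch
from torch import nn

x = torch.tensor([[1.0, 0.5, -1.0]], requires_grad=True)  # arbitrary example logits
s = nn.Softmax(dim=1)(x)

k = (s * 2).sum()  # constant in x, because each softmax row sums to 1
k.backward()
print(x.grad)  # entries are zero up to rounding error


def softmax_1d(z):
    return torch.softmax(z, dim=0)


# Autograd Jacobian of softmax versus the closed form diag(p) - p p^T.
z = x.detach().squeeze(0)
J_autograd = torch.autograd.functional.jacobian(softmax_1d, z)
p = softmax_1d(z)
J_closed = torch.diag(p) - torch.outer(p, p)
print(torch.allclose(J_autograd, J_closed, atol=1e-6))
```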