### Gradient

- **Definition**: The gradient of a scalar-valued function $f(\textbf{x})$ is a column vector of partial derivatives with respect to each element of a column vector $\textbf{x}$.
  
- **Mathematical Expression**: If $f(\textbf{x})$ is a scalar function of $\textbf{x} = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}^T$, then the gradient is:
  $$
  \nabla f(\textbf{x}) = \begin{bmatrix} 
  \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \dots & \frac{\partial f}{\partial x_n}
  \end{bmatrix}^T
  $$

**Example**: If $f(x, y) = x^2 + 3xy$, then 
$$
\nabla f(x, y) = \begin{bmatrix} 
\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}
\end{bmatrix}^T
= \begin{bmatrix} 
2x + 3y & 3x
\end{bmatrix}^T
$$
Each component of this vector is the rate of change of $f$ with respect to one variable, and the vector as a whole points in the direction of steepest ascent at $(x, y)$.
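The gradient in this example can be sanity-checked without any autograd library by comparing it against central finite differences. The sketch below is only an illustration: the helper names `analytic_grad` and `numeric_grad`, the step size `h`, and the test point $(1, 2)$ are assumptions chosen for this check, not part of the original text.

```python
def f(x, y):
    # Scalar function from the example: f(x, y) = x^2 + 3xy
    return x**2 + 3 * x * y

def analytic_grad(x, y):
    # Gradient derived by hand: [2x + 3y, 3x]
    return [2 * x + 3 * y, 3 * x]

def numeric_grad(func, x, y, h=1e-6):
    # Central differences approximate each partial derivative
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

point = (1.0, 2.0)
expected = analytic_grad(*point)
approx = numeric_grad(f, *point)

# The two results should agree up to finite-difference error
assert all(abs(e - a) < 1e-5 for e, a in zip(expected, approx))
print(expected)  # [8.0, 3.0]
```

Because $f$ is quadratic, the central difference is exact up to floating-point rounding, so the tolerance here is generous.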


### Jacobian

- **Definition**: The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function.

- **Mathematical Expression**: For a vector-valued function $\mathbf{y}(\textbf{x}) = \begin{bmatrix} y_1 & y_2 \end{bmatrix}^T$, where $\mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^T$, the Jacobian matrix is:
  $$
  J = \begin{bmatrix} 
  \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\
  \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2}
  \end{bmatrix}
  $$

**Example**: If $\mathbf{f}(x, y) = \begin{bmatrix} x^2 + 3xy & 2x + y \end{bmatrix}^T$, the Jacobian matrix is:
$$
J = \begin{bmatrix} 
\frac{\partial (x^2 + 3xy)}{\partial x} & \frac{\partial (x^2 + 3xy)}{\partial y} \\
\frac{\partial (2x + y)}{\partial x} & \frac{\partial (2x + y)}{\partial y}
\end{bmatrix}
= \begin{bmatrix} 
2x + 3y & 3x \\
2 & 1
\end{bmatrix}
$$

This **$2 \times 2$ matrix** holds the rate of change of each component of the output vector $\mathbf{f}$ with respect to each input variable. Its first row is the transpose of the gradient of $x^2 + 3xy$ computed in the previous example.
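The Jacobian of the example function above can be checked the same way, one column per input variable. This is a minimal sketch under stated assumptions: the names `F`, `analytic_jacobian`, and `numeric_jacobian`, the step size `h`, and the test point $(1, 2)$ are hypothetical choices made for illustration.

```python
def F(x, y):
    # Vector-valued function from the example: [x^2 + 3xy, 2x + y]
    return [x**2 + 3 * x * y, 2 * x + y]

def analytic_jacobian(x, y):
    # Jacobian derived by hand; row i is the gradient of output i
    return [[2 * x + 3 * y, 3 * x],
            [2.0, 1.0]]

def numeric_jacobian(func, x, y, h=1e-6):
    # Perturb one input at a time; each perturbation fills one column
    fx_plus, fx_minus = func(x + h, y), func(x - h, y)
    fy_plus, fy_minus = func(x, y + h), func(x, y - h)
    return [
        [(fx_plus[i] - fx_minus[i]) / (2 * h),
         (fy_plus[i] - fy_minus[i]) / (2 * h)]
        for i in range(2)
    ]

J_expected = analytic_jacobian(1.0, 2.0)
J_approx = numeric_jacobian(F, 1.0, 2.0)

# Compare entry by entry
for row_e, row_a in zip(J_expected, J_approx):
    assert all(abs(e - a) < 1e-5 for e, a in zip(row_e, row_a))
print(J_expected)  # [[8.0, 3.0], [2.0, 1.0]]
```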


### Top-5 Matrix Calculus Rules ###

### Rule-1 ###

Given a function $f(x) = a^T x$, where:
- $a$ is an $n \times 1$ vector,
- $x$ is an $n \times 1$ vector,

the gradient of $f(x)$ with respect to $x$ is:

$$
\nabla_x f = a
$$
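
One way to see why this holds is to write the inner product in components and differentiate with respect to a single entry $x_k$ (the index $k$ is introduced here only for the derivation):

$$
f(x) = a^T x = \sum_{i=1}^{n} a_i x_i
\quad\Rightarrow\quad
\frac{\partial f}{\partial x_k} = a_k
$$

Stacking these partial derivatives for $k = 1, \dots, n$ into a column vector gives $\nabla_x f = a$.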




In [1]:
import torch
torch.manual_seed(47)

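# Random constant vector a and input x (both 2x1); x tracks gradients
a = torch.randn(2, 1)
x = torch.randn(2, 1, requires_grad=True)

def grad_f(x, a):
    # f = a^T x is a 1x1 tensor, so backward() needs no explicit gradient argument
    f = a.T @ x
    f.backward()
    return x.grad

# Rule 1: the gradient of a^T x with respect to x is a
expected_gradient = a
calculated_gradient = grad_f(x, a)

assert torch.allclose(expected_gradient, calculated_gradient)
print("Calculated Gradient:")
print(calculated_gradient.tolist())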

Calculated Gradient:
[[-1.4624308347702026], [0.7523223161697388]]


### Rule-2 ###

Given a function $f(x) = A x$, where:
- $A$ is an $m \times n$ matrix,
- $x$ is an $n \times 1$ vector,

the Jacobian of $f(x)$ with respect to $x$ is:

$$
\mathbf{J}_{f(x)} = A
$$
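
The result follows from the definition of the Jacobian applied entry by entry. Writing $f_i$ for the $i$-th component of $Ax$ (a symbol introduced here for illustration):

$$
f_i(x) = \sum_{j=1}^{n} A_{ij} x_j
\quad\Rightarrow\quad
J_{ij} = \frac{\partial f_i}{\partial x_j} = A_{ij}
$$

Every entry of the Jacobian therefore equals the corresponding entry of $A$, and the Jacobian has the same $m \times n$ shape as $A$.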


In [2]:
import torch
torch.manual_seed(47)

# Define A as a 2x3 matrix and x as a 3x1 vector
A = torch.randn(2, 3)
x = torch.randn(3, 1, requires_grad=True)

# f is a vector-valued function (in and out: vector)
def f(x):
    return A @ x

jacobian = (
    torch.autograd
    .functional
    .jacobian(f, x)
    .reshape(2, -1)
)

expected_jacobian = A
assert torch.allclose(jacobian, expected_jacobian)

### Rule-3

Given a function $f(x) = x^T A x$, where:
- $A$ is an $n \times n$ matrix,
- $x$ is an $n \times 1$ vector,

the gradient of $f(x)$ with respect to $x$ is:

$$
\nabla_x f = A x + A^T x
$$

#### Condition on $A$:
- If $A$ is **symmetric** ($A = A^T$), the gradient simplifies to $\nabla_x f = 2 A x$.
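
The general rule can be verified by expanding the quadratic form in components (the indices $i$, $j$, $k$ are introduced here for illustration). The variable $x_k$ appears once as the left factor and once as the right factor, which produces two sums:

$$
f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i A_{ij} x_j
\quad\Rightarrow\quad
\frac{\partial f}{\partial x_k} = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} A_{ik} x_i = (A x)_k + (A^T x)_k
$$

When $A = A^T$ the two sums are equal, which is where the factor of 2 in the symmetric case comes from.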


In [3]:
import torch
torch.manual_seed(47)

A = torch.randn(2, 2)
x = torch.randn(2, 1, requires_grad=True)

def grad_f(A, x):
    f = x.T @ A @ x
    f.backward()
    return x.grad

expected_gradient = A @ x + A.T @ x
calculated_gradient = grad_f(A, x)

assert torch.allclose(expected_gradient, calculated_gradient)
print(calculated_gradient.tolist())

[[-2.916311502456665], [0.7888590097427368]]


### Rule-4 ###

Given a function $f(x, y) = x^T A y$, where:
- $A$ is an $n \times n$ matrix,
- $x$ is an $n \times 1$ vector,
- $y$ is an $n \times 1$ vector,

the gradients of $f(x, y)$ with respect to $x$ and $y$ are:

$$
\nabla_x f = A y
$$

$$
\nabla_y f = A^T x
$$
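
Both gradients come from the same component expansion, differentiated once with respect to an entry of $x$ and once with respect to an entry of $y$ (the index $k$ is introduced here for illustration):

$$
f(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i A_{ij} y_j
\quad\Rightarrow\quad
\frac{\partial f}{\partial x_k} = \sum_{j=1}^{n} A_{kj} y_j = (A y)_k,
\qquad
\frac{\partial f}{\partial y_k} = \sum_{i=1}^{n} A_{ik} x_i = (A^T x)_k
$$

Setting $y = x$ and adding the two contributions recovers the gradient of the quadratic form $x^T A x$, namely $A x + A^T x$.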


In [4]:
import torch
torch.manual_seed(47)

A = torch.randn(2, 2)
x = torch.randn(2, 1, requires_grad=True)
y = torch.randn(2, 1, requires_grad=True)

def grad_f(A, x, y):
    f = x.T @ A @ y
    f.backward()
    return x.grad, y.grad

expected_grad_x = A @ y
expected_grad_y = A.T @ x
calculated_grad_x, calculated_grad_y = grad_f(A, x, y)

assert torch.allclose(expected_grad_x, calculated_grad_x)
assert torch.allclose(expected_grad_y, calculated_grad_y)


print("Calculated Gradient with respect to y:")
print(calculated_grad_y)

Calculated Gradient with respect to y:
tensor([[-2.9293],
        [ 1.1403]])


### Rule-5

Given a function $f(X) = a^T X b$, where:
- $a$ is an $n \times 1$ column vector,
- $X$ is an $n \times m$ matrix,
- $b$ is an $m \times 1$ column vector,

the gradient of $f(X)$ with respect to $X$ is:

$$
\nabla_X (a^T X b) = a b^T
$$
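
Here the gradient is a matrix with the same shape as $X$, whose $(k, l)$ entry is the partial derivative of $f$ with respect to $X_{kl}$. Expanding in components (the indices $k$ and $l$ are introduced here for illustration):

$$
f(X) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i X_{ij} b_j
\quad\Rightarrow\quad
\frac{\partial f}{\partial X_{kl}} = a_k b_l = (a b^T)_{kl}
$$

Collecting all entries gives the $n \times m$ outer product $a b^T$.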


In [5]:
import torch
torch.manual_seed(47)

a = torch.randn(3, 1)
b = torch.randn(2, 1)
X = torch.randn(3, 2, requires_grad=True)

def grad_f(X, a, b):
    f = a.T @ X @ b
    f.backward()
    return X.grad

calculated_grad_X = grad_f(X, a, b)
expected_grad_X = a @ b.T
assert torch.allclose(expected_grad_X, calculated_grad_X)
print(calculated_grad_X)

tensor([[-0.8419, -0.8833],
        [ 0.4331,  0.4544],
        [-0.9886, -1.0373]])
