## 2.4.1. Derivatives and Differentiation

#### Problem Statement

In science, engineering and beyond, we are often faced with functions that describe how one quantity depends on another—distance as a function of time, concentration as a function of volume, cost as a function of units produced, and so on.  The **derivative** of a function provides a precise way to measure the **instantaneous rate of change** of that quantity: it tells us, at any given point, how fast the output is increasing or decreasing with respect to its input.

For example, if $s(t)$ denotes the position of a car at time $t$, then the derivative $s'(t)$ is the car’s **velocity**.  If $P(t)$ models the size of a bacterial population, then $P'(t)$ gives the **growth rate**, which might accelerate or slow over time depending on nutrients or competition.  In economics, if $R(q)$ is the revenue from selling $q$ units of a product, then the derivative $R'(q)$ is the **marginal revenue**, the additional income earned by selling one more unit.

Derivatives also appear in many less obvious contexts:

- **Chemical kinetics**: if $C(v)$ describes the concentration of a reactant as a function of volume $v$, then $dC/dv$ measures how dilution changes concentration.  
- **Computer networks**: if $T(t)$ is the cumulative data transferred by time $t$, then $T'(t)$ is the **throughput** or instantaneous data rate.  
- **Geometry and graphics**: the slope of a curve $y=f(x)$ at a point, given by $f'(x)$, defines the direction of its tangent line and underlies algorithms for rendering smooth shapes.

Because so many practical problems reduce to “how fast is this changing?”, mastering the basic rules of differentiation—the constant, power, exponential, logarithm, sum, product and quotient rules, and so on—allows us to turn a wide variety of real‑world questions into straightforward mechanical calculations.

Put simply, a *derivative* is the rate of change in a function with respect to changes in its arguments. Derivatives can tell us how rapidly a loss function would increase or decrease were we to *increase* or *decrease* each parameter by an infinitesimally small amount. 

Formally, for functions $f : \mathbb{R} \rightarrow \mathbb{R}$ that map from scalars to scalars, the *derivative* of $f$ at a point $x$ is defined as

$$
f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.                
\tag{2.4.1}
$$




The term on the right-hand side is called a *limit*: it tells us what happens to the value of an expression as a specified variable approaches a particular value. Here, the limit tells us what the ratio of the change in the function value $f(x + h) - f(x)$ to the perturbation $h$ converges to as $h$ shrinks to zero.

When $f'(x)$ exists, $f$ is said to be *differentiable* at $x$; and when $f'(x)$ exists for all $x$ on a set, e.g., the interval $[a, b]$, we say that $f$ is differentiable on this set. Not all functions are differentiable, including many that we wish to optimize, such as accuracy and the area under the receiver operating characteristic curve (AUC). However, because computing the derivative of the loss is a crucial step in nearly all algorithms for training deep neural networks, we often optimize a differentiable *surrogate* instead.
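To make the idea of a non-differentiable point concrete, here is a minimal sketch using $|x|$ at $x = 0$. This is a standard illustration and is not taken from the text above; `difference_quotient` is a hypothetical helper introduced only for this example. The one-sided difference quotients settle on different values, so the limit in (2.4.1) does not exist there.

```python
def difference_quotient(func, x, h):
    """Return (func(x + h) - func(x)) / h for a given step h."""
    return (func(x + h) - func(x)) / h

# Shrinking step sizes 0.1, 0.01, ..., 0.00001
steps = [10.0 ** -k for k in range(1, 6)]

# Approach x = 0 from the right (h > 0) and from the left (h < 0)
right = [difference_quotient(abs, 0.0, h) for h in steps]
left = [difference_quotient(abs, 0.0, -h) for h in steps]

print('from the right:', right)
print('from the left: ', left)
```

The right-hand quotients all equal $1$ and the left-hand quotients all equal $-1$, so no single limiting value exists at $x = 0$.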

We can interpret the derivative $f'(x)$ as the instantaneous rate of change of $f(x)$ with respect to $x$. Let’s develop some intuition with an example. Define $u = f(x) = 3x^2 - 4x$.

In [None]:
def f(x):
    return 3 * x ** 2 - 4 * x

Setting $x = 1$, we see that $\frac{f(x+h) - f(x)}{h}$ approaches $2$ as $h$ approaches $0$. While this experiment lacks the rigor of a mathematical proof, we can quickly see that indeed $f'(1) = 2$.

In [None]:
import numpy as np
for h in 10.0**np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')

h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003

There are several equivalent notational conventions for derivatives. Given $y = f(x)$, the following expressions are equivalent:
$$
f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),
\tag{2.4.2}
$$
where the symbols $\frac{d}{dx}$ and  $D$  are *differentiation operators*.  
Below, we present the derivatives of some common functions:



#### 1. Constant Rule
$$
\frac{d}{dx} C = 0 \quad \text{for any constant } C, \tag{2.4.3} 
$$ 
**Proof:**
By definition,
$$
\frac{d}{dx} C = \lim_{h\to0}\frac{C - C}{h}
= \lim_{h\to0}\frac{0}{h}
= 0.
$$
Since the numerator is identically zero, the limit is zero.
**Example.**  
Let $f(x)=7$. Then $f'(x)=0.$

In [None]:
import sympy as sp

x = sp.symbols('x')
f = sp.Integer(7)  # wrap the constant so that it is a SymPy expression
sp.diff(f, x)
# → 0


#### 2. Power Rule
$$
\frac{d}{dx} x^n = nx^{n-1} \quad \text{for } n \neq 0 \tag{2.4.4}
$$
**Proof:**
**a)** $n$ a positive integer

Use the binomial expansion:
$$
(x+h)^n
= \sum_{k=0}^n \binom nk x^{\,n-k}h^k
= x^n + n x^{\,n-1}h + \sum_{k=2}^n\binom nk x^{\,n-k}h^k.
$$
Then
$$
\frac{(x+h)^n - x^n}{h}
= n\,x^{\,n-1} + O(h),
$$
so taking $h\to0$ gives
$$
\frac{d}{dx}x^n = n\,x^{\,n-1}.
$$

**b)** $n$ a negative integer

Write $n=-m$ with $m>0$, so that $x^n = 1/x^m$ for $x \neq 0$. By the quotient rule (with numerator $1$ and denominator $x^m$) together with part a),
$$
\frac{d}{dx}x^{-m}
= -\,x^{-2m}\,(m\,x^{\,m-1})
= -m\,x^{-(m+1)}
= n\,x^{\,n-1}.
$$
**Example.**  
Let $f(x)=x^5$. Then $f'(x)=5\,x^4$

In [None]:
import sympy as sp

x = sp.symbols('x')
f = x**5
sp.diff(f, x)
# → 5*x**4


#### 3. Exponential Rule
$$
\frac{d}{dx} e^x = e^x \tag{2.4.5}
$$
**Proof:**
Via power series

$$
e^x = \sum_{k=0}^\infty \frac{x^k}{k!}
\;\implies\;
\frac{d}{dx}e^x
= \sum_{k=1}^\infty \frac{k x^{k-1}}{k!}
= \sum_{j=0}^\infty \frac{x^j}{j!}
= e^x.
$$

Via the limit definition of the derivative, using the fact that $\lim_{h\to0}\frac{e^h-1}{h}=1$

$$
\frac{d}{dx}e^x
=\lim_{h\to0}\frac{e^{x+h}-e^x}{h}
=e^x\lim_{h\to0}\frac{e^h-1}{h}
=e^x\cdot1
= e^x.
$$
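The second argument relies on $\lim_{h\to0}(e^h-1)/h = 1$. The following is a small numerical sanity check of that limit, in the same spirit as the experiment for $f'(1)$ earlier; it is only an illustration, not a proof, and the variable names are chosen here for the example.

```python
import math

# Evaluate (e^h - 1) / h for h = 0.1, 0.01, ..., 0.00001
ratios = []
for k in range(1, 6):
    h = 10.0 ** -k
    ratio = (math.exp(h) - 1) / h
    ratios.append(ratio)
    print(f'h={h:.5f}, (e^h - 1)/h={ratio:.5f}')
```

The printed ratios move steadily closer to $1$ as $h$ shrinks, which is consistent with the limit used in the proof.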

**Example.**  
Let $f(x)=e^x$. Then $f'(x)=e^x.$

In [None]:
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)
sp.diff(f, x)
# → exp(x)


#### 4. Logarithm Rule
$$
\frac{d}{dx} \ln x = x^{-1} \tag{2.4.6}
$$
**Proof:**
Since $y = \ln x$ is the inverse of $x = e^y$,
$$
1 = \frac{d}{dx}(e^y)
  = e^y\frac{dy}{dx}
  = x\frac{dy}{dx}
\quad\Longrightarrow\quad
\frac{dy}{dx} = \frac1x.
$$
Hence
$$
\frac{d}{dx}\ln x = \frac1x.
$$
**Example.**  
Let $f(x)=\ln(x)$. Then $f'(x)=\frac1x.$

In [None]:
import sympy as sp

x = sp.symbols('x')
f = sp.log(x)
sp.diff(f, x)
# → 1/x



Functions composed from differentiable functions are often themselves differentiable.  
The following rules come in handy for working with compositions of any differentiable functions $ f $ and $ g $, and constant $ C $:







#### 1. Constant Multiple Rule
$$
\frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x) \quad \tag{2.4.7}
$$
**Proof:**
For any constant $C$:
$$
\frac{d}{dx}\bigl[C\,f(x)\bigr]
=\lim_{h\to0}\frac{C\,f(x+h)-C\,f(x)}{h}
=\lim_{h\to0}\frac{C\bigl[f(x+h)-f(x)\bigr]}{h}
=C\;\lim_{h\to0}\frac{f(x+h)-f(x)}{h}
=C\,f'(x).
$$
**Example.**  
Let $f(x)=5\,x^3$.  Then by the power rule,  
$$
f'(x)=5\cdot3\,x^2 = 15\,x^2.
$$

In [None]:
import sympy as sp

x = sp.symbols('x')
f = 5*x**3
sp.diff(f, x)
# → 15*x**2


#### 2. Sum Rule
$$
\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) \quad  \tag{2.4.8}
$$
**Proof:**
For any two differentiable functions $f$ and $g$:
$$
\frac{d}{dx}\bigl[f(x)+g(x)\bigr]
=\lim_{h\to0}\frac{[f(x+h)+g(x+h)]-[f(x)+g(x)]}{h}
=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}
  +\lim_{h\to0}\frac{g(x+h)-g(x)}{h}
=f'(x)+g'(x).
$$
**Example.**  
Let $f(x)=x^2 + 3x$.  Then $f'(x) = 2x + 3.$


In [None]:
import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x
sp.diff(f, x)
# → 2*x + 3


#### 3. Product Rule
$$
\frac{d}{dx} [f(x)g(x)] = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) \quad \tag{2.4.9}
$$
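**Proof:** (the fourth line uses $\lim_{h\to0} g(x+h) = g(x)$, which holds because a differentiable function is continuous)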
$$
\begin{aligned}
\frac{d}{dx}\bigl[f(x)g(x)\bigr]
&=\lim_{h\to0}\frac{f(x+h)g(x+h)-f(x)g(x)}{h}\\
&=\lim_{h\to0}\frac{f(x+h)g(x+h)-f(x)g(x+h)+f(x)g(x+h)-f(x)g(x)}{h}\\
&=\lim_{h\to0}\frac{\bigl[f(x+h)-f(x)\bigr]\,g(x+h)}{h}
  +\lim_{h\to0}\frac{f(x)\,\bigl[g(x+h)-g(x)\bigr]}{h}\\
&=\Bigl(\lim_{h\to0}\frac{f(x+h)-f(x)}{h}\Bigr)\,g(x)
  +f(x)\,\Bigl(\lim_{h\to0}\frac{g(x+h)-g(x)}{h}\Bigr)\\
&=f'(x)\,g(x)+f(x)\,g'(x).
\end{aligned}
$$
**Example.**  
Let $h(x)=x^2\sin x$.  Then $h'(x)=x^2\cos x + 2x\sin x.$

In [None]:
import sympy as sp

x = sp.symbols('x')
h = x**2 * sp.sin(x)
sp.diff(h, x)
# → x**2*cos(x) + 2*x*sin(x)


#### 4. Quotient Rule
$$
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} \quad \tag{2.4.10}
$$
**Proof:**
Assume $g(x)\neq0$.  Write
$$
\frac{f(x)}{g(x)} = f(x)\,\bigl[g(x)\bigr]^{-1}.
$$
Then by the product rule and the chain rule (derivative of $u^{-1}$ is $-u^{-2}u'$):
$$
\frac{d}{dx}\frac{f}{g}
=\frac{d}{dx}\bigl(f\cdot g^{-1}\bigr)
=f'\,g^{-1}+f\;\bigl(-g^{-2}g'\bigr)
=\frac{f'}{g}-\frac{f\,g'}{g^2}
=\frac{g\,f' - f\,g'}{g^2}.
$$

**Example.**  
Let $q(x)=\frac{x^2}{1+x}$.  Then  
$$
q'(x)
= \frac{(1+x)\cdot2x - x^2\cdot1}{(1+x)^2}
= \frac{2x+2x^2 - x^2}{(1+x)^2}
= \frac{x(2+x)}{(1+x)^2}.
$$

In [None]:
import sympy as sp

x = sp.symbols('x')
q = x**2/(1+x)
sp.diff(q, x)
# → x*(x + 2)/(x + 1)**2
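The examples above check each rule on one concrete function. As a complementary sketch, SymPy can also confirm the product and quotient rules for arbitrary (undefined) functions. The names `product_gap` and `quotient_gap` are introduced here only for illustration; each is the difference between the derivative SymPy computes and the right-hand side of (2.4.9) or (2.4.10).

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Function('f')(x)  # arbitrary differentiable function f(x)
g = sp.Function('g')(x)  # arbitrary differentiable function g(x)

# Left-hand side minus right-hand side of the product rule (2.4.9)
product_gap = sp.diff(f * g, x) - (f * sp.diff(g, x) + g * sp.diff(f, x))

# Left-hand side minus right-hand side of the quotient rule (2.4.10)
quotient_gap = sp.diff(f / g, x) - (g * sp.diff(f, x) - f * sp.diff(g, x)) / g**2

print(sp.simplify(product_gap))
print(sp.simplify(quotient_gap))
```

If both rules hold, each difference simplifies to zero.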


## 2.4.2. Visualization Utilities

In [None]:
!pip install d2l

In [None]:
%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l

In [None]:
# Saved in the d2l package for later use
def use_svg_display():
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')

In [None]:
def set_figsize(figsize=(3.5, 2.5)):
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize

In [None]:
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim),     axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()


In [None]:
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if X (tensor or list) has 1 axis
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)

    set_figsize(figsize)
    if axes is None:
        axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x,y,fmt) if len(x) else axes.plot(y,fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)


In [None]:
def f(x):
    return 3 * x ** 2 - 4 * x

In [None]:
x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
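The tangent line $2x - 3$ used in the plot comes from $y = f(1) + f'(1)(x - 1)$. The sketch below recomputes it symbolically; `f_expr`, `x0`, `slope` and `tangent` are names introduced here for illustration and are not part of the `d2l` utilities.

```python
import sympy as sp

x = sp.symbols('x')
f_expr = 3 * x**2 - 4 * x   # the function plotted above
x0 = 1                      # point of tangency

slope = sp.diff(f_expr, x).subs(x, x0)   # f'(1)
# Tangent line: f(x0) + f'(x0) * (x - x0)
tangent = sp.expand(f_expr.subs(x, x0) + slope * (x - x0))

print('slope:', slope)
print('tangent line:', tangent)
```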


In case you cannot run the code: I have already run it, and you can check the output here: https://colab.research.google.com/drive/1-NpioIlXCycAxQMpy_SMytaofpnISf8J?usp=sharing

## 2.4.3. Partial Derivatives and Gradients

Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of *many* variables. We briefly introduce notions of the derivative that apply to such *multivariate* functions.

Let $y = f(x_1, x_2, \dots, x_n)$ be a function of $n$ variables. The *partial derivative* of $y$ with respect to its $i^\mathrm{th}$ parameter $x_i$ is
$$
\frac{\partial y}{\partial x_i}
\;=\;
\lim_{h \to 0}
\frac{f(x_1,\dots,x_{i-1},x_i + h,\,x_{i+1},\dots,x_n) \;-\; f(x_1,\dots,x_i,\dots,x_n)}{h}.
\tag{2.4.11}
$$

To calculate $\frac{\partial y}{\partial x_i}$, we treat all other $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_{n}$ as constants and calculate the derivative of $y$ with respect to $x_i$. The following notational conventions for partial derivatives are all common and all mean the same thing:
$$
\frac{\partial y}{\partial x_i}
=
\frac{\partial f}{\partial x_i}
=
\partial_{x_i} f
=
\partial_i f
=
f_{x_i}
=
f_i
=
D_i f
=
D_{x_i} f.
\tag{2.4.12}
$$
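As a minimal sketch of "treat the other variables as constants", the example below differentiates a two-variable function with SymPy. The function $y = x_1^2 x_2 + \sin x_2$ is an arbitrary choice made for this illustration and does not come from the text.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
y = x1**2 * x2 + sp.sin(x2)   # illustrative function of two variables

dy_dx1 = sp.diff(y, x1)  # x2 is treated as a constant
dy_dx2 = sp.diff(y, x2)  # x1 is treated as a constant

print(dy_dx1)
print(dy_dx2)
```

With respect to $x_1$, the term $\sin x_2$ behaves like a constant and drops out; with respect to $x_2$, the factor $x_1^2$ behaves like a constant coefficient.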
We can concatenate the partial derivatives of a multivariate function with respect to all its variables to obtain a vector called the *gradient* of the function. Suppose that the input of the function $f:\mathbb R^n\to\mathbb R$ is an $n$-dimensional vector $\mathbf x=[x_1,\dots,x_n]^T$ and the output is a scalar. The gradient of $f$ with respect to $\mathbf x$ is a vector of $n$ partial derivatives:
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}.
\tag{2.4.13}
$$

When there is no ambiguity $\nabla_{\mathbf x} f(\mathbf x)$ is typically replaced by $\nabla f(\mathbf x)$.
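A gradient can be approximated numerically by applying a difference quotient to one coordinate at a time. The sketch below is illustrative only: `scalar_fn` and `numerical_gradient` are hypothetical names, the test function $3x_1^2 + 5e^{x_2}$ is an arbitrary choice, and central differences are used here (rather than the one-sided quotient in the definition) because they are usually more accurate for a fixed step size.

```python
import numpy as np

def scalar_fn(x):
    """Illustrative scalar function 3*x1^2 + 5*exp(x2)."""
    return 3 * x[0] ** 2 + 5 * np.exp(x[1])

def numerical_gradient(func, x, h=1e-6):
    """Approximate the gradient with central differences, one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h  # perturb only coordinate i
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

point = np.array([1.0, 0.0])
approx = numerical_gradient(scalar_fn, point)
exact = np.array([6 * point[0], 5 * np.exp(point[1])])  # [6*x1, 5*exp(x2)]

print('numerical:', approx)
print('analytic: ', exact)
```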

The following rules come in handy for differentiating multivariate functions:

- Rule 1: For all $\mathbf A\in\mathbb R^{m\times n}$ we have $\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T$, and for all $\mathbf A\in\mathbb R^{n\times m}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A$.

Write
$$
\mathbf A
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix},
\quad
\mathbf x
=
\begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots \\
x_{n}
\end{bmatrix}.
$$
Then
$$
\mathbf A\mathbf x
=
\begin{bmatrix}
A_{11}x_{1} + A_{12}x_{2} + \dots + A_{1n}x_{n} \\
A_{21}x_{1} + A_{22}x_{2} + \dots + A_{2n}x_{n} \\
\vdots \\
A_{m1}x_{1} + A_{m2}x_{2} + \dots + A_{mn}x_{n}
\end{bmatrix}.
$$
Set $\mathbf f(\mathbf x) = \mathbf A\mathbf x$ with components $f_1, \dots, f_m$. For a vector-valued function, the gradient is taken to be the transpose of the matrix whose entry in row $i$, column $j$ is $\partial_{x_j} f_i$, so
$$
\nabla_{\mathbf x}\mathbf f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f_{1}(\mathbf x) & \partial_{x_2}f_{1}(\mathbf x) & \dots & \partial_{x_n}f_{1}(\mathbf x) \\
\partial_{x_1}f_{2}(\mathbf x) & \partial_{x_2}f_{2}(\mathbf x) & \dots & \partial_{x_n}f_{2}(\mathbf x) \\
\vdots & \vdots & \ddots & \vdots \\
\partial_{x_1}f_{m}(\mathbf x) & \partial_{x_2}f_{m}(\mathbf x) & \dots & \partial_{x_n}f_{m}(\mathbf x)
\end{bmatrix}^{T}
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix}^{T}
= \mathbf A^{T}.
$$
So for all $\mathbf A\in\mathbb R^{m\times n}$, we have $$\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T.$$
Similarly, for all $\mathbf A\in\mathbb R^{n\times m}$, we have $$\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A.$$
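Rule 1 can be checked numerically on a random matrix. The sketch below builds the matrix of partial derivatives of $\mathbf A\mathbf x$ column by column with forward differences and then transposes it, following the convention used in the derivation. The sizes `m`, `n`, the seed, and the step `h` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
h = 1e-6

# Column j holds the partial derivatives of A @ x with respect to x_j.
jacobian = np.zeros((m, n))
for j in range(n):
    step = np.zeros(n)
    step[j] = h
    jacobian[:, j] = (A @ (x + step) - A @ x) / h

# Convention from the derivation: the gradient is the transposed matrix.
gradient = jacobian.T

print(gradient.shape)
print(np.allclose(gradient, A.T, atol=1e-6))
```

Because $\mathbf A\mathbf x$ is linear in $\mathbf x$, the difference quotient matches $\mathbf A$ up to floating-point rounding, so the comparison should succeed for any reasonable step size.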
- Rule 2: For square matrices $\mathbf A\in\mathbb R^{n\times n}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A\mathbf x)=(\mathbf A+\mathbf A^T)\mathbf x$, and in particular $\nabla_{\mathbf x}\|\mathbf x\|^2=\nabla_{\mathbf x}(\mathbf x^T\mathbf x)=2\mathbf x$.

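Let $f(\mathbf x) = \mathbf x^T \mathbf A\mathbf x = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j$. By the product rule, each partial derivative $\partial_{x_k} f$ splits into two sums: the first differentiates the left factor $x_i$ while holding the right factor fixed (only the terms with $i=k$ survive), and the second differentiates the right factor $x_j$ while holding the left factor fixed (only the terms with $j=k$ survive). Then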
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^{n} \frac{\partial x_{1}A_{1j}x_{j}}{\partial x_1} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{i1}x_{1}}{\partial x_1}\\
\sum_{j=1}^{n} \frac{\partial x_{2}A_{2j}x_{j}}{\partial x_2} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{i2}x_{2}}{\partial x_2}\\
\vdots\\
\sum_{j=1}^{n} \frac{\partial x_{n}A_{nj}x_{j}}{\partial x_n} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{in}x_{n}}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^{n} A_{1j}x_{j} + \sum_{i=1}^{n} x_{i}A_{i1}\\
\sum_{j=1}^{n} A_{2j}x_{j} + \sum_{i=1}^{n} x_{i}A_{i2}\\
\vdots\\
\sum_{j=1}^{n} A_{nj}x_{j} + \sum_{i=1}^{n} x_{i}A_{in}
\end{bmatrix}
$$

$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\sum_{j=1}^{n} A_{1j}x_{j}\\
\sum_{j=1}^{n} A_{2j}x_{j}\\
\vdots\\
\sum_{j=1}^{n} A_{nj}x_{j}
\end{bmatrix}
+
\begin{bmatrix}
\sum_{i=1}^{n} x_{i}A_{i1}\\
\sum_{i=1}^{n} x_{i}A_{i2}\\
\vdots\\
\sum_{i=1}^{n} x_{i}A_{in}
\end{bmatrix}
=
\mathbf A \mathbf x + \mathbf A^{T} \mathbf x
=
(\mathbf A + \mathbf A^{T}) \mathbf x.
$$

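For the special case in Rule 2, write $\|\mathbf x\|^2 = \mathbf x^{T}\mathbf x = \mathbf x^{T}\mathbf I\mathbf x$, where $\mathbf I$ is the identity matrix. Applying Rule 2 with $\mathbf A = \mathbf I$ gives $\nabla_{\mathbf x}\|\mathbf x\|^2 = (\mathbf I + \mathbf I^{T})\mathbf x = 2\mathbf I\mathbf x = 2\mathbf x$.

Similarly, for any matrix $\mathbf X$, the squared Frobenius norm is the sum of the squared entries, $\|\mathbf X\|_F^2 = \sum_{i}\sum_{j} X_{ij}^2$. Differentiating with respect to a single entry gives $\frac{\partial}{\partial X_{ij}}\|\mathbf X\|_F^2 = 2X_{ij}$, because every other term of the sum is constant with respect to $X_{ij}$.

So, $\nabla_{\mathbf X}\|\mathbf X\|_F^2=2\mathbf X.$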
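Both results of this rule can be sanity-checked numerically. The following sketch compares central-difference gradients with the closed forms $(\mathbf A + \mathbf A^T)\mathbf x$ and $2\mathbf X$ on random inputs. `numerical_gradient` is a hypothetical helper written for this example, and the shapes and seed are arbitrary.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    """Central-difference gradient of a scalar function; works for arrays of any shape."""
    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x, dtype=float)
        step[idx] = h  # perturb a single entry
        grad[idx] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))   # square, not symmetric in general
x = rng.normal(size=4)
X = rng.normal(size=(3, 2))

quad_grad = numerical_gradient(lambda v: v @ A @ v, x)       # gradient of x^T A x
frob_grad = numerical_gradient(lambda M: np.sum(M ** 2), X)  # gradient of ||X||_F^2

print(np.allclose(quad_grad, (A + A.T) @ x, atol=1e-5))
print(np.allclose(frob_grad, 2 * X, atol=1e-5))
```

Using a non-symmetric `A` matters here: for a symmetric matrix the result collapses to $2\mathbf A\mathbf x$ and would not distinguish $(\mathbf A + \mathbf A^T)\mathbf x$ from other candidates.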

## 2.4.4. Chain Rule

### **4.1 Introduction to the Chain Rule**

The Chain Rule is a mathematical tool used to compute the derivative of composite functions.  
In deep learning, we deal with complex, nested functions across multiple layers. The Chain Rule enables us to compute gradients for optimization.

There are two main cases:

- **Single-variable functions**: $y = f(g(x))$
- **Multivariable functions**: $y = f(\mathbf{u})$, where $\mathbf{u} = g(\mathbf{x})$

### **4.2 Chain Rule for Single-Variable Functions**

**Formula:** If $y = f(u)$ and $u = g(x)$, and both $f$ and $g$ are differentiable at their respective points, then:

$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
$$


**Explanation:**

- $\frac{dy}{dx}$: Rate of change of $y$ with respect to $x$
- $\frac{dy}{du}$: Rate of change of $y$ with respect to $u$
- $\frac{du}{dx}$: Rate of change of $u$ with respect to $x$

The Chain Rule "chains" these derivatives together.

**Example: Find the derivative of the function $y = \sin(x^2)$.**

**Step 1: Identify the composition**

Let
$$
u = x^2, \quad \text{then} \quad y = \sin(u).
$$

**Step 2: Apply the Chain Rule**

By the Chain Rule:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

**Step 3: Compute each part**

- $\frac{dy}{du} = \cos(u)$  
- $\frac{du}{dx} = 2x$

**Step 4: Substitute back**

$$\frac{dy}{dx} = \cos(u) \cdot 2x = \cos(x^2) \cdot 2x$$
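The four steps above can be mirrored in SymPy as a quick check. The variable names (`inner`, `outer`, `chained`, `direct`) are chosen here for illustration; the sketch multiplies the two partial results and compares the product with differentiating $\sin(x^2)$ directly.

```python
import sympy as sp

x, u = sp.symbols('x u')
inner = x**2        # u = g(x)
outer = sp.sin(u)   # y = f(u)

dy_du = sp.diff(outer, u).subs(u, inner)  # cos(u) evaluated at u = x**2
du_dx = sp.diff(inner, x)                 # 2*x
chained = dy_du * du_dx                   # chain rule: dy/du * du/dx

direct = sp.diff(sp.sin(x**2), x)         # differentiate the composite directly

print(chained)
print(direct)
```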

### **4.3 Chain Rule for Multivariable Functions**

**Formula:**  
If $y = f(\mathbf{u})$, where $\mathbf{u} = (u_1, u_2, \dots, u_m)$, and each $u_i = g_i(\mathbf{x})$, with $\mathbf{x} = (x_1, x_2, \dots, x_n)$, then:

$$
\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}
$$

**Equivalently, in summation notation:**

$$
\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \cdot \frac{\partial u_j}{\partial x_i}
$$

**In vector form:**

Instead of handling one variable at a time, we can organize the derivatives into vectors and matrices:

**Gradient with respect to $\mathbf{x}$:**
$$
\nabla_{\mathbf{x}} y =
\begin{bmatrix}
\frac{\partial y}{\partial x_1} \\
\frac{\partial y}{\partial x_2} \\
\vdots \\
\frac{\partial y}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^n
$$

**Gradient with respect to $\mathbf{u}$:**
$$
\nabla_{\mathbf{u}} y =
\begin{bmatrix}
\frac{\partial y}{\partial u_1} \\
\frac{\partial y}{\partial u_2} \\
\vdots \\
\frac{\partial y}{\partial u_m}
\end{bmatrix}
\in \mathbb{R}^m
$$

**Matrix $\mathbf{A}$ (partial derivatives of $\mathbf{u}$ with respect to $\mathbf{x}$, arranged as the transpose of the Jacobian):**
$$
\mathbf{A} =
\begin{bmatrix}
\frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} & \cdots & \frac{\partial u_m}{\partial x_1} \\
\frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial u_1}{\partial x_n} & \frac{\partial u_2}{\partial x_n} & \cdots & \frac{\partial u_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{n \times m}
$$

**Final Vectorized Chain Rule:**

$$
\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y
$$

Where:  
- $\mathbf{A} \in \mathbb{R}^{n \times m}$ is the matrix whose entry in row $i$, column $j$ is the partial derivative $\frac{\partial u_j}{\partial x_i}$.  
- $\nabla_{\mathbf{x}} y$: Gradient of $y$ with respect to $\mathbf{x}$.  
- $\nabla_{\mathbf{u}} y$: Gradient of $y$ with respect to $\mathbf{u}$.

**Example:**

Suppose:

- $y = u_1^2 + u_2^2$  
- $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$

Compute $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.

**Step 1: Compute the partial derivatives**

- $\frac{\partial y}{\partial u_1} = 2u_1$  
- $\frac{\partial y}{\partial u_2} = 2u_2$  
- $\frac{\partial u_1}{\partial x_1} = 1$, $\frac{\partial u_1}{\partial x_2} = 1$  
- $\frac{\partial u_2}{\partial x_1} = 1$, $\frac{\partial u_2}{\partial x_2} = -1$

**Step 2: Apply the Chain Rule**

$$
\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1} \cdot \frac{\partial u_1}{\partial x_1} + \frac{\partial y}{\partial u_2} \cdot \frac{\partial u_2}{\partial x_1} = 2u_1 \cdot 1 + 2u_2 \cdot 1 = 2u_1 + 2u_2
$$

$$
\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial u_1} \cdot \frac{\partial u_1}{\partial x_2} + \frac{\partial y}{\partial u_2} \cdot \frac{\partial u_2}{\partial x_2} = 2u_1 \cdot 1 + 2u_2 \cdot (-1) = 2u_1 - 2u_2
$$

**Step 3: Substitute $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$**

$$
\frac{\partial y}{\partial x_1} = 2(x_1 + x_2) + 2(x_1 - x_2) = 4x_1
$$

$$
\frac{\partial y}{\partial x_2} = 2(x_1 + x_2) - 2(x_1 - x_2) = 4x_2
$$


**Vector Form:**

- Gradient: $\nabla_{\mathbf{x}} y = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix}$  
- Matrix $\mathbf{A}$ (row $i$, column $j$ holds $\frac{\partial u_j}{\partial x_i}$): $\mathbf{A} = \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} \\ \frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$  
- Gradient w.r.t. $\mathbf{u}$: $\nabla_{\mathbf{u}} y = \begin{bmatrix} 2u_1 \\ 2u_2 \end{bmatrix}$

**Verification:**

$$
\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2u_1 \\ 2u_2 \end{bmatrix} = \begin{bmatrix} 2u_1 + 2u_2 \\ 2u_1 - 2u_2 \end{bmatrix}
$$

Substitute $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$ to get:

$$
\nabla_{\mathbf{x}} y = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix}
$$

This matches our earlier result.
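The same verification can be carried out symbolically. The sketch below is one possible way to do it with SymPy: it builds $\mathbf A$ as the transposed Jacobian of $\mathbf u$ with respect to $\mathbf x$, multiplies it by $\nabla_{\mathbf u} y$, and substitutes $u_1$ and $u_2$. The variable names are chosen for this example.

```python
import sympy as sp

x1, x2, u1, u2 = sp.symbols('x1 x2 u1 u2')
y = u1**2 + u2**2
g = sp.Matrix([x1 + x2, x1 - x2])   # u_1 and u_2 as functions of x

grad_u = sp.Matrix([sp.diff(y, u1), sp.diff(y, u2)])  # gradient of y w.r.t. u
A = g.jacobian([x1, x2]).T          # A[i, j] = d u_j / d x_i

# Vectorized chain rule, then substitute u_1 and u_2 in terms of x
grad_x = (A * grad_u).subs({u1: g[0], u2: g[1]}).expand()

print(A)
print(grad_x)
```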

