Chain Rule Proof for Derivative of $y$ with Respect to $x_i$
Chain Rule Setup
Consider a function $y = f(\mathbf{u})$ where $\mathbf{u} = [u_1, u_2, \ldots, u_m]$ is a vector of intermediate variables, and each $u_j$ depends on the vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$. Thus, we have $u_j = g_j(\mathbf{x})$ for each $j = 1, 2, \ldots, m$.
The goal is to compute $\frac{\partial y}{\partial x_i}$, the rate at which $y$ changes with respect to the $i^\text{th}$ variable $x_i$. Because $y$ depends on $\mathbf{x}$ only through the intermediate variables $u_1, u_2, \ldots, u_m$, this derivative must account for how each $u_j$ responds to $x_i$.
Derivation
1. Define the Function
We have $y = f(u_1, u_2, \ldots, u_m)$, where $u_j = g_j(\mathbf{x})$ for each $j$.
2. Partial Derivative of $y$ with Respect to $x_i$
Assuming $f$ and each $g_j$ are differentiable, the multivariable chain rule gives $\frac{\partial y}{\partial x_i}$ as a sum over all intermediate variables $u_j$:
$\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i}$
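This equation expresses how $y$ changes with respect to $x_i$ via its dependence on the intermediate variables $u_1, u_2, \ldots, u_m$, each of which may depend on $x_i$. If some $u_j$ does not depend on $x_i$, then $\frac{\partial u_j}{\partial x_i} = 0$ and that term drops out of the sum.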
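The sum can be checked numerically. The following is a minimal sketch, not part of the derivation: the functions `f` and `g` are hypothetical examples chosen for illustration (with $m = 2$ and $n = 2$), and the chain rule result is compared against a central finite difference.

```python
import math

# Hypothetical example functions (illustrative only): n = 2 inputs, m = 2 intermediates.
def g(x):
    x1, x2 = x
    return [x1 * x2, math.sin(x1) + x2 ** 2]   # [u_1, u_2]

def f(u):
    u1, u2 = u
    return u1 ** 2 + 3.0 * u2                  # y

def dy_du(u):
    # Hand-derived partials of f: [dy/du_1, dy/du_2].
    u1, _ = u
    return [2.0 * u1, 3.0]

def du_dx(x):
    # Hand-derived partials of g: du_dx(x)[j][i] = d u_j / d x_i.
    x1, x2 = x
    return [[x2, x1],
            [math.cos(x1), 2.0 * x2]]

def chain_rule_partial(x, i):
    # dy/dx_i = sum over j of (dy/du_j) * (du_j/dx_i).
    u = g(x)
    grads_u = dy_du(u)
    jac = du_dx(x)
    return sum(grads_u[j] * jac[j][i] for j in range(len(u)))

def finite_difference_partial(x, i, h=1e-6):
    # Central difference on the composed function y(x) = f(g(x)).
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(g(xp)) - f(g(xm))) / (2.0 * h)

x0 = [0.7, -1.3]
for i in range(len(x0)):
    print(i, chain_rule_partial(x0, i), finite_difference_partial(x0, i))
```

The two printed values for each $i$ should agree to several decimal places, since the finite difference only approximates the derivative.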
3. Gradient of $y$
The gradient of $y$ with respect to $\mathbf{x}$ is the vector of partial derivatives of $y$ with respect to each $x_i$:
$\nabla_{\mathbf{x}} y = \left[ \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n} \right]^\top$
Using the chain rule, the gradient of $y$ with respect to $\mathbf{x}$ is:
$\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y$
where $\mathbf{A}$ is the $n \times m$ matrix whose entry in row $i$ and column $j$ is $\frac{\partial u_j}{\partial x_i}$, i.e., the transpose of the Jacobian of $\mathbf{u}$ with respect to $\mathbf{x}$. With this layout, row $i$ of the product $\mathbf{A} \nabla_{\mathbf{u}} y$ is exactly the sum $\sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i}$:
$\mathbf{A} = \begin{bmatrix}
\frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} & \cdots & \frac{\partial u_m}{\partial x_1} \\
\frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial u_1}{\partial x_n} & \frac{\partial u_2}{\partial x_n} & \cdots & \frac{\partial u_m}{\partial x_n}
\end{bmatrix}$
4. Matrix Product
The other factor in the product is the gradient of $y$ with respect to $\mathbf{u}$:
$\nabla_{\mathbf{u}} y = \left[ \frac{\partial y}{\partial u_1}, \frac{\partial y}{\partial u_2}, \ldots, \frac{\partial y}{\partial u_m} \right]^\top$
Multiplying $\mathbf{A}$ by this vector yields every partial derivative $\frac{\partial y}{\partial x_i}$ at once, which is the full gradient of $y$ with respect to $\mathbf{x}$.
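The matrix form can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a general implementation: the functions $u_1 = x_1 + x_2$, $u_2 = x_1 x_2$, $y = u_1 u_2$ are hypothetical, and $\mathbf{A}$ is taken to have one row per input $x_i$ and one column per intermediate variable $u_j$, so that the product $\mathbf{A} \nabla_{\mathbf{u}} y$ has $n$ entries.

```python
# Hypothetical setup (illustrative only): u1 = x1 + x2, u2 = x1 * x2, y = u1 * u2.

def grad_u(u):
    # Gradient of y = u1 * u2 with respect to u: [dy/du1, dy/du2].
    u1, u2 = u
    return [u2, u1]

def matrix_a(x):
    # A[i][j] = d u_j / d x_i (row per input, column per intermediate).
    x1, x2 = x
    return [[1.0, x2],
            [1.0, x1]]

def matvec(a, v):
    # Plain matrix-vector product; row i is the chain rule sum for x_i.
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def grad_x(x):
    x1, x2 = x
    u = [x1 + x2, x1 * x2]
    return matvec(matrix_a(x), grad_u(u))

print(grad_x([2.0, 3.0]))  # [21.0, 16.0]
```

For this choice, $y = x_1^2 x_2 + x_1 x_2^2$ in closed form, so the gradient at $(2, 3)$ is $[2 x_1 x_2 + x_2^2,\; x_1^2 + 2 x_1 x_2] = [21, 16]$, matching the product.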
Conclusion
The chain rule for multivariate functions tells us how to compute the derivative of a function $y$ with respect to an individual variable $x_i$ when $y$ is dependent on intermediate variables $u_j$, which are themselves dependent on $\mathbf{x}$. By summing over the contributions of all intermediate variables and using matrix multiplication, we efficiently compute the gradient of $y$ with respect to $\mathbf{x}$. This chain rule is foundational for backpropagation in deep learning and optimization.

The Intuition Behind Summing Derivatives in the Chain Rule

The intuition behind summing the derivatives in the chain rule (i.e., $\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i}$) comes from how multivariate functions depend on each other, especially when there are intermediate variables.

**1. Multivariate Functions are Interdependent**

In deep learning and multivariate calculus, we often deal with functions of many variables. For example, if you have a function $y = f(u_1, u_2, \ldots, u_m)$ where each $u_j$ is in turn a function of $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, then $y$ is indirectly affected by $\mathbf{x}$ through the intermediate variables $u_1, u_2, \ldots, u_m$.

So, to compute how $y$ changes when we change just one of the input variables $x_i$, we need to understand how each of these intermediate variables $u_j$ changes with respect to $x_i$.

**2. How Each Intermediate Variable Depends on $x_i$**

Each $u_j$ is a function of the inputs $x_1, x_2, \ldots, x_n$. Thus, when we change $x_i$ while holding the other inputs fixed, every intermediate variable $u_j$ that depends on $x_i$ changes as well.

The partial derivative $\frac{\partial u_j}{\partial x_i}$ measures how much $u_j$ changes per unit change in $x_i$; it is zero when $u_j$ does not depend on $x_i$. On its own it says nothing about $y$ yet: it is only the first link in the path from $x_i$ to $y$.

**3. How $y$ Depends on Each $u_j$**

Next, we need to understand how much $y$ changes with respect to each of the intermediate variables $u_j$. This is simply the partial derivative $\frac{\partial y}{\partial u_j}$, which tells us how sensitive $y$ is to a change in each $u_j$.

**4. Combining the Effects**

Now, to find how much $y$ changes with respect to $x_i$, we combine these two effects:
- The effect of changing $x_i$ on $u_j$, which is $\frac{\partial u_j}{\partial x_i}$.
- The effect of changing $u_j$ on $y$, which is $\frac{\partial y}{\partial u_j}$.

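Along each path, the two effects multiply: a change in $x_i$ is scaled by $\frac{\partial u_j}{\partial x_i}$ to give the change in $u_j$, which is then scaled by $\frac{\partial y}{\partial u_j}$ to give the change in $y$. Since $y$ depends on all the intermediate variables, we sum these products over all the $u_j$'s, which gives the partial derivative: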

$\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i}$

Each term $\frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i}$ represents the rate at which $y$ changes with $x_i$ along the single path that passes through the intermediate variable $u_j$.

**5. Why Sum the Contributions?**

The sum appears because each intermediate variable $u_j$ can influence $y$, and each intermediate variable $u_j$ is influenced by $x_i$. So, we need to add the total influence from all intermediate variables to compute the total influence of $x_i$ on $y$.

Think of it like this: If you have several "paths" through which a change in $x_i$ can affect $y$ (each path going through a different $u_j$), you need to account for each of these paths. The sum represents the total effect of all those paths combined.
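
One way to make the "paths" picture precise, assuming $f$ and each $g_j$ are differentiable, is a first-order approximation. A small change $\Delta x_i$, with all other inputs held fixed, shifts each intermediate variable by approximately:

$\Delta u_j \approx \frac{\partial u_j}{\partial x_i} \Delta x_i$

To first order, these shifts act on $y$ independently and additively:

$\Delta y \approx \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \Delta u_j \approx \left( \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \frac{\partial u_j}{\partial x_i} \right) \Delta x_i$

Dividing by $\Delta x_i$ and letting $\Delta x_i \to 0$ recovers the chain rule sum. The addition comes from the linear approximation of $f$, in which the contributions of the individual $\Delta u_j$ simply add up.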

**Example**

Imagine a simple case where $y = f(u_1, u_2)$, and $u_1 = g_1(x_1, x_2)$ and $u_2 = g_2(x_1, x_2)$.

- First, you compute how $y$ changes with respect to $u_1$ and $u_2$: $\frac{\partial y}{\partial u_1}$ and $\frac{\partial y}{\partial u_2}$.
- Then, compute how each of $u_1$ and $u_2$ changes with respect to $x_1$ and $x_2$: $\frac{\partial u_1}{\partial x_1}, \frac{\partial u_1}{\partial x_2}, \frac{\partial u_2}{\partial x_1}, \frac{\partial u_2}{\partial x_2}$.
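- Finally, multiply the matching pairs and sum the contributions from both intermediate variables to find the partial derivative of $y$ with respect to $x_1$ or $x_2$. For $x_1$, this is $\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_1} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_1}$, and likewise for $x_2$.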
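
As a concrete instance of these steps, suppose (purely for illustration; these particular functions are an assumed choice) that $u_1 = x_1 x_2$, $u_2 = x_1 + x_2$, and $y = u_1^2 + u_2$. The individual partial derivatives are:

$\frac{\partial y}{\partial u_1} = 2 u_1, \quad \frac{\partial y}{\partial u_2} = 1, \quad \frac{\partial u_1}{\partial x_1} = x_2, \quad \frac{\partial u_1}{\partial x_2} = x_1, \quad \frac{\partial u_2}{\partial x_1} = 1, \quad \frac{\partial u_2}{\partial x_2} = 1$

Summing over both paths gives:

$\frac{\partial y}{\partial x_1} = 2 u_1 \cdot x_2 + 1 \cdot 1 = 2 x_1 x_2^2 + 1$

$\frac{\partial y}{\partial x_2} = 2 u_1 \cdot x_1 + 1 \cdot 1 = 2 x_1^2 x_2 + 1$

Substituting first and differentiating directly, $y = x_1^2 x_2^2 + x_1 + x_2$, yields the same two results, which confirms the sum.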

This is why the chain rule sums the derivatives: each intermediate variable contributes to the final result, and you need to account for all of them to get the full effect on $y$.