## The Chain Rules

We can't compute partial derivatives of very complicated functions using just the basic matrix calculus rules we've seen so far. For example, we can't take the derivative of nested expressions like ![](https://explained.ai/matrix-calculus/images/eqn-DCA4F9F0CE7F7CA365E8B26987ED972A-depth003.25.svg)
directly without reducing it to its scalar equivalent. We need to be able to combine our basic vector rules using what we can call the vector chain rule.
We'll start with what we'll call the single-variable chain rule, where we want the derivative of a scalar function with respect to a scalar. Then we'll move on to an important concept called the total derivative and use it to define what we'll pedantically call the single-variable total-derivative chain rule.
The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.

The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like
![](https://explained.ai/matrix-calculus/images/eqn-06D67DC7FE74C1895AEF564F8295E918-depth004.58.svg)
The outermost expression takes the sine of an intermediate result, a nested subexpression that squares $x$.

### Single-variable chain rule

Let's start with the solution to the derivative of our nested expression:
![](https://explained.ai/matrix-calculus/images/eqn-AB5ECA885C5685990CD778580665B3A4-depth004.58.svg)
The solution combines two familiar scalar derivatives:
![](https://explained.ai/matrix-calculus/images/eqn-1A32AA532898DEBB80C0C7A818C5C70B-depth004.58.svg) and
![](https://explained.ai/matrix-calculus/images/eqn-DC7C2E5FB11394451EA6A2010904F0B2-depth004.58.svg)

Chain rules are typically defined in terms of nested functions, such as
![](https://explained.ai/matrix-calculus/images/eqn-281717E562B6C04AA861AC9F2801D016-depth003.25.svg)
for single-variable chain rules.

The single-variable chain rule is:
![](https://explained.ai/matrix-calculus/images/blkeqn-9D3919C42833D1FF1456DEA11D8CC927.svg)

To deploy the single-variable chain rule, follow these steps:

1. Introduce intermediate variables for nested subexpressions and for subexpressions of both binary and unary operators. For example, element-wise multiplication is binary, whereas sin(x) and other trigonometric functions are usually unary because they take a single operand. This step normalizes all equations to single operators or function applications.
2. Compute derivatives of the intermediate variables with respect to their parameters.
3. Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
4. Substitute intermediate variables back in if any are referenced in the derivative equation.

Let's take an example: ![](https://explained.ai/matrix-calculus/images/eqn-872F24FC57CA661E3704C0A10869C6B5-depth003.25.svg)

1. Introduce intermediate variables. Let $u = x^2$ represent the subexpression $x^2$. This gives us:
![](https://explained.ai/matrix-calculus/images/blkeqn-165CA85C4E868C4589FDC97854EF5AFE.svg)

2. Compute derivatives.
![](https://explained.ai/matrix-calculus/images/blkeqn-33DB49B9B2BFE622EF83565332547027.svg)

3. Combine.
![](https://explained.ai/matrix-calculus/images/blkeqn-9D1B1984635759F4C2D23464EBBAA995.svg)

4. Substitute.
![](https://explained.ai/matrix-calculus/images/blkeqn-6E1052BD462233E4BEA04D695437A984.svg)
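
As a rough sketch, the four steps for $y = \sin(x^2)$ can be mirrored in a few lines of Python and compared with a finite-difference estimate. The helper names (`dydx_chain`, `central_difference`) and the test point are illustrative choices, not part of the original derivation.

```python
import math

def f(x):
    # The nested expression from the text: y = sin(x^2)
    return math.sin(x ** 2)

def dydx_chain(x):
    u = x ** 2              # step 1: intermediate variable u = x^2
    du_dx = 2 * x           # step 2: derivative of u with respect to x
    dy_du = math.cos(u)     # step 2: derivative of y = sin(u) with respect to u
    return dy_du * du_dx    # steps 3 and 4: multiply; u already holds x^2

def central_difference(func, x, h=1e-6):
    # Numerical estimate of the derivative, used only as a sanity check
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.3  # arbitrary test point (an assumption)
analytic = dydx_chain(x0)
numeric = central_difference(f, x0)
print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is a cheap way to catch a mistake in any single step.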

With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler unravels nested function calls like $f_4(f_3(f_2(f_1(x))))$ into a sequence (chain) of calls. Let's apply the process to a highly nested expression:
![](https://explained.ai/matrix-calculus/images/eqn-642FD3E590A2B2D21AFE5254BE8E832F-depth003.25.svg)

1. Introduce intermediate variables.
![](https://explained.ai/matrix-calculus/images/blkeqn-A153933499426CFC383D252C30A87953.svg)

2. Compute derivatives.
![](https://explained.ai/matrix-calculus/images/blkeqn-4561212E91367D4B2DCC40262E36921D.svg)

3. Combine four intermediate values.
![](https://explained.ai/matrix-calculus/images/blkeqn-2CF824877C0FB75B7648CE66E56FB509.svg)

4. Substitute.
![](https://explained.ai/matrix-calculus/images/blkeqn-9BC49A78C13740AC58294EAA333AF3CC.svg)

Here is a visualization of the data flow through the chain of operations from x to y:
<div>
<img src="https://explained.ai/matrix-calculus/images/chain-tree.png" width="200"/>
</div>
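
The compiler analogy can be made concrete with a small sketch. It takes $y = \ln(\sin(x^3)^2)$ as an illustrative four-level expression (an assumption chosen for this example; any chain of unary operations works the same way) and stores each intermediate result in its own variable, much like a register.

```python
import math

def f(x):
    # Illustrative nested expression (an assumption): y = ln(sin(x^3)^2)
    return math.log(math.sin(x ** 3) ** 2)

def dydx_chain(x):
    # Step 1: unravel the nesting into a chain of intermediate variables
    u1 = x ** 3
    u2 = math.sin(u1)
    u3 = u2 ** 2
    # u4 = ln(u3) is the output y; its value is not needed for the derivative

    # Step 2: derivative of each link with respect to its single parameter
    du1_dx = 3 * x ** 2
    du2_du1 = math.cos(u1)
    du3_du2 = 2 * u2
    du4_du3 = 1 / u3

    # Step 3: multiply the four intermediate derivatives together
    return du4_du3 * du3_du2 * du2_du1 * du1_dx

x0 = 1.1  # arbitrary test point where sin(x^3) is nonzero
analytic = dydx_chain(x0)
numeric = (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6
closed_form = 6 * x0 ** 2 * math.cos(x0 ** 3) / math.sin(x0 ** 3)
print(analytic, numeric, closed_form)
```

Each line of step 2 looks only at one link of the chain, which is the divide and conquer property the chain rule relies on.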

### Single-variable total-derivative chain rule

Our single-variable chain rule has limited applicability because all intermediate variables must be functions of single variables. But it demonstrates the core mechanism of the chain rule, that of multiplying out all derivatives of intermediate subexpressions. To handle more general expressions such as
$$ y = f(x) = x + x^2 $$
however, we need to augment that basic chain rule. Of course, we can immediately see the derivative,
![](https://explained.ai/matrix-calculus/images/eqn-18EF15EC4DDA421BE5BC86F0295D36CD-depth004.58.svg)
but that is using the scalar addition derivative rule, not the chain rule. If we tried to apply the single-variable chain rule, we'd get the wrong answer. In fact, the previous chain rule is meaningless in this case because the derivative operator $\frac{d}{dx}$ does not apply to multivariate functions, such as $u_2$ among our intermediate variables:

![](https://explained.ai/matrix-calculus/images/blkeqn-5CB23F92FE51ABF1B1885A985EA61BC6.svg)

Let's try it anyway to see what happens. If we pretend that ![](https://explained.ai/matrix-calculus/images/eqn-400C939294CFCAC13949F5A92DD9537A-depth005.85.svg) and ![](https://explained.ai/matrix-calculus/images/eqn-8B1AAFE58A962E6F06775EBD2808D5FE-depth004.58.svg) then ![](https://explained.ai/matrix-calculus/images/eqn-69A0BF0F0F217A5C8CDB490B4C60ABEE-depth005.85.svg)
instead of the right answer, $1 + 2x$.

Because $u_2(x,u_1) = x + u_1$ has multiple parameters, partial derivatives come into play. Let's blindly apply the partial derivative operator to all of our equations and see what we get:

![](https://explained.ai/matrix-calculus/images/blkeqn-7A7B19296641D8B6B96136527F381589.svg)

Oops! The partial ![](https://explained.ai/matrix-calculus/images/eqn-9A001AB7445AD6C36176920C0E5D253F-depth004.67.svg) is wrong because it violates a key assumption for partial derivatives. When taking the partial derivative with respect to $x$, the other variables must not vary as $x$ varies. Otherwise, we could not act as if the other variables were constants. Clearly, though, $u_1(x) = x^2$ is a function of $x$ and therefore varies with $x$. Hence $\frac{\partial u_2(x,u_1)}{\partial x} \neq 1 + 0$ because $\frac{\partial u_1(x)}{\partial x} \neq 0$.

Enter the “law” of total derivatives, which basically says that to compute $\frac{dy}{dx}$, we need to sum up all possible contributions from changes in $x$ to the change in $y$.

The total derivative of $f(x) = u_2(x,u_1)$, which depends on $x$ both directly and indirectly via intermediate variable $u_1(x)$, is given by:
![](https://explained.ai/matrix-calculus/images/blkeqn-09EB861D79D7E60D9B37567CE097631B.svg)
Using this formula, we get the proper answer:
![](https://explained.ai/matrix-calculus/images/blkeqn-217390CDB48372744AC16E8277C9D0CC.svg)

That is an application of what we can call the single-variable total-derivative chain rule:

![](https://explained.ai/matrix-calculus/images/blkeqn-4586E8ADC3AA440DD41501217E7B6E67.svg)

The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.

Let's look at a nested subexpression, such as $f(x) = \sin(x + x^2)$. We introduce three intermediate variables:
![](https://explained.ai/matrix-calculus/images/blkeqn-B5215A178D5EDFCCA280ED63D64A8025.svg)
and partials:
![](https://explained.ai/matrix-calculus/images/blkeqn-C3A02B93F5C0F2C8979E524BB28D35AC.svg)

where both $\frac{\partial u_2}{\partial x}$ and $\frac{\partial f(x)}{\partial x}$ have $\frac{\partial u_i}{\partial x}$ terms that take into account the total derivative.

Next, consider $y = x \times x^2$ instead of $y = x + x^2$. Even though the operator is now multiplication, the total-derivative chain rule formula still adds partial derivative terms.
Here are the intermediate variables and partial derivatives:

![](https://explained.ai/matrix-calculus/images/blkeqn-A1BE672D8676E7C9DE634935C1CDEBBA.svg)

The form of the total derivative remains the same, however:

![](https://explained.ai/matrix-calculus/images/blkeqn-50D1EABA22B46536559D83F8C21F749D.svg)

It's the partials (weights) that change, not the formula, when the intermediate variable operators change.
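
A minimal sketch of that point: the same summing formula handles both $x + x^2$ and $x \times x^2$, and only the partials passed in differ. The function names are illustrative, and the sketch covers just the two-parameter case $u_2(x, u_1)$ with $u_1 = x^2$.

```python
def total_derivative(du2_dx, du2_du1, du1_dx):
    # Direct contribution of x plus the indirect contribution through u1
    return du2_dx + du2_du1 * du1_dx

def d_sum(x):
    # y = u2(x, u1) = x + u1 with u1 = x^2
    # partial of u2 w.r.t. x is 1, partial w.r.t. u1 is 1, du1/dx is 2x
    return total_derivative(1.0, 1.0, 2 * x)

def d_product(x):
    # y = u2(x, u1) = x * u1 with u1 = x^2
    u1 = x ** 2
    # partial of u2 w.r.t. x is u1, partial w.r.t. u1 is x, du1/dx is 2x
    return total_derivative(u1, x, 2 * x)

print(d_sum(3.0))      # 1 + 2x at x = 3
print(d_product(2.0))  # 3x^2 at x = 2
```

The product case yields $x^2 + x \cdot 2x = 3x^2$, matching the derivative of $x^3$, even though the formula itself still adds its terms.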

### Vector chain rule

Let's consider $y = f(x)$, a vector of functions of a single scalar parameter $x$. In vector form:
![](https://explained.ai/matrix-calculus/images/blkeqn-428F0EFA4C7B2EEC64829258E8DAFF86.svg)

Let's introduce two intermediate variables, $g_1$ and $g_2$, one for each $f_i$, so that $y$ looks more like $y = f(g(x))$:
![](https://explained.ai/matrix-calculus/images/blkeqn-5CE597B34A604EDC0DD0B1AD97CFD690.svg)
![](https://explained.ai/matrix-calculus/images/blkeqn-E04563108617169E3793740388785DAB.svg)

The derivative of vector $y$ with respect to scalar $x$ is a vertical vector with elements computed using the single-variable total-derivative chain rule:
![](https://explained.ai/matrix-calculus/images/blkeqn-63BFCE8E2B4E8F603E209ECA0C1DADCE.svg)

So now we have the answer using just the scalar rules, albeit with the derivatives grouped into a vector. Let's try to abstract from that result what it looks like in vector form. The goal is to convert the following vector of scalar operations to a vector operation:
![](https://explained.ai/matrix-calculus/images/blkeqn-0EA3C72DBB9F820121EE6A27D76EC7CC.svg)

If we split the $\frac{\partial f_i}{\partial g_i} \frac{\partial g_i}{\partial x}$ terms, isolating the $\frac{\partial g_i}{\partial x}$ terms into a vector, we get a matrix by vector multiplication:

![](https://explained.ai/matrix-calculus/images/blkeqn-692581F3416029FED8E1CE09890F4A5E.svg)

That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool. Let's check our results:

![](https://explained.ai/matrix-calculus/images/blkeqn-4A8689EA58BF9FA2AF675AAE0C093010.svg)

We get the same answer as the scalar approach. This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule. Compare the vector rule:

![](https://explained.ai/matrix-calculus/images/blkeqn-B0D7932C93DA81FD62418DA5DF3CBE14.svg)


with the single-variable chain rule:
![](https://explained.ai/matrix-calculus/images/blkeqn-8617320E088DCA9EC6795865C614324A.svg)

To make this formula work for multiple parameters or vector X, we just have to change x to vector X in the equation. The effect is that $\partial g/ \partial X$ and the resulting Jacobian, $\partial f / \partial X$ are now matrices instead of vertical vectors. Our complete vector chain rule is:

![](https://explained.ai/matrix-calculus/images/blkeqn-63A95E8A883CCB871C2C68B2D8B6EAA4.svg)

The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of $f_i$ with respect to $g_j$ and $g_i$ with respect to $x_j$. For completeness, here are the two Jacobian components in their full glory:

![](https://explained.ai/matrix-calculus/images/blkeqn-875D6B48E0F3610A491D91FA12067AED.svg)

where $m = |f|$, $n = |x|$, and $k = |g|$. The resulting Jacobian is $m \times n$ (an $m \times k$ matrix multiplied by a $k \times n$ matrix).
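
To see the dimensions at work, here is a small sketch with hypothetical functions `g` (from 2 inputs to 3 intermediates) and `f` (from 3 intermediates to 2 outputs). These functions and their hand-written Jacobians are assumptions made up for illustration. The product of the two Jacobians is compared against a finite-difference Jacobian of the composition.

```python
import math

def g(x):
    # Hypothetical intermediate functions: n = 2 inputs, k = 3 outputs
    x0, x1 = x
    return [x0 * x1, x0 + x1, x0 ** 2]

def f(gv):
    # Hypothetical outer functions: k = 3 inputs, m = 2 outputs
    g0, g1, g2 = gv
    return [g0 + g1 * g2, math.sin(g0)]

def jac_g(x):
    # k x n Jacobian of g with respect to x
    x0, x1 = x
    return [[x1, x0], [1.0, 1.0], [2 * x0, 0.0]]

def jac_f(gv):
    # m x k Jacobian of f with respect to g
    g0, g1, g2 = gv
    return [[1.0, g2, g1], [math.cos(g0), 0.0, 0.0]]

def matmul(a, b):
    # Plain matrix product of nested lists
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def numeric_jacobian(func, x, h=1e-6):
    # Central-difference Jacobian, one input column at a time
    m = len(func(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            jac[i][j] = (fp[i] - fm[i]) / (2 * h)
    return jac

x = [0.7, -1.2]
chain = matmul(jac_f(g(x)), jac_g(x))             # (m x k)(k x n) = m x n
numeric = numeric_jacobian(lambda v: f(g(v)), x)
print(chain)
print(numeric)
```

Note that the matrix product sums over all $k$ intermediate variables for every entry, which is how the total derivative gets folded into the vector rule.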

We can simplify further because, for many applications, the Jacobians are square ($m = n$) and the off-diagonal entries are zero.
![](https://explained.ai/matrix-calculus/images/blkeqn-4869BA6DCF2C9A404EECD993808A74B7.svg)
![](https://explained.ai/matrix-calculus/images/blkeqn-8866C26258F279CD69740D3A26C0CD90.svg)

In this situation, the vector chain rule simplifies to:

![](https://explained.ai/matrix-calculus/images/blkeqn-2D82A2CFEA0A49E9B7D3C8F986DAF14A.svg)

Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain rule values.
The following table summarizes the appropriate components to multiply in order to get the Jacobian.

![](https://explained.ai/matrix-calculus/images/latex-17BE59DB8766A07658ADAA8522995C53.svg)
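
The diagonal special case can be sketched as follows, assuming hypothetical element-wise functions $g_i = x_i^2$ and $f_i = \sin(g_i)$ (chosen only for illustration). The full matrix product of two diagonal Jacobians is compared with the shortcut of multiplying the diagonals element by element.

```python
import math

def diag(values):
    # Square matrix with the given values on the diagonal, zeros elsewhere
    return [[v if i == j else 0.0 for j in range(len(values))]
            for i, v in enumerate(values)]

def matmul(a, b):
    # Plain matrix product of nested lists
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [0.5, 1.0, 2.0]
g = [xi ** 2 for xi in x]              # g_i depends only on x_i
df_dg = [math.cos(gi) for gi in g]     # f_i = sin(g_i) depends only on g_i
dg_dx = [2 * xi for xi in x]

full = matmul(diag(df_dg), diag(dg_dx))                   # general vector chain rule
shortcut = diag([a * b for a, b in zip(df_dg, dg_dx)])    # diagonal simplification
print(full)
print(shortcut)
```

Each diagonal entry is just the single-variable chain rule applied to one element, so an implementation can skip the full matrix product in this case.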


### The gradient of neuron activation

We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, w and b:

![](https://explained.ai/matrix-calculus/images/blkeqn-F3041D1B0AB2DA26CFE6581CCE10BF0F.svg)

We first have to compute $\frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \cdot \mathbf{x} + b)$ and $\frac{\partial}{\partial b}(\mathbf{w} \cdot \mathbf{x} + b)$.
(Recall that neural networks learn through optimization of their weights and biases.)

The dot product $\mathbf{w} \cdot \mathbf{x}$ is just the summation of the element-wise multiplication of the elements:

![](https://explained.ai/matrix-calculus/images/eqn-4B722CF0EFB8F8ABCBB968086BB587E8-depth003.31.svg)

Let's apply the chain rule:

![](https://explained.ai/matrix-calculus/images/blkeqn-AA40E45F705402308665F4778260405C.svg)

Once we've rephrased y, we recognize two subexpressions for which we already know the partial derivatives:

![](https://explained.ai/matrix-calculus/images/blkeqn-CB86F7761DC5757FCE7D9B440DEB6630.svg)

The vector chain rule says to multiply the partials:

![](https://explained.ai/matrix-calculus/images/blkeqn-735EE304812513469D0BAE8D1D32E578.svg)

To check our results, we can grind the dot product down into a pure scalar function:

![](https://explained.ai/matrix-calculus/images/blkeqn-8E233C707CFA165FFECCE145E88AEB24.svg)

then,
![](https://explained.ai/matrix-calculus/images/blkeqn-44FAB0E50B6FA0C7012FF79FB03EBD14.svg)

Our scalar results match the vector chain rule results.
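
The same check can be sketched numerically. The code below rephrases the dot product as a sum over an element-wise product, multiplies the two partials as the vector chain rule prescribes, and compares the result with a finite-difference estimate. The sample vectors and helper names are assumptions for illustration.

```python
def hadamard(w, x):
    # Element-wise product u = w ⊗ x
    return [wi * xi for wi, xi in zip(w, x)]

def diag(values):
    return [[v if i == j else 0.0 for j in range(len(values))]
            for i, v in enumerate(values)]

w = [0.5, -1.0, 2.0]   # sample weights (an assumption)
x = [3.0, 4.0, -2.0]   # sample input (an assumption)
n = len(w)

du_dw = diag(x)        # partial of u = w ⊗ x with respect to w
dy_du = [1.0] * n      # partial of y = sum(u) with respect to u: a row of ones

# Vector chain rule: row vector times matrix gives x as a row vector
dy_dw = [sum(dy_du[p] * du_dw[p][j] for p in range(n)) for j in range(n)]

def y(wv):
    return sum(hadamard(wv, x))

h = 1e-6
numeric = []
for j in range(n):
    wp, wm = list(w), list(w)
    wp[j] += h
    wm[j] -= h
    numeric.append((y(wp) - y(wm)) / (2 * h))

print(dy_dw, numeric)
```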

Now, let $y = \mathbf{w} \cdot \mathbf{x} + b$, the full expression within the max activation function call. We have two different partials to compute, but we don't need the chain rule:
![](https://explained.ai/matrix-calculus/images/blkeqn-0F1D53EBF96D7DC463E2226B77812776.svg)

Let's tackle the partials of the neuron activation, $\max(0, \mathbf{w} \cdot \mathbf{x} + b)$. The use of the $\max(0, z)$ function call on scalar $z$ just says to treat all negative $z$ values as 0. The derivative of the max function is a piecewise function. When $z \leq 0$, the derivative is 0 because the function's output is the constant 0. When $z > 0$, the derivative of the max function is just the derivative of $z$, which is 1:

![](https://explained.ai/matrix-calculus/images/blkeqn-61BE19B395EB3114577B2100997DFB6D.svg)

An aside on broadcasting functions across scalars: when one or both of the max arguments are vectors, such as $\max(0, \mathbf{x})$, we broadcast the single-variable function max across the elements. This is an example of an element-wise unary operator. Just to be clear:

![](https://explained.ai/matrix-calculus/images/blkeqn-94240FE4B77DEE9DA38F596CD4149F9D.svg)

For the derivative of the broadcast version then, we get a vector of zeros and ones where:
![](https://explained.ai/matrix-calculus/images/blkeqn-5B3555A4691C5DA19688E4F76BA1C3AD.svg)

![](https://explained.ai/matrix-calculus/images/blkeqn-4EA88847E78682CFDBCA0019C1623945.svg)
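
A minimal sketch of the broadcast max and its derivative, using plain Python lists. The function names `relu` and `relu_grad` are illustrative labels, and the derivative at exactly zero is taken to be 0, following the $z \leq 0$ convention in the text.

```python
def relu(xs):
    # Broadcast max(0, x_i) across the elements of a vector
    return [max(0.0, xi) for xi in xs]

def relu_grad(xs):
    # Element-wise derivative: 1 where x_i > 0, otherwise 0
    return [1.0 if xi > 0 else 0.0 for xi in xs]

xs = [-2.0, 0.0, 3.5]
print(relu(xs))
print(relu_grad(xs))
```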

To get the derivative of the $activation(\mathbf{x})$ function, we need the chain rule because of the nested subexpression, $\mathbf{w} \cdot \mathbf{x} + b$. Following our process, let's introduce intermediate scalar variable $z$ to represent the affine function, giving:

![](https://explained.ai/matrix-calculus/images/blkeqn-261BC49758F84DF99117345CD8D22CFE.svg)

![](https://explained.ai/matrix-calculus/images/blkeqn-9E429AF61D15BF6942A3132FABAC77A2.svg)

The vector chain rule tells us:

![](https://explained.ai/matrix-calculus/images/blkeqn-B0C84C12426A7A698FBBCB890502411F.svg)

which we can rewrite as follows:
![](https://explained.ai/matrix-calculus/images/blkeqn-AE892FF5E074073E025BB3BBE586B9F5.svg)

and then substitute $w.x+b$ back in:

![](https://explained.ai/matrix-calculus/images/blkeqn-A123AAABD8432822C27BEE74393F78AD.svg)

That equation matches our intuition. When the activation function clips affine function output $z$ to 0, the derivative is zero with respect to any weight $w_i$. When $z > 0$, it's as if the max function disappears and we get just the derivative of $z$ with respect to the weights.

Turning now to the derivative of the neuron activation with respect to b, we get:

![](https://explained.ai/matrix-calculus/images/blkeqn-9CE42F7BD715F354A87DF9043310E3BB.svg)
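
Both partials of the activation can be sketched together. The code assumes the piecewise result described above: when $z \leq 0$ both partials are zero, and otherwise the partial with respect to the weights is the input vector and the partial with respect to $b$ is 1. The function names and sample values are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def activation(w, b, x):
    # max(0, w . x + b)
    return max(0.0, dot(w, x) + b)

def activation_grads(w, b, x):
    # Returns (partial w.r.t. w, partial w.r.t. b)
    z = dot(w, x) + b
    if z > 0:
        return list(x), 1.0        # max "disappears": derivative of z itself
    return [0.0] * len(x), 0.0     # output clipped to 0: nothing changes

w = [0.5, -1.0]
x = [2.0, 0.5]

active = activation_grads(w, 0.25, x)    # z = 0.75, neuron is active
clipped = activation_grads(w, -2.0, x)   # z = -1.5, neuron is clipped
print(active, clipped)

# Finite-difference check of the partial w.r.t. w[0] in the active case
h = 1e-6
numeric_w0 = (activation([w[0] + h, w[1]], 0.25, x)
              - activation([w[0] - h, w[1]], 0.25, x)) / (2 * h)
print(numeric_w0)
```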

Let's use these partial derivatives now to handle the entire loss function.

### The gradient of the neural network loss function

Training a neuron requires that we take the derivative of our loss or “cost” function with respect to the parameters of our model, w and b. Because we train with multiple vector inputs (e.g., multiple images) and scalar targets (e.g., one classification per image), we need some more notation. Let

![](https://explained.ai/matrix-calculus/images/blkeqn-D9D9E4DA80EA78BCCCDDC0BE89A198CD.svg)

where N = |X|, and then let
![](https://explained.ai/matrix-calculus/images/blkeqn-F924EBF36B5E655648826C8AE83DE16D.svg)

where $y_i$ is a scalar. Then the cost equation becomes:

![](https://explained.ai/matrix-calculus/images/blkeqn-3D04A32BFDCD990F21451D8230C46FB1.svg)

Following our chain rule process introduces these intermediate variables:

![](https://explained.ai/matrix-calculus/images/blkeqn-ED01A1463E7C7657E0DD1546F6C48BFB.svg)

Let's compute the gradient with respect to w first.

### The gradient with respect to the weights

From before, we know:

![](https://explained.ai/matrix-calculus/images/blkeqn-577C100C01A97DEA7FB361169DA383B5.svg)

and 
![](https://explained.ai/matrix-calculus/images/blkeqn-8E94E788919EA6D0BE11C2615A11C009.svg)

Then, for the overall gradient, we get:

![](https://explained.ai/matrix-calculus/images/latex-9F8112A77C51E95057A9E56D29FFB669.svg)

To interpret that equation, we can substitute an error term 
![](https://explained.ai/matrix-calculus/images/eqn-A90A948C72E3B6E10EF49E9CA3323248-depth002.65.svg)  yielding:

![](https://explained.ai/matrix-calculus/images/blkeqn-6F2CDC50A69419550C3127B318DB71CD.svg)

Of course, we want to reduce, not increase, the loss, which is why the gradient descent recurrence relation takes the negative of the gradient to update the current position (for scalar learning rate $\eta$):

![](https://explained.ai/matrix-calculus/images/blkeqn-C50B63069D6F8CA47A43A8116F4AD21B.svg)

Because the gradient indicates the direction of higher cost, we want to update the weights in the opposite direction.

### The derivative with respect to the bias

To optimize the bias, b, we also need the partial with respect to b. Here are the intermediate variables again:

![](https://explained.ai/matrix-calculus/images/blkeqn-18067E5F73988B179A304788A7BC5786.svg)

We computed the partial with respect to the bias for equation $u(w,b,x)$ previously: 

![](https://explained.ai/matrix-calculus/images/blkeqn-97F6B2C0D2E31579875BAE3E458BF333.svg)

For v, the partial is:
![](https://explained.ai/matrix-calculus/images/blkeqn-8C418D6F20C9CD5F0C56184F94005AF3.svg)

And for the partial of the cost function itself we get:

![](https://explained.ai/matrix-calculus/images/latex-D2EB5E709D4D4474EDB3DE6699F91F9A.svg)

As before, we can substitute an error term:

![](https://explained.ai/matrix-calculus/images/blkeqn-2D57432B77DCFDC3D65FC04C9F6621A7.svg)

The partial derivative is then just the average error or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

![](https://explained.ai/matrix-calculus/images/blkeqn-301542C82A1BF05D145392056ADC0AC2.svg)
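
Putting the pieces together, here is a hedged sketch of the two gradients and one gradient descent step. It assumes the cost is the mean squared difference between each target $y_i$ and the activation $\max(0, \mathbf{w} \cdot \mathbf{x}_i + b)$, with error term $e_i = \mathbf{w} \cdot \mathbf{x}_i + b - y_i$. The data, function names, and learning rate are made up for illustration.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def cost(w, b, X, y):
    # Assumed cost: mean squared error against the max(0, w . x + b) activation
    return sum((yi - max(0.0, dot(w, xi) + b)) ** 2
               for xi, yi in zip(X, y)) / len(X)

def cost_gradients(w, b, X, y):
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi) + b
        if z > 0:                      # clipped samples contribute zero
            e = z - yi                 # error term for this sample
            for j, xij in enumerate(xi):
                grad_w[j] += 2 * e * xij / n
            grad_b += 2 * e / n
    return grad_w, grad_b

def gradient_descent_step(w, b, X, y, eta):
    # Move against the gradient, scaled by learning rate eta
    grad_w, grad_b = cost_gradients(w, b, X, y)
    return [wj - eta * gj for wj, gj in zip(w, grad_w)], b - eta * grad_b

X = [[1.0, 2.0], [2.0, -1.0], [-1.0, -3.0]]   # toy inputs (an assumption)
y = [3.0, 1.0, 0.5]                           # toy targets (an assumption)
w, b = [0.4, 0.3], 0.1

before = cost(w, b, X, y)
w_new, b_new = gradient_descent_step(w, b, X, y, eta=0.05)
after = cost(w_new, b_new, X, y)
print(before, after)
```

On this toy data, one step with a small learning rate lowers the cost, which is the behavior the update rule is meant to produce.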

In practice, it is convenient to combine w and b into a single vector parameter rather than having to deal with two different partials: 

![](https://explained.ai/matrix-calculus/images/eqn-1C5C3BA710F818B84DF992E699DB50C3-depth003.25.svg). This requires a tweak to the input vector $\mathbf{x}$ as well but simplifies the activation function. By tacking a 1 onto the end of $\mathbf{x}$, ![](https://explained.ai/matrix-calculus/images/eqn-CDA92F9769DA156F5D82B4BF0D40A8B4-depth003.25.svg), ![](https://explained.ai/matrix-calculus/images/eqn-6C156670CEC09096976A6722592523F3-depth001.06.svg) becomes ![](https://explained.ai/matrix-calculus/images/eqn-03741D422AA7DF7FF34288D8E4395143-depth000.00.svg).
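
A small sketch of this trick, with illustrative names `w_hat` and `x_hat` for the augmented vectors: appending $b$ to the weights and 1 to the input turns the affine function into a single dot product.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, -1.0]
b = 0.25
x = [2.0, 0.5]

w_hat = w + [b]      # weights with the bias appended
x_hat = x + [1.0]    # input with a constant 1 appended

affine = dot(w, x) + b
combined = dot(w_hat, x_hat)
print(affine, combined)
```

With this layout, a single gradient with respect to `w_hat` covers both the weights and the bias, since the last component of `x_hat` is always 1.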