## Matrix Calculus

When we move from derivatives of one function to derivatives of many functions, we move from
the world of vector calculus to **matrix calculus**. Let’s compute partial derivatives for two functions,
both of which take two parameters.

For example, take

![](https://explained.ai/matrix-calculus/images/eqn-D6DEAE7E403381C2C425D4B40CCA936E-depth003.25.svg) 

and ![](https://explained.ai/matrix-calculus/images/eqn-D182AD2135D1E887AFFCA045F432B2CA-depth003.25.svg)

The gradient for g has two entries, a partial derivative for each parameter:

![](https://explained.ai/matrix-calculus/images/blkeqn-FDB56AA8804E0D13E1555DB8E0E1AAEE.svg)
![](https://explained.ai/matrix-calculus/images/blkeqn-13AF8214DD5A2040D650C7B460C88129.svg)

So the gradient of g is ![](https://explained.ai/matrix-calculus/images/eqn-E2670D9705180E731C7455A4B46B7AF6-depth003.25.svg)

Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two
functions, we can also organize their gradients into a matrix by stacking the gradients. When we
do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:

![](https://explained.ai/matrix-calculus/images/blkeqn-B45AD8AF1574CD63AE6980B44770D643.svg)

Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout, but many papers and software packages use the denominator layout. The denominator layout is just the transpose of the numerator layout Jacobian (flip it around its diagonal):

![](https://explained.ai/matrix-calculus/images/blkeqn-B5113497453A60E25E3241A14CC582C3.svg)
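
As a quick numerical check of the two layouts, here is a minimal Python sketch. It assumes the two example functions rendered in the images above are f(x, y) = 3x²y and g(x, y) = 2x + y⁸; the helper names (`jacobian_numerator_layout`, `finite_difference_jacobian`, `transpose`) are made up for illustration.

```python
def f(x, y):
    # Assumed first example function: f(x, y) = 3 x^2 y
    return 3 * x**2 * y


def g(x, y):
    # Assumed second example function: g(x, y) = 2x + y^8
    return 2 * x + y**8


def jacobian_numerator_layout(x, y):
    """Analytic Jacobian: one row per function, one column per parameter."""
    return [
        [6 * x * y, 3 * x**2],  # gradient of f: [df/dx, df/dy]
        [2.0, 8 * y**7],        # gradient of g: [dg/dx, dg/dy]
    ]


def transpose(matrix):
    """Flip a matrix around its diagonal (numerator <-> denominator layout)."""
    return [list(column) for column in zip(*matrix)]


def finite_difference_jacobian(x, y, eps=1e-6):
    """Central-difference approximation, used only to sanity-check the analytic result."""
    rows = []
    for func in (f, g):
        d_dx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
        d_dy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
        rows.append([d_dx, d_dy])
    return rows


J = jacobian_numerator_layout(1.0, 2.0)       # gradients as rows
J_denominator = transpose(J)                  # gradients as columns
J_check = finite_difference_jacobian(1.0, 2.0)
```

At the point (1, 2) the rows of `J` are the gradients of f and g, and `J_denominator` holds the same numbers with rows and columns swapped.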

### Derivatives with matrices

#### Matrix-by-scalar

The derivative of a matrix function Y by a scalar x is given (in numerator layout notation) by
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/968417a1863243432c63633a4636c4bd94008643)

#### Scalar-by-matrix
The derivative of a scalar function y of a p×q matrix X of independent variables, with respect to the matrix X, is given (in numerator layout notation) by
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/7f4596c2ed35b0cee2ee78dc35e723728a169bd1)
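
To make the shapes concrete, the following sketch approximates both derivatives with central differences. The functions `Y` and `y` are hypothetical examples chosen for this illustration, and `matrix_by_scalar` and `scalar_by_matrix` are made-up helper names. The point to notice is that, in numerator layout, the derivative of a scalar by a p×q matrix comes out as a q×p matrix.

```python
def matrix_by_scalar(Y, t, eps=1e-6):
    """dY/dt: same shape as Y(t); entry (i, j) is the derivative of Y[i][j]."""
    hi, lo = Y(t + eps), Y(t - eps)
    return [
        [(a - b) / (2 * eps) for a, b in zip(row_hi, row_lo)]
        for row_hi, row_lo in zip(hi, lo)
    ]


def scalar_by_matrix(y, X, eps=1e-6):
    """dy/dX in numerator layout: a q x p matrix whose (j, i) entry is dy/dX[i][j]."""
    p, q = len(X), len(X[0])
    out = [[0.0] * p for _ in range(q)]
    for i in range(p):
        for j in range(q):
            hi = [row[:] for row in X]
            lo = [row[:] for row in X]
            hi[i][j] += eps
            lo[i][j] -= eps
            out[j][i] = (y(hi) - y(lo)) / (2 * eps)
    return out


def Y(t):
    # Hypothetical 2x2 matrix-valued function of a scalar t.
    return [[t, t**2], [3 * t, 1.0]]


def y(X):
    # Hypothetical scalar function: weighted sum with weight 10*(i+1) + (j+1).
    return sum(
        (10 * (i + 1) + (j + 1)) * X[i][j]
        for i in range(len(X))
        for j in range(len(X[0]))
    )


dY_dt = matrix_by_scalar(Y, 2.0)   # approximately [[1, 4], [3, 0]]
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
dy_dX = scalar_by_matrix(y, X)     # approximately [[11, 21], [12, 22], [13, 23]]
```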


## Generalization of the Jacobian

So far, we've looked at a specific example of a Jacobian matrix. To define the Jacobian matrix more generally, let's combine multiple parameters into a single vector argument: 
![](https://explained.ai/matrix-calculus/images/eqn-8C56090E55CDB76D1CD0E738EBA7F164-depth003.25.svg)

Lowercase letters in bold font such as **x** are vectors and those in italics font like *x* are scalars. *xi* is the i-th element of vector **x** and is in italics because a single vector element is a scalar. We'll assume that all vectors are vertical by default, of size n × 1:
![](https://explained.ai/matrix-calculus/images/blkeqn-D76C868C669197F65B05E96473454834.svg)

With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length ![](https://explained.ai/matrix-calculus/images/eqn-E3114C625CDDDC18ED29BA629242BD65-depth003.25.svg) where ![](https://explained.ai/matrix-calculus/images/eqn-DEA8E196A572D082201CD5ABF2FA82DE-depth003.25.svg) is the cardinality (count) of elements in x. Each fi function within f returns a scalar just as in the previous section:

![](https://explained.ai/matrix-calculus/images/blkeqn-CD6121D27CD89157BF272E5E50AE32FE.svg)

It's very often the case that ![](https://explained.ai/matrix-calculus/images/eqn-A193EBC083B4745370F6F1343383D9CC-depth000.14.svg) because we will have a scalar function result for each element of the x vector. For example, consider the identity function ![](https://explained.ai/matrix-calculus/images/eqn-23225C9E5521B6A9777579BE4B92245C-depth003.25.svg) :

![](https://explained.ai/matrix-calculus/images/blkeqn-4BAF672444FD71616154DE2BE79A5DD6.svg)

So we have m=n functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all ![](https://explained.ai/matrix-calculus/images/eqn-FBFEB9C8459FEE5A2BD529C07B881153-depth001.08.svg) possible partial derivatives (m rows and n columns), which is the stack of m gradients with respect to x:

![](https://explained.ai/matrix-calculus/images/blkeqn-C6F45926C0FEAD3BD359AA24A7FB23A2.svg)

The Jacobian of the identity function f(x)=x, with fi(x)=xi, has n functions and each function has n parameters held in a single vector x. The Jacobian is, therefore, a square matrix since m=n:

![](https://explained.ai/matrix-calculus/images/latex-229DDEF6A61228EE3F98CD129BBF9663.svg)
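
The general definition can be sketched as a small finite-difference routine that builds the m×n matrix one column (one parameter) at a time. This is only an approximate numerical illustration, not a symbolic derivation; `jacobian`, `identity`, and `example` are hypothetical names introduced here.

```python
def jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x (rows = functions, columns = parameters)."""
    m, n = len(f(x)), len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_hi = list(x)
        x_lo = list(x)
        x_hi[j] += eps
        x_lo[j] -= eps
        f_hi, f_lo = f(x_hi), f(x_lo)
        for i in range(m):
            # Entry (i, j) is the partial derivative of f_i with respect to x_j.
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def identity(x):
    # f(x) = x, so m = n and the Jacobian should be the identity matrix.
    return list(x)


def example(x):
    # A made-up case with m = 2 functions of n = 3 parameters.
    return [x[0] * x[1], x[0] + x[2]]


point = [1.0, 2.0, 3.0]
J_identity = jacobian(identity, point)  # approximately the 3 x 3 identity
J_example = jacobian(example, point)    # approximately [[2, 1, 0], [1, 0, 1]]
```

The second example shows that the Jacobian does not have to be square: two functions of three parameters give a 2×3 matrix.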

## Derivatives of vector element-wise binary operators

Element-wise binary operations on vectors, such as vector addition w+x, are important because we can express many common vector operations, such as the multiplication of a vector by a scalar, as element-wise binary operations. By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. This is how all the basic math operators are applied by default in NumPy or TensorFlow, for example. Examples that often crop up in deep learning are
max(w,x) and w>x (which returns a vector of ones and zeros).
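
The idea can be sketched in plain Python, without relying on any particular array library. The helper `elementwise` is a hypothetical name used only for this illustration; array libraries apply their operators this way internally.

```python
import operator


def elementwise(op, w, x):
    """Apply a binary operator to matching items of two equal-length vectors."""
    if len(w) != len(x):
        raise ValueError("element-wise operations need vectors of the same length")
    return [op(wi, xi) for wi, xi in zip(w, x)]


w = [1.0, 5.0, 3.0]
x = [4.0, 2.0, 3.0]

total = elementwise(operator.add, w, x)                  # w + x
largest = elementwise(max, w, x)                         # max(w, x)
greater = elementwise(lambda a, b: int(a > b), w, x)     # w > x as ones and zeros
```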

We can generalize the element-wise binary operations with notation ![](https://explained.ai/matrix-calculus/images/eqn-E2E59AE84EE7A5B1C905E50FA7753A31-depth003.25.svg) where ![](https://explained.ai/matrix-calculus/images/eqn-5D19E7D8CDFE53DEB40F29D8936E6C89-depth003.25.svg). (Reminder: |x| is the number of items in x.)
The ![](https://explained.ai/matrix-calculus/images/eqn-1A74909B6CBAA4532A76D83B72C12DE0-depth002.52.svg) symbol represents any element-wise operator (such as +) and not the function composition operator. Here's what equation ![](https://explained.ai/matrix-calculus/images/eqn-E2E59AE84EE7A5B1C905E50FA7753A31-depth003.25.svg) looks like when we zoom in to examine the scalar equations:

![](https://explained.ai/matrix-calculus/images/blkeqn-ADD1230AE3E64A1B7FA77851BB1F07A1.svg)

where we write n (not m) equations vertically to emphasize the fact that the results of element-wise operators are vectors of size m=n.

Using the ideas from the last section, we can see that the general case for the Jacobian with respect to w is the square matrix:

![](https://explained.ai/matrix-calculus/images/blkeqn-54F95B3CFFD404740FAD218B308DEF70.svg)

and the Jacobian with respect to x is:

![](https://explained.ai/matrix-calculus/images/blkeqn-2693C112F589CD0E26853EAD5ED36CFD.svg)

The Jacobian is very often a diagonal matrix, that is, a matrix that is zero everywhere but the diagonal. Because this greatly simplifies the Jacobian, let's examine in detail when the Jacobian reduces to a diagonal matrix for element-wise operations.

In a diagonal Jacobian, all elements off the diagonal are zero, ![](https://explained.ai/matrix-calculus/images/eqn-8B14469E98630C19F16578F90C45F62E-depth007.21.svg) where ![](https://explained.ai/matrix-calculus/images/eqn-B064F8555EC660F2F8BDC927D9636A06-depth002.72.svg)

(Notice that we are taking the partial derivative with respect to wj not wi.) Under what conditions are those off-diagonal elements zero? Precisely when fi and gi are constants with respect to wj,
![](https://explained.ai/matrix-calculus/images/eqn-44C0281EC00A2C0E93E4E3863EE9083D-depth007.21.svg)

Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero, 0 ○ 0 = 0 no matter what, and the partial derivative of a constant is zero.

Those partials go to zero when fi and gi are not functions of wj. We know that element-wise operations imply that fi is purely a function of wi and gi is purely a function of xi. For example, w+x sums wi+xi. Consequently, ![](https://explained.ai/matrix-calculus/images/eqn-45553270D20A27EBD4AAE84292606CDD-depth003.25.svg) reduces to
![](https://explained.ai/matrix-calculus/images/eqn-5F2A3B3A730ABED47918785C5EBF5039-depth003.25.svg) and the goal becomes ![](https://explained.ai/matrix-calculus/images/eqn-767EBC7C8A1785557E38FD32A10FB123-depth007.21.svg)

fi(wi) and gi(xi) look like constants to the partial differentiation operator with respect to wj when ![](https://explained.ai/matrix-calculus/images/eqn-B064F8555EC660F2F8BDC927D9636A06-depth002.72.svg), so the partials are zero off the diagonal.

Under this condition, the elements along the diagonal of the Jacobian are ![](https://explained.ai/matrix-calculus/images/eqn-C196BF268027E86D3D2420C2A205AF28-depth005.92.svg) :

![](https://explained.ai/matrix-calculus/images/blkeqn-4CFC6C644E4A95B5760435C5094BE095.svg)

More succinctly, we can write:
![](https://explained.ai/matrix-calculus/images/blkeqn-3D114C6873F46EE41AF91BF8B1BB37CD.svg)
and
![](https://explained.ai/matrix-calculus/images/blkeqn-78E1B2628221D9FD588A011D54670619.svg)
where diag(x) constructs a matrix whose diagonal elements are taken from vector x.

For example, vector addition w+x fits our element-wise diagonal condition because f(w)+g(x) has scalar equations ![](https://explained.ai/matrix-calculus/images/eqn-85550E18F87E4AF75645A38273B97A80-depth003.25.svg)
with partial derivatives:

![](https://explained.ai/matrix-calculus/images/blkeqn-A5DC35FE0CEC748B35BB5991933C4698.svg)
![](https://explained.ai/matrix-calculus/images/blkeqn-C2B0BF832F19994D832F90C18B1F04AF.svg)
That gives us ![](https://explained.ai/matrix-calculus/images/eqn-2935E504F134B53B2C03072175BCCD1F-depth004.67.svg) 
the identity matrix, because every element along the diagonal is 1. I represents the square identity matrix of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.

Given the simplicity of this special case, fi(w) reducing to fi(wi), you should be able to derive the Jacobians for the common element-wise binary operations on vectors:

![](https://explained.ai/matrix-calculus/images/blkeqn-0C9CE28C888576E0D4873BDD69BC74EA.svg)
![](https://explained.ai/matrix-calculus/images/blkeqn-10B3C04502114250E4A74A1EB5F27F05.svg)
The ![](https://explained.ai/matrix-calculus/images/eqn-790C76CEB13E928D08EDC53D7AC4BB5C-depth001.08.svg) and ![](https://explained.ai/matrix-calculus/images/eqn-98FF0549EBE322C195C2B36FD5EEAD33-depth001.08.svg) operators are element-wise multiplication and division.
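
As a sanity check on two of these results, the sketch below compares finite-difference Jacobians with the diagonal forms one obtains by differentiating wi·xi with respect to wi (giving xi) and wi/xi with respect to xi (giving −wi/xi²). The helpers `jacobian`, `diag`, and `close` are hypothetical names, and the numbers are approximations rather than exact derivatives.

```python
def jacobian(func, v, eps=1e-6):
    """Central-difference Jacobian: one row per output, one column per input."""
    m, n = len(func(v)), len(v)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        hi = list(v)
        lo = list(v)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = func(hi), func(lo)
        for i in range(m):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def diag(v):
    """Build a square matrix with v on the diagonal and zeros elsewhere."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]


def close(a, b, tol=1e-6):
    """True when two matrices agree entry by entry within tol."""
    return all(abs(p - q) < tol for ra, rb in zip(a, b) for p, q in zip(ra, rb))


w = [1.0, 2.0, 3.0]
x = [4.0, 5.0, 6.0]

# Element-wise multiplication, differentiated with respect to w (x held fixed).
J_mul_w = jacobian(lambda w_: [wi * xi for wi, xi in zip(w_, x)], w)
expected_mul_w = diag(x)

# Element-wise division, differentiated with respect to x (w held fixed).
J_div_x = jacobian(lambda x_: [wi / xi for wi, xi in zip(w, x_)], x)
expected_div_x = diag([-wi / xi**2 for wi, xi in zip(w, x)])

mul_matches = close(J_mul_w, expected_mul_w)
div_matches = close(J_div_x, expected_div_x)
```

The off-diagonal entries come out as zero because perturbing wj (or xj) leaves every output with index i ≠ j unchanged.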

## Derivatives involving scalar expansion

When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector
and then performing an element-wise binary operation. For example, adding scalar z to vector x,
y = x + z, is really ![](https://explained.ai/matrix-calculus/images/eqn-6307CEA088D2D4E98E5B163B9CE8F510-depth002.33.svg)
where ![](https://explained.ai/matrix-calculus/images/eqn-3ABE4A0471143ABFC180C9FA485E5F0A-depth003.25.svg) and ![](https://explained.ai/matrix-calculus/images/eqn-BD2335FC4BBF16BE9590D2501CE8C030-depth003.25.svg)

(The notation ![](https://explained.ai/matrix-calculus/images/eqn-C2C0146718E407005D0C74774C5C5FFC-depth000.00.svg) represents a vector of ones of appropriate length.)
z is any scalar that doesn’t depend on x, which is useful
because then ∂z/∂xi = 0 for any xi and that will simplify our partial derivative computations.

Similarly, multiplying by a scalar, y = xz, is really ![](https://explained.ai/matrix-calculus/images/eqn-93A461AB49FD151E602D9344358732CD-depth003.25.svg) z, where ⊗ is the element-wise multiplication (Hadamard product) of the two vectors.

The partial derivatives of vector-scalar addition and multiplication with respect to vector x use our
element-wise rule:
![](https://explained.ai/matrix-calculus/images/blkeqn-F8F0A8F213DE2D87CB4F0C88B2CE8F4C.svg)

Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of the Jacobian for vector-scalar addition:
![](https://explained.ai/matrix-calculus/images/blkeqn-6B4EFB58ED5F9EBD623321FE1975FA4E.svg)
![](https://explained.ai/matrix-calculus/images/eqn-5107D6819CDC50A8988D3EA0FB9B94CE-depth004.67.svg)

Computing the partial derivative with respect to the scalar parameter z, however, results in a vertical vector, not a diagonal matrix. The elements of the vector are:

![](https://explained.ai/matrix-calculus/images/blkeqn-5B07C016CBBA27F3E3650DA92BF06A24.svg)
Therefore, ![](https://explained.ai/matrix-calculus/images/eqn-F05F04525E4A1B8959DE54DC7C692060-depth005.22.svg)

The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for scalar derivatives:

![](https://explained.ai/matrix-calculus/images/blkeqn-3E6008F6437DA818B481A79FD47D38E4.svg)
![](https://explained.ai/matrix-calculus/images/eqn-3F231096152DFB321FAC62F57A808C35-depth004.67.svg)

The partial derivative with respect to scalar parameter z is a vertical vector whose elements are:

![](https://explained.ai/matrix-calculus/images/blkeqn-C1C1AD1A9E7A6ECCAAACE952B8355BFB.svg)

This gives us ![](https://explained.ai/matrix-calculus/images/eqn-82B897A0A23D9C6BB66EDF17D1D3CB02-depth005.22.svg)
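
The four results of this section (with respect to x: the identity matrix for addition and z times the identity for multiplication; with respect to z: a vector of ones for addition and x itself for multiplication) can be checked numerically. The sketch below uses made-up helper names (`add_scalar`, `mul_scalar`, `jacobian_wrt_x`, `derivative_wrt_z`) and central differences, so the values are approximate.

```python
def add_scalar(x, z):
    # y = x + z, with the scalar z implicitly expanded to a vector
    return [xi + z for xi in x]


def mul_scalar(x, z):
    # y = x z, again with z expanded to match x
    return [xi * z for xi in x]


def jacobian_wrt_x(op, x, z, eps=1e-6):
    """n x n Jacobian of op(x, z) with respect to the vector x."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        y_hi, y_lo = op(hi, z), op(lo, z)
        for i in range(n):
            jac[i][j] = (y_hi[i] - y_lo[i]) / (2 * eps)
    return jac


def derivative_wrt_z(op, x, z, eps=1e-6):
    """Vertical vector (stored as a list) of partials of op(x, z) with respect to z."""
    y_hi, y_lo = op(x, z + eps), op(x, z - eps)
    return [(a - b) / (2 * eps) for a, b in zip(y_hi, y_lo)]


x = [1.0, 2.0, 3.0]
z = 5.0

J_add_x = jacobian_wrt_x(add_scalar, x, z)    # approximately the identity matrix
d_add_z = derivative_wrt_z(add_scalar, x, z)  # approximately [1, 1, 1]
J_mul_x = jacobian_wrt_x(mul_scalar, x, z)    # approximately z times the identity
d_mul_z = derivative_wrt_z(mul_scalar, x, z)  # approximately x
```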

## Vector sum reduction

Summing up the elements of a vector is an important operation in deep learning, for example in the
network loss function, but we can also use it as a way to simplify computing the derivative of
the vector dot product and other operations that reduce vectors to scalars.

Let ![](https://explained.ai/matrix-calculus/images/eqn-DC5FB5DC7AEB54D8C206744EED4AD748-depth003.31.svg). Notice we were careful here to leave the parameter as a vector x because each function fi could use all values in the vector, not just xi. The sum is over the results of the function and not the parameter. The gradient (1×n Jacobian) of vector summation is:

![](https://explained.ai/matrix-calculus/images/blkeqn-40C0C67E5948039B40D9718ECC2858AE.svg)

Let's look at the gradient of the simple y = sum(x). The function inside the summation is just fi(x) = xi and the gradient is then:

![](https://explained.ai/matrix-calculus/images/blkeqn-B6DF766D85C48FF7434B8FFF7BEC9410.svg)
Because ![](https://explained.ai/matrix-calculus/images/eqn-9B32CEB09BA5F0B84935A24BF81D3C9C-depth007.21.svg)
for j not equal to i, we can simplify to:
![](https://explained.ai/matrix-calculus/images/blkeqn-9D07CB9DFE389978D329A8CFE7568825.svg)
Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is the transpose of the vector of ones, 1ᵀ.

As another example, let’s sum the result of multiplying a vector by a constant scalar. If y = sum(xz)
then ![](https://explained.ai/matrix-calculus/images/eqn-A5D03DE59DCDBA948F463FAABD04791D-depth003.25.svg). The gradient is:

![](https://explained.ai/matrix-calculus/images/blkeqn-4C28E3734FC6110AF58C567604ED3462.svg)

The derivative with respect to scalar variable z is 1×1:

![](https://explained.ai/matrix-calculus/images/blkeqn-709809A15FF63948512A3F83DF9F04EA.svg)
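
Both sum-reduction examples can be verified with a short finite-difference sketch. The function name `gradient` and the sample values are assumptions made for this illustration. Following the derivations above, the gradient of sum(x) should be a row of ones, the gradient of sum(xz) with respect to x a row of z values, and the derivative with respect to z the single number sum(x).

```python
def gradient(func, x, eps=1e-6):
    """Approximate the 1 x n gradient of a scalar-valued func at x (returned as a flat list)."""
    grad = []
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        grad.append((func(hi) - func(lo)) / (2 * eps))
    return grad


x = [1.0, 2.0, 3.0]
z = 4.0
eps = 1e-6

# y = sum(x): every partial derivative is 1.
grad_sum = gradient(lambda v: sum(v), x)

# y = sum(xz) with respect to x: every partial derivative is z.
grad_sum_xz = gradient(lambda v: sum(vi * z for vi in v), x)

# y = sum(xz) with respect to the scalar z: a single number, sum(x).
d_sum_xz_dz = (
    sum(xi * (z + eps) for xi in x) - sum(xi * (z - eps) for xi in x)
) / (2 * eps)
```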
