# Supplemental Material
> Written by Ryan Soklaski

## Basics of the Chain Rule

This module deals extensively with composite functions, e.g. $g(f(x))$ (or $(g \circ f)(x)$, using specialized notation). This material introduces a simple method for computing derivatives of composite functions: the so-called chain rule. The chain rule can become unruly from a notational point of view when using the familiar Leibniz notation for the derivative: $\frac{df}{dx}$. For the moment, let's adopt a functional notation for the derivative: $f'(x)$. That is, $\frac{df}{dx}$ and $f'(x)$ represent exactly the same function: the derivative of $f(x)$. Additionally, let's assume for the time being that all of our functions are single-variable functions.

Given that the function $z(x)$ is the composition of the function $g$ with the function $f(x)$:

\begin{equation}
z(x) = (g \circ f)(x)
\end{equation}

the **chain rule** states that the derivative of $z$ with respect to $x$ is given by the composition of the function $g'$ with $f(x)$, multiplied by $f'(x)$:

\begin{equation}
z'(x) = g'(f(x)) \cdot f'(x)
\end{equation}

or, equivalently, using the $g \circ f$ notation for composing our functions:

\begin{equation}
z'(x) = (g' \circ f)(x) \cdot f'(x)
\end{equation}

### Example Calculation Using the Chain Rule
Let's jump to an example immediately to make sure that we are not confused by this notation. Consider the following functions:

\begin{align}
f(x) &= 3x + 1\\
g(x) &= x^2 - 2\\
z(x) = (g \circ f)(x) &= (3x + 1)^2 - 2
\end{align}

The derivatives of $f$ and $g$ are quite simple, and the chain rule tells us how to combine them into $z'$:
\begin{align}
f'(x) &= 3\\
g'(x) &= 2x\\
z'(x) &= g'(f(x)) \cdot f'(x)
\end{align}

According to the chain rule, this is all we need to compute the derivative of $z(x)$. Recognizing that $g'(f(x)) = 2f(x)$, we have:

\begin{equation}
z'(x) = 2f(x) \cdot f'(x)
\end{equation}

Plugging in for $f(x)$ and $f'(x)$:

\begin{align}
z'(x) &= 2(3x + 1) \cdot 3\\
\\
z'(x) &= 18x + 6
\end{align}

As an exercise, write $z(x)$ out in full, as $z(x) = (3x + 1)^2 - 2$, and take its derivative directly (first expand the squared term in the equation). Verify that the result you obtain agrees with the equation for $z'(x)$ that we arrived at by using the chain rule. Review this example carefully, and be sure to have a clear understanding of the symbolic form of the chain rule.
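The result above can also be checked numerically. The following is a minimal sketch, not part of the original material: the helper `central_difference` and its step size `h` are illustrative choices. It compares the chain-rule value $g'(f(x)) \cdot f'(x)$ against the closed form $18x + 6$ and against a finite-difference estimate of the slope of $z$.

```python
def f(x):
    return 3 * x + 1

def g(u):
    return u ** 2 - 2

def z(x):
    # z = g composed with f
    return g(f(x))

def f_prime(x):
    return 3.0

def g_prime(u):
    return 2 * u

def z_prime_chain(x):
    # Chain rule: z'(x) = g'(f(x)) * f'(x)
    return g_prime(f(x)) * f_prime(x)

def central_difference(func, x, h=1e-6):
    # Illustrative numerical estimate of func'(x); h is an assumed step size
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, z_prime_chain(x), 18 * x + 6, central_difference(z, x))
```

The three printed derivative values on each row should agree up to small floating-point error in the finite-difference column.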

### Representing the Chain Rule Using Leibniz Notation

We will ultimately need to make use of the chain rule generalized to *multivariable* functions. For this, Leibniz notation is extremely valuable. Recall that we write the partial derivative of $f(x,y)$, with respect to $x$, as $\frac{\partial f}{\partial x}$. Return to the supplemental material in Machine Learning Module 2 for an overview of partial derivatives. Let's translate the chain rule into Leibniz notation. 

\begin{align}
z'(x) &\rightarrow \frac{dz}{dx} \\ 
g'(f(x)) &\rightarrow \frac{dg}{df}\Bigr|_{f=f(x)} \\
f'(x) &\rightarrow \frac{df}{dx} \\
z'(x) = g'(f(x)) \cdot f'(x) &\rightarrow \frac{dz}{dx} = \frac{dg}{df}\Bigr|_{f=f(x)}\frac{df}{dx}
\end{align}

$g$ is the only function whose input is another dependent variable: $f$. This is why we use the vertical line to indicate that the derivative of $g$ is to be evaluated using $f(x)$ as its input. Because we will always evaluate intermediate derivatives within the chain rule in this fashion, we can forgo the vertical line and simply remain mindful of the preceding statement. Thus the chain rule, written using Leibniz notation, is:

\begin{equation}
\frac{dz}{dx} = \frac{d(g \circ f)}{dx} = \frac{dg}{df}\frac{df}{dx}
\end{equation}

This is the notation that we will use moving forward, especially as we begin to work with partial derivatives of multivariable functions. This simple chain rule is also sufficient for generalizing to an arbitrarily long sequence of compositions. See if you can use the preceding equation to prove that:

\begin{equation}
\frac{d(f_1 \circ f_2 \circ \cdots \circ f_n)}{dx} = \frac{df_1}{df_2}\frac{df_2}{df_3} \cdots \frac{df_{n-1}}{df_n}\frac{df_n}{dx}
\end{equation}

where $\frac{df_j}{df_{j+1}}$ is understood to be evaluated at $(f_{j+1} \circ \cdots \circ f_n)(x)$.
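To make the evaluation points in this product concrete, here is a small sketch that multiplies the derivatives together while working outward from the innermost function. The function `chain_derivative` and the three example functions are hypothetical, chosen only for illustration.

```python
import math

def chain_derivative(funcs, derivs, x):
    """Derivative of f_1(f_2(...f_n(x))) at x.

    funcs  = [f_1, ..., f_n]
    derivs = [f_1', ..., f_n'] (each derivative taken with respect to its own input)
    """
    value = x      # current input to the next function out
    total = 1.0    # running product of derivatives
    # Start at the innermost function f_n and work outward to f_1
    for func, deriv in zip(reversed(funcs), reversed(derivs)):
        total *= deriv(value)  # evaluate this derivative at the inner composition's value
        value = func(value)
    return total

# Illustrative composition: sin((3x + 1)^2)
funcs = [math.sin, lambda u: u ** 2, lambda x: 3 * x + 1]
derivs = [math.cos, lambda u: 2 * u, lambda x: 3.0]

x = 0.5
by_chain_rule = chain_derivative(funcs, derivs, x)
by_hand = math.cos((3 * x + 1) ** 2) * 2 * (3 * x + 1) * 3
print(by_chain_rule, by_hand)
```

Each factor is evaluated at the output of everything nested inside it, which is exactly the convention stated for $\frac{df_j}{df_{j+1}}$.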

One final note to help clarify the vertical-bar notation used above. If you want to compute the derivative of $z$, evaluated at, say, $x = 2$, you would denote this as:

\begin{equation}
\frac{dz}{dx}\Bigr|_{x=2} = \frac{dg}{df}\Bigr|_{f=f(2)} \frac{df}{dx} \Bigr|_{x=2}
\end{equation}

which, of course, is the same as writing:

\begin{equation}
z'(2) = g'(f(2)) \cdot f'(2)
\end{equation}

To be clear, $z'(2)$ and $\frac{dz}{dx}\Bigr|_{x=2}$ both mean: take the derivative of $z(x)$ and evaluate the resulting function at $x = 2$. It doesn't make sense to take the derivative of $z(2)$.

### Extending the Chain Rule for Multiple Variables
Extending the chain rule to the case of composing a single-variable function with a multivariable one is quite simple using partial derivatives:
\begin{align}
z(x, y) &= g(f(x, y)) \\
\frac{\partial z}{\partial x} &= \frac{dg}{df}\frac{\partial f}{\partial x} \\
\frac{\partial z}{\partial y} &= \frac{dg}{df}\frac{\partial f}{\partial y}
\end{align}

The partial derivative of $z$ with respect to $x$ ($y$) is given by the derivative of $g$, evaluated at $f(x, y)$, times the partial derivative of $f$ with respect to $x$ ($y$). Qualitatively, $\frac{\partial f}{\partial x}$ represents the change in $f$ that occurs given a small change in $x$ (holding $y$ fixed); $\frac{dg}{df}$ represents the change in $g$ given a small change in $f$. It follows, then, that $\frac{dg}{df}\frac{\partial f}{\partial x}$ represents the change in $g(f(x, y))$, which is $z$, given a small change in only $x$. This is exactly what $\frac{\partial z}{\partial x}$ represents.
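The "small change" interpretation can be illustrated numerically. In the sketch below, the functions $g(u) = u^3$ and $f(x, y) = xy^2$ are hypothetical examples, not taken from the material above. The code nudges $x$ by a small amount `dx`, holds $y$ fixed, and compares the actual change in $z$ with the change predicted by $\frac{dg}{df}\frac{\partial f}{\partial x}\,dx$.

```python
def f(x, y):
    return x * y ** 2

def g(u):
    return u ** 3

def z(x, y):
    return g(f(x, y))

def dg_df(u):
    return 3 * u ** 2

def df_dx(x, y):
    return y ** 2

x, y = 2.0, 1.0
dx = 1e-6  # a small change in x only; y is held fixed

actual_change = z(x + dx, y) - z(x, y)
# dg/df is evaluated at f(x, y), then multiplied by the partial of f in x
predicted_change = dg_df(f(x, y)) * df_dx(x, y) * dx

print(actual_change, predicted_change)
```

The two numbers agree to first order in `dx`; the residual shrinks like `dx ** 2` as the step is made smaller.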

You will also encounter more complicated instances, in which $g$ itself depends on multiple functions of the independent variables: $z(x, y) = g(p(x, y), q(x, y))$. **The following result is very important. Your homework will require an understanding of this.** Here, you simply **accumulate** (i.e. sum) the derivatives that are contributed by $p$ and $q$, respectively:

\begin{align}
z(x, y) &= g(p(x, y), q(x, y)) \\
\frac{\partial z}{\partial x} &= \frac{\partial g}{\partial p}\frac{\partial p}{\partial x} + \frac{\partial g}{\partial q}\frac{\partial q}{\partial x} \\
\frac{\partial z}{\partial y} &= \frac{\partial g}{\partial p}\frac{\partial p}{\partial y} + \frac{\partial g}{\partial q}\frac{\partial q}{\partial y}
\end{align}

Again, this can be generalized to accommodate an arbitrary number of dependent variables:
\begin{align}
z(x, y) &= g(f_1(x, y), f_2(x, y), \dots, f_n(x, y)) \\
\frac{\partial z}{\partial x} &= \frac{\partial g}{\partial f_1}\frac{\partial f_1}{\partial x} + \frac{\partial g}{\partial f_2}\frac{\partial f_2}{\partial x} + \cdots + \frac{\partial g}{\partial f_n}\frac{\partial f_n}{\partial x} \\
\frac{\partial z}{\partial y} &= \frac{\partial g}{\partial f_1}\frac{\partial f_1}{\partial y} + \frac{\partial g}{\partial f_2}\frac{\partial f_2}{\partial y} + \cdots + \frac{\partial g}{\partial f_n}\frac{\partial f_n}{\partial y}
\end{align}

This should make sense once dissected. I want to describe how varying $x$ by a small amount affects $z$. Thus I need to know how varying $x$ affects $f_1$ ($\frac{\partial f_1}{\partial x}$), and multiply it by how varying $f_1$ affects $z = g(f_1, ..., f_n)$ ($\frac{\partial g}{\partial f_1}$). Thus $\frac{\partial g}{\partial f_1}\frac{\partial f_1}{\partial x}$ describes how varying $x$ affects $z$ **via** $f_1$. Repeat this for $f_2$, ..., $f_n$, and sum up all of these contributions to arrive at how varying $x$ affects $z$ in total: $\frac{\partial z}{\partial x}$.
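The accumulation step itself is just a sum of products. The sketch below assumes a hypothetical helper `chain_rule_partial` and a hypothetical example with three intermediate functions, $g(f_1, f_2, f_3) = f_1 f_2 + f_3$ with $f_1 = x + y$, $f_2 = xy$, $f_3 = x^2$; none of these come from the original text.

```python
def chain_rule_partial(dg_dfs, df_dvars):
    """Sum over i of (dg/df_i) * (df_i/dvar), with every factor already
    evaluated at the point of interest."""
    return sum(dg * df for dg, df in zip(dg_dfs, df_dvars))

x, y = 2.0, 3.0

# Values of the intermediate functions at (x, y)
f1, f2, f3 = x + y, x * y, x ** 2

# Partials of g(f1, f2, f3) = f1 * f2 + f3 with respect to each f_i
dg_dfs = [f2, f1, 1.0]

# Partials of each f_i with respect to x and to y
dfs_dx = [1.0, y, 2 * x]
dfs_dy = [1.0, x, 0.0]

dz_dx = chain_rule_partial(dg_dfs, dfs_dx)
dz_dy = chain_rule_partial(dg_dfs, dfs_dy)

# Direct differentiation of z = x^2 y + x y^2 + x^2 for comparison
print(dz_dx, 2 * x * y + y ** 2 + 2 * x)
print(dz_dy, x ** 2 + 2 * x * y)
```

Each term in the sum is one path by which the independent variable influences $z$, so adding a fourth intermediate function would only add one more entry to each list.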

### Simple Example 
\begin{align}
z(x, y) &= g(p(x,y), q(x, y))\\
g(p, q) &= p^2 - q^3 \\
p(x, y) &= yx^2 \\
q(x, y) &= 2x + y
\end{align}

According to the chain rule provided above, the derivatives needed to compute $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ are simply:
\begin{align}
\frac{\partial g}{\partial p} &= 2p(x, y) \\
\frac{\partial g}{\partial q} &= -3q(x, y)^2 \\
\frac{\partial p}{\partial x} &= 2yx \\ 
\frac{\partial p}{\partial y} &= x^2 \\
\frac{\partial q}{\partial x} &= 2 \\ 
\frac{\partial q}{\partial y} &= 1
\end{align}

Simply plug these equations into the expression for the chain rule for a function of multiple dependent variables, and you have computed the partial derivatives of $z$ in $x$ and $y$.
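As a check on this example, the following sketch assembles the two partial derivatives from the pieces listed above and compares them with finite-difference estimates. The helper `numeric_grad` and its step size `h` are illustrative assumptions, not part of the original material.

```python
def p(x, y):
    return y * x ** 2

def q(x, y):
    return 2 * x + y

def g(p_val, q_val):
    return p_val ** 2 - q_val ** 3

def z(x, y):
    return g(p(x, y), q(x, y))

def grad_z(x, y):
    """Partial derivatives of z in x and y via the multivariable chain rule."""
    dg_dp = 2 * p(x, y)
    dg_dq = -3 * q(x, y) ** 2
    dp_dx, dp_dy = 2 * y * x, x ** 2
    dq_dx, dq_dy = 2, 1
    # Accumulate the contributions that flow through p and through q
    dz_dx = dg_dp * dp_dx + dg_dq * dq_dx
    dz_dy = dg_dp * dp_dy + dg_dq * dq_dy
    return dz_dx, dz_dy

def numeric_grad(func, x, y, h=1e-6):
    # Illustrative central-difference estimate of both partials
    d_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    d_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return d_dx, d_dy

print(grad_z(1.0, 2.0))
print(numeric_grad(z, 1.0, 2.0))
```

Carrying out the substitution by hand gives $\frac{\partial z}{\partial x} = 4y^2x^3 - 6(2x + y)^2$ and $\frac{\partial z}{\partial y} = 2yx^4 - 3(2x + y)^2$, which is what `grad_z` evaluates at a given point.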