[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KingaS03/Introduction-to-Python-2020-June/master)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KingaS03/Introduction-to-Python-2020-June)

# 3. Calculus

Agenda
- differentiation of univariate functions
- rules of differentiation
- differentiation of multivariate functions (the Jacobian, the Hessian)
- chain rule for univariate and multivariate functions 
- the Taylor approximation
- the Newton-Raphson method
- gradient descent method
- backpropagation


## 3.1. Motivation
Find the optimal values of the model parameters of a neural network.

## 3.2. Functions
A function $f:A \to B$ associates to each element of the set $A$ an element of the set $B$.

For our future context $A = \mathbb{R}^n$ and $B = \mathbb{R}^m$ for some natural numbers $n$ and $m$.

### 3.2.1. Differentiation of a univariate function
----------------------------
**Definition of the derivative of a univariate function**<br>

For a function $f:\mathbb{R} \to \mathbb{R}$ we would like to characterise its local linear behaviour. To this end we take two points $x$ and $x+\Delta x$ and the corresponding values $f(x)$ and $f(x+\Delta x)$. We connect the two points $(x, f(x))$ and $(x+\Delta x, f(x+\Delta x))$ of the graph by a line and calculate the gradient (slope) of this line

$$m = \frac{\Delta f}{\Delta x} = \frac{f(x+\Delta x)-f(x)}{(x+\Delta x) - x} = \frac{f(x+\Delta x)-f(x)}{\Delta x}$$

Now we take smaller and smaller values for the increment $\Delta x$. We define the derivative of $f$ at the point $x$ as the limit of the above quotient as $\Delta x$ tends to $0$.

Formally, the first order derivative of $f$ is defined as

$$f'(x) = \lim\limits_{\Delta x \to 0} \frac{f(x+\Delta x)-f(x)}{\Delta x}$$

-------------------
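
The limit in the definition can be made tangible numerically. The following is a minimal sketch, assuming the example function $f(x) = \sin(x)$ and the point $x = 1$ (both chosen here only for illustration); the helper name `difference_quotient` is hypothetical. It evaluates the quotient for shrinking increments and compares it with the exact derivative $\cos(1)$.

```python
import math

def difference_quotient(f, x, dx):
    """Slope of the line through (x, f(x)) and (x + dx, f(x + dx))."""
    return (f(x + dx) - f(x)) / dx

# Assumed example: f(x) = sin(x) at x = 1, whose exact derivative is cos(1).
x = 1.0
exact = math.cos(x)
for dx in [1e-1, 1e-2, 1e-3, 1e-4]:
    approx = difference_quotient(math.sin, x, dx)
    print(f"dx = {dx:.0e}  quotient = {approx:.6f}  error = {abs(approx - exact):.2e}")
```

The printed error should shrink roughly in proportion to $\Delta x$, which is what the limit expresses.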

In [2]:
from IPython.display import IFrame

IFrame("https://www.geogebra.org/classic/enyhcvgw", 1100, 900)

### 3.2.2. Differentiation rules 

Observe that the above defined derivative satisfies the following properties:

0. constant rule: $c' = 0$, for any constant $c \in \mathbb{R}$ 

1. constant multiple rule: $(cf(x))' = c f'(x)$, where $c \in \mathbb{R}$ 

2. sum and difference rule: $(f(x) \pm g(x))' = f'(x)\pm g'(x)$

3. product rule: $(f(x) \cdot g(x))' = f'(x)g(x) + f(x)g'(x)$

4. power rule: $\left(x^r\right)' = r x^{r-1}$, where $r \in \mathbb{R}\setminus\{0\}$

5. exponential derivative: $(e^x)' = e^x$, $(a^x)' = \ln(a) a^x$, where $a \in (0,\infty) \setminus\{1\}$

6. logarithm derivative: $(\ln(x))' = \frac{1}{x}$, $(\log_a(x))' = \frac{1}{\ln(a)x}$, where $a \in (0,\infty) \setminus\{1\}$ and $x > 0$

7. trigonometric derivatives: $(\sin(x))' = \cos(x)$, $(\cos(x))' = -\sin(x)$

8. chain rule: $\left(f(g(x))\right)' = f'(g(x)) \cdot g'(x)$
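
The rules above can be checked numerically. The sketch below assumes the example functions $f(x) = \sin(x)$ and $g(x) = e^x$ and the point $x = 0.7$, all chosen only for illustration; the helper `derivative` is a hypothetical central difference approximation. It compares the product rule and the chain rule with a direct numerical derivative.

```python
import math

def derivative(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f, df = math.sin, math.cos   # f(x) = sin(x), f'(x) = cos(x)
g, dg = math.exp, math.exp   # g(x) = e^x,    g'(x) = e^x
x = 0.7

# Product rule: (f * g)' = f' * g + f * g'
product_numeric = derivative(lambda t: f(t) * g(t), x)
product_rule = df(x) * g(x) + f(x) * dg(x)

# Chain rule: (f(g(x)))' = f'(g(x)) * g'(x)
chain_numeric = derivative(lambda t: f(g(t)), x)
chain_rule = df(g(x)) * dg(x)

print(product_numeric, product_rule)
print(chain_numeric, chain_rule)
```

In each printed pair the two numbers should agree up to the accuracy of the finite difference.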

<!-- <center>
<img src="Images/DifferentiationRules.png" width="500"> 
</center>-->



### 3.2.3. Differentiation of a multivariate function

-------------
**Definition of the partial derivative**

When the function $f:\mathbb{R}^n \to \mathbb{R}$ depends on several variables $x_1, x_2, \ldots, x_n$ and it is nice enough, we can calculate its partial derivatives w.r.t. each variable. The partial derivative of the function $f$ at a point $x^* =(x_1^*, x_2^*, \ldots, x_n^*)$ w.r.t. the variable $x_1$ can be calculated by fixing the values of the other variables to be equal to $x_2^*, \ldots, x_n^*$ and differentiating the resulting function w.r.t. its only remaining variable $x_1$.

To describe this in a mathematically exact way let us consider the function $g: \mathbb{R} \to \mathbb{R}$ defined by the formula 

$$g(x_1) = f(x_1, x_2^*, \ldots, x_n^*)$$

Then the partial derivative of $f$ w.r.t. $x_1$ is denoted by $\frac{\partial f}{\partial x_1}$ and is equal to the derivative of $g$ in the point $x_1^*$, that is

$$\frac{\partial f}{\partial x_1}(x_1^*, x_2^*, \ldots, x_n^*) = g'(x_1^*)$$

---------------

Alternatively we can use for this partial derivative also other notations, like the shorter 

$$\frac{\partial f}{\partial x_1}(x^*) \quad \mbox{or} \quad \partial_{x_1} f(x^*)$$

When it is clear that we are performing our calculations at the point $x^*$ and there is no source of confusion, we can also omit $x^*$ and work just with 

$$\frac{\partial f}{\partial x_1} \quad \mbox{or} \quad \partial_{x_1} f$$

We can proceed similarly in the case of the other variables to calculate all partial derivatives

$$\frac{\partial f}{\partial x_2}(x^*), \quad \frac{\partial f}{\partial x_3}(x^*), \quad \ldots \quad, \frac{\partial f}{\partial x_n}(x^*)$$

-------------
**Definition of the Jacobian**

The row vector of all partial derivatives is called the **gradient** of the function or the **Jacobian** of it, that is

$$ \nabla f (x^*) = \left( \frac{\partial f}{\partial x_1} (x^*), \frac{\partial f}{\partial x_2}(x^*), \ldots, \frac{\partial f}{\partial x_n}(x^*)\right)$$

------------------


**Further generalisation of the Jacobian**

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ that also has a multivariate output, we can take each output component $f_1, f_2, \ldots, f_m$ and calculate its partial derivatives w.r.t. each input variable $x_1, x_2, \ldots, x_n$. For the first output we will have $n$ partial derivatives, i.e.

$$\frac{\partial f_1}{\partial x_1}(x^*), \quad \frac{\partial f_1}{\partial x_2}(x^*), \quad \ldots \quad, \frac{\partial f_1}{\partial x_n}(x^*)$$

The same happens for each output. We organise these partial derivatives in a matrix in such a way that the $i$th row lists the derivatives of $f_i$, and at the intersection of the $i$th row and the $j$th column the derivative 

$$\frac{\partial f_i}{\partial x_j}(x^*)$$

will be stored.

This way we obtain the matrix

$$\nabla f (x^*) = \left(
\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1}(x^*) & \frac{\partial f_1}{\partial x_2}(x^*) & \cdots & \frac{\partial f_1}{\partial x_n}(x^*)\\
\frac{\partial f_2}{\partial x_1}(x^*) & \frac{\partial f_2}{\partial x_2}(x^*) & \cdots & \frac{\partial f_2}{\partial x_n}(x^*)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial f_m}{\partial x_1}(x^*) & \frac{\partial f_m}{\partial x_2}(x^*) & \cdots & \frac{\partial f_m}{\partial x_n}(x^*)
\end{array}
\right)$$

This matrix is called the Jacobian of the function $f$. 

Sometimes the notations $\frac{\partial f}{\partial x}$ or $\partial_x f$ are also used for the above Jacobian matrix. These latter notations are preferred when the function $f$ might depend on other variables as well and we would like to emphasize w.r.t. which variables we consider the Jacobian.

---------------
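
As an illustration of the row and column layout, the Jacobian can be approximated entry by entry with finite differences. This is a minimal sketch using only the standard library; the function name `jacobian` and the example function $f(x_1, x_2) = (x_1^2 x_2,\; x_1 + 3x_2)$ are assumptions made for this example.

```python
def jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at the point x.

    f takes a list of n floats and returns a list of m floats.
    Entry [i][j] approximates the partial derivative of f_i w.r.t. x_j.
    """
    m = len(f(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h      # change only the j-th input
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

# Assumed example: the exact Jacobian is [[2*x1*x2, x1^2], [1, 3]].
def f(x):
    x1, x2 = x
    return [x1 ** 2 * x2, x1 + 3 * x2]

J = jacobian(f, [1.0, 2.0])
for row in J:
    print(row)
```

At the point $(1, 2)$ the exact Jacobian is $\left(\begin{array}{cc} 4 & 1\\ 1 & 3\end{array}\right)$, which the approximation should reproduce up to rounding.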

### 3.2.4. Multivariate chain rule


Having introduced the Jacobian, we can formulate the multivariate chain rule. 

-------------
**The multivariate chain rule**

For two functions $g: \mathbb{R}^k \to \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}^m$ the Jacobian of the compound function $f \circ g: \mathbb{R}^k \to \mathbb{R}^m$ at the point $x^* \in \mathbb{R}^k$ is the product of an $m \times n$ and an $n \times k$ matrix:

$$\nabla (f \circ g) (x^*) = (\nabla f)(g(x^*)) \cdot (\nabla g)(x^*)$$

-------------
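
The matrix product in the chain rule can be verified numerically. The sketch below assumes a hypothetical inner function $g: \mathbb{R}^2 \to \mathbb{R}^3$ and outer function $f: \mathbb{R}^3 \to \mathbb{R}^2$, chosen only so that the shapes are easy to follow; the helpers `jacobian` and `matmul` are illustrative and use only the standard library.

```python
def jacobian(func, x, h=1e-6):
    """Central-difference Jacobian of func: R^n -> R^m at x (lists of floats)."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Assumed inner function g: R^2 -> R^3 and outer function f: R^3 -> R^2.
def g(x):
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]

def f(u):
    return [u[0] + u[1] * u[2], u[0] * u[2]]

def f_after_g(x):
    return f(g(x))

x_star = [1.0, 2.0]
lhs = jacobian(f_after_g, x_star)                            # Jacobian of the composition, 2 x 2
rhs = matmul(jacobian(f, g(x_star)), jacobian(g, x_star))    # (2 x 3) times (3 x 2)
print(lhs)
print(rhs)
```

Both matrices should be close to $\left(\begin{array}{cc} 9 & 2\\ 6 & 1\end{array}\right)$, the value obtained by differentiating the composition by hand.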

### 3.2.5. Higher order differentials (uni- and multivariate case)

-----------------
**Definition of higher order differentials / derivatives**

For a function $f: \mathbb{R}\to \mathbb{R}$ we can calculate its derivative in each point, this means that the derivative $f'$ of the function is again a function mapping each point $x \in \mathbb{R}$ to the derivative $f'(x)$.

Now we can differentiate the first order derivative $f'$ again, and this way we arrive at the second order derivative, i.e.

$$f''(x) = \lim\limits_{\Delta x \to 0}\frac{f'(x+\Delta x) - f'(x)}{\Delta x}$$

The second order derivative can be again differentiated and this way we obtain the $3$rd order derivative of a function denoted by $f'''$ or $f^{(3)}$.

The $n$th order derivative of a function $f: \mathbb{R}\to \mathbb{R}$ in the point $x$ is denoted by $f^{(n)}(x)$ if it exists.

-------------------
**Multivariate case**

We extend the notion of second order derivative to a function $f: \mathbb{R}^n \to \mathbb{R}$.

Consider as starting point the Jacobian of the function (which corresponds to the derivative from the univariate case). Let us calculate all partial derivatives of the first order partial derivatives from 

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right),$$

and organize them in the following way in a matrix

$$\nabla^2 f = \left(
\begin{array}{cccc}
\frac{\partial^2 f}{\partial x_1\partial x_1} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\
\frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n} 
\end{array}
\right)$$

then the resulting matrix is called the **Hessian matrix**.

-----------

The value of the Hessian matrix can be used 

- to derive better local approximation for a function than the linear one,
- to find out whether a critical point is a minimum point, a maximum point or a saddle point (exactly as the second order derivative helps us determine whether a critical point is an extreme point of a univariate function).
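
The second use can be sketched in code for functions of two variables. The example below approximates the Hessian with finite differences and applies the standard second derivative test for two variables (sign of the determinant, then sign of the top left entry). The names `hessian` and `classify_2d` and the three test functions are assumptions made for this illustration.

```python
def hessian(f, x, h=1e-4):
    """Finite-difference approximation of the n x n Hessian of f: R^n -> R at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

def classify_2d(H, tol=1e-8):
    """Second derivative test for a critical point of a function of two variables."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det > tol:
        return "minimum" if H[0][0] > 0 else "maximum"
    if det < -tol:
        return "saddle point"
    return "inconclusive"

origin = [0.0, 0.0]   # a critical point of all three assumed examples below
bowl = classify_2d(hessian(lambda p: p[0] ** 2 + p[1] ** 2, origin))
hill = classify_2d(hessian(lambda p: -p[0] ** 2 - p[1] ** 2, origin))
saddle = classify_2d(hessian(lambda p: p[0] ** 2 - p[1] ** 2, origin))
print(bowl, hill, saddle)
```

For more than two variables the same idea applies, but the test then looks at the signs of all eigenvalues of the Hessian rather than at a single determinant.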

## 3.3. Applications of the differentials

### 3.3.1. The Taylor series approximation

In [6]:
IFrame("https://www.geogebra.org/classic/kc2umqak", 1000, 800)

--------------
**Definition of the Taylor polynomial of order $n$**

The Taylor polynomial of an $n$-times differentiable function $f:\mathbb{R} \to \mathbb{R}$ at a point $x_0$ is the polynomial $p$ of degree at most $n$, for which it holds that

$$\left\{
\begin{align*}
f(x_0) &= p(x_0)\\
f'(x_0) &= p'(x_0)\\
f''(x_0) &= p''(x_0)\\
\vdots\\
f^{(n)}(x_0) &= p^{(n)}(x_0)
\end{align*}\right.$$

-------
**Remark**<br>
1. Observe that the Taylor polynomial is uniquely defined and it is given by the following formula

$$p(x) = \frac{f(x_0)}{0!} + \frac{f'(x_0)}{1!}(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!} (x-x_0)^n$$

where $0! = 1$ by convention.

2. If the function is nice enough (for example $(n+1)$-times continuously differentiable), then the approximation error $f(x) - p(x)$ is of the order of magnitude of $|x-x_0|^{n+1}$ for $x$ close to $x_0$.

---------------------------
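
To see the error behaviour in numbers, here is a minimal sketch, assuming the example $f(x) = e^x$ at $x_0 = 0$, where every derivative equals $1$. The helper `taylor_polynomial` is a hypothetical name; it builds $p$ from the list of derivatives at $x_0$.

```python
import math

def taylor_polynomial(derivatives_at_x0, x0):
    """Return p(x) = sum over k of f^(k)(x0) / k! * (x - x0)^k.

    derivatives_at_x0 is the list [f(x0), f'(x0), ..., f^(n)(x0)].
    """
    def p(x):
        return sum(d / math.factorial(k) * (x - x0) ** k
                   for k, d in enumerate(derivatives_at_x0))
    return p

# Assumed example: f(x) = e^x at x0 = 0, all derivatives are equal to 1.
n = 3
p = taylor_polynomial([1.0] * (n + 1), x0=0.0)

for x in [0.4, 0.2, 0.1]:
    error = abs(math.exp(x) - p(x))
    print(f"x = {x}: error = {error:.2e}, error / x^(n+1) = {error / x ** (n + 1):.4f}")
```

The last column should stay roughly constant as $x$ approaches $x_0$, which is what an error of the order of $|x - x_0|^{n+1}$ means.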

**Remark**<br>
The above polynomial has the property that the function value and the first $n$ derivatives of the original function $f$ and the polynomial $p$ are exactly the same at the point $x=x_0$. This polynomial is uniquely defined.

--------------------
**The Taylor approximation of a multivariate function**<br>
For a function $f: \mathbb{R}^n \to \mathbb{R}$ the Taylor approximation of order 1 is

$$l(x) = \frac{f(x_0)}{0!} + \frac{\nabla f(x_0)}{1!}\cdot (x-x_0),$$

where $\nabla f(x_0)$ denotes the Jacobian of the function in point $x_0$ and this row vector is multiplied by the column vector $x-x_0$ in the above formula.

For a function $f: \mathbb{R}^n \to \mathbb{R}$ the Taylor approximation of order 2 is

$$q(x) = \frac{f(x_0)}{0!} + \frac{1}{1!}\nabla f(x_0)\cdot (x-x_0) + \frac{1}{2} (x-x_0)^T \cdot \nabla^2 f(x_0) \cdot (x-x_0),$$

where $\nabla^2 f(x_0)$ denotes the Hessian of the function at the point $x_0$ and this matrix is multiplied from the left by the row vector $(x-x_0)^T$ and from the right by the column vector $x-x_0$ in the above formula.

------------------------
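
A small worked example may help to compare the two approximations. The sketch assumes the function $f(x, y) = e^x \cos(y)$ and the expansion point $x_0 = (0, 0)$, chosen only for illustration; the gradient and the Hessian at $x_0$ were computed by hand and are entered as constants.

```python
import math

def f(x):
    return math.exp(x[0]) * math.cos(x[1])

x0 = [0.0, 0.0]
f0 = 1.0                 # f(0, 0)
grad = [1.0, 0.0]        # (e^x cos y, -e^x sin y) evaluated at (0, 0)
hess = [[1.0, 0.0],      # [[ e^x cos y, -e^x sin y],
        [0.0, -1.0]]     #  [-e^x sin y, -e^x cos y]] evaluated at (0, 0)

def l(x):
    """Taylor approximation of order 1 around x0."""
    d = [xi - x0i for xi, x0i in zip(x, x0)]
    return f0 + sum(g * di for g, di in zip(grad, d))

def q(x):
    """Taylor approximation of order 2 around x0."""
    d = [xi - x0i for xi, x0i in zip(x, x0)]
    second = sum(d[i] * hess[i][j] * d[j]
                 for i in range(len(d)) for j in range(len(d)))
    return l(x) + 0.5 * second

x = [0.1, 0.2]
print("f(x) =", f(x))
print("l(x) =", l(x), " error:", abs(f(x) - l(x)))
print("q(x) =", q(x), " error:", abs(f(x) - q(x)))
```

Close to $x_0$ the quadratic approximation should have a visibly smaller error than the linear one.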

**Remark**<br>
The gradient or Jacobian of the function $f$ has the following two properties, which are crucial for our forthcoming applications:
- in a fixed point $x =(x_1, x_2, \ldots, x_n)$ the gradient/ Jacobian $\nabla f$ points up the hill along the steepest direction
- its length is proportional to the steepness.

*Proof*<br>
In case of a univariate function $f:\mathbb{R} \to \mathbb{R}$, in the proximity of a chosen point $x$, the best linear approximation of the function $f$ is given by the equation

$$y = f(x) + f'(x) \Delta x $$

Therefore the total change of the function $f(x + \Delta x) - f(x)$ can be approximated by $f'(x) \Delta x$.

Similarly in case of a multivariate function $f: \mathbb{R}^n \to \mathbb{R}$ the total change around a point $x$ can be approximated by $\langle \nabla f(x),\Delta x\rangle = \nabla f(x) \cdot \Delta x$, where $\Delta x$ is a small displacement vector in the input space $\mathbb{R}^n$. Its components $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are called the increments.

Let us recall that 

$$\nabla f(x) \cdot \Delta x = ||\nabla f(x)|| \cdot ||\Delta x|| \cdot \cos(\theta),$$

where $\theta$ is the angle of the vectors $\nabla f(x)$ and $\Delta x$.

The function is the steepest into the direction for which the total change of the function is maximal. Therefore we would like to determine the unit-length vector $\Delta x$ for which $\nabla f(x) \cdot \Delta x$ is maximal. By the previous formula this will be achieved when $\cos(\theta) = 1 \Leftrightarrow \theta = 0$, i.e. $\Delta x$ is the unit vector pointing into the same direction as $\nabla f(x)$, namely 

$$ \Delta x = \frac{\nabla f(x)^T}{||\nabla f(x)||}$$  

By this we have shown that the Jacobian $\nabla f(x)$ is pointing towards the steepest direction up the hill.

Furthermore,
$$\begin{align*}
&f(x+\Delta x) - f(x) \simeq \nabla f(x) \cdot \Delta x\\
&\max_{||\Delta x|| = 1} \nabla f(x) \cdot \Delta x = \nabla f(x) \cdot \frac{\nabla f(x)^T}{||\nabla f(x)||} = \frac{||\nabla f(x)||^2}{||\nabla f(x)||} = ||\nabla f(x)||
\end{align*},$$

which shows that its length is proportional to the steepness.
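
The argument can also be checked by brute force. The sketch below assumes the example function $f(x, y) = x^2 + 3y^2$ and the point $(1, 1)$, chosen only for illustration. It tries unit vectors in many directions and records the one with the largest approximate change $\nabla f \cdot \Delta x$.

```python
import math

# Assumed example: f(x, y) = x^2 + 3 * y^2 has gradient (2x, 6y).
def grad_f(x, y):
    return (2 * x, 6 * y)

gx, gy = grad_f(1.0, 1.0)
grad_norm = math.hypot(gx, gy)

# Try unit vectors in 3600 evenly spaced directions and keep the best one.
best_angle, best_change = None, -math.inf
for k in range(3600):
    angle = 2 * math.pi * k / 3600
    change = gx * math.cos(angle) + gy * math.sin(angle)   # gradient times unit vector
    if change > best_change:
        best_angle, best_change = angle, change

gradient_angle = math.atan2(gy, gx)
print("best direction (rad):    ", best_angle)
print("gradient direction (rad):", gradient_angle)
print("largest change:", best_change, " gradient length:", grad_norm)
```

Up to the resolution of the search, the best direction should coincide with the direction of the gradient and the largest change with its length.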

**Remark**<br>
Contour lines are curves along which the value of the function stays constant. Most probably you have seen contour lines of mountains or of the sea floor on maps.

In the below animation you can move from one contour line to another by changing the value of $z$ on the slider and you can also move the point $A$ on the active contour line. What do you observe? What is the relation between the contour lines and the Jacobian / gradient?


In [6]:
IFrame("https://www.geogebra.org/classic/atrvy2e9", 1200, 800)

The Jacobian vector is ................................. the contour lines.

To prove this formally, consider the following setting. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function whose graph is an $n$-dimensional surface in $\mathbb{R}^{n+1}$. And let $c:t \mapsto \left(
\begin{array}{c}
c_1(t)\\
c_2(t)\\
\vdots\\
c_n(t)
\end{array}
\right)$ be a curve in the parameter space of $f$ that runs along a contour line, i.e. $f(c(t)) = k$ for some constant $k \in \mathbb{R}$.

The above property follows by differentiating the identity $f(c(t)) = k$ w.r.t. the variable $t$. The right-hand side is constant, so its derivative is $0$, and by the multivariate chain rule we obtain that

$$\nabla f(c(t)) \cdot \nabla c(t) = 0,$$

i.e. the Jacobian / gradient of the function $f$ is perpendicular to the tangent of the contour line.
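
A concrete contour line makes this relation easy to test. The sketch assumes the function $f(x, y) = \frac{x^2}{4} + y^2$ and its contour line at level $k = 1$, parametrised as $c(t) = (2\cos t, \sin t)$; both are choices made only for this example.

```python
import math

def f(x, y):
    return x ** 2 / 4 + y ** 2

def grad_f(x, y):
    return (x / 2, 2 * y)

def c(t):
    """Assumed parametrisation of the contour line f = 1 (an ellipse)."""
    return (2 * math.cos(t), math.sin(t))

def c_prime(t):
    """Tangent vector of the curve c at the parameter value t."""
    return (-2 * math.sin(t), math.cos(t))

def gradient_dot_tangent(t):
    x, y = c(t)
    gx, gy = grad_f(x, y)
    tx, ty = c_prime(t)
    return gx * tx + gy * ty

for t in [0.0, 0.5, 1.0, 2.0]:
    x, y = c(t)
    print(f"t = {t}: f(c(t)) = {f(x, y):.6f}, gradient . tangent = {gradient_dot_tangent(t):.2e}")
```

The function value stays at $1$ along the curve, and the product of the gradient with the tangent vector vanishes up to rounding.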

### 3.3.2. The Newton-Raphson method 

The Newton-Raphson method is used to approximate a root of a function.

Observe how it works and identify the steps of the method.

In [8]:
IFrame("https://www.geogebra.org/classic/mm9xvyxr", 800, 600)

The Newton-Raphson method is an iterative method. 

We consider a function $f:\mathbb{R} \to \mathbb{R}$.
    
The purpose of this method is to approximate roots of the function, i.e. such $x$ values for which $f(x) = 0$.

Let us assume that we know the value of the function and of its derivative at a point $x_0$, i.e. we know $f(x_0)$ and $f'(x_0)$. We approximate the behaviour of the function by the tangent line

$$f(x) \simeq l(x) = f(x_0) + f'(x_0)\cdot (x-x_0)$$

and we solve the equation 

$$l(x) = 0$$

The solution of this will be denoted by $x_1$ and by solving the above linear equation we obtain that

$$x_1 = x_0 -  \frac{f(x_0)}{f'(x_0)}$$

$x_1$ is our second approximation for a root of $f$. 

If we continue the process now by constructing the tangent line in $x_1$ and defining the next point as an intersection of the tangent with the $x$-axis, then 

$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}$$

will be our third approximation for the root.

If the function is nice enough and the starting point $x_0$ is close enough to a root, then this method converges to a root of the function.
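
The iteration can be written down in a few lines. The following is a minimal sketch, not a robust root finder: the function name `newton_raphson`, the stopping tolerance and the iteration limit are assumptions, and the example $f(x) = x^2 - 2$ with starting point $x_0 = 1$ is chosen only for illustration.

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Approximate a root of f starting from x0.

    Returns the list of all approximations x0, x1, x2, ...
    """
    xs = [x0]
    for _ in range(max_iter):
        x = xs[-1]
        slope = f_prime(x)
        if slope == 0:
            raise ZeroDivisionError("horizontal tangent: the Newton step is undefined")
        x_next = x - f(x) / slope   # intersection of the tangent line with the x-axis
        xs.append(x_next)
        if abs(x_next - x) < tol:
            break
    return xs

# Assumed example: the positive root of f(x) = x^2 - 2 is sqrt(2).
approximations = newton_raphson(lambda x: x ** 2 - 2, lambda x: 2 * x, x0=1.0)
for k, x_k in enumerate(approximations):
    print(k, x_k)
```

In this example the number of correct digits roughly doubles in every step, which is typical for the method when it converges.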

<!--The method can generalised for functions of type $f: \mathbb{R}^n \to \mathbb{R}$. In this setting we start by choosing an $x_0$ and the next value of the sequence that we construct can be determined as

$$x_{n+1} = x_{n} - \lambda \nabla f(x_n) .$$-->

### 3.3.3. Gradient descent method
 
The gradient descent method is similar to the Newton-Raphson method in the sense that it is an iterative method in which every step is determined by the derivative at the current point. The difference is that the goal of this process is to minimise a cost function $C: \mathbb{R} \to \mathbb{R}$ (or $C: \mathbb{R}^n \to \mathbb{R}$ in the multivariate case). In every iteration we recalculate the gradient at the current point $x_m$ and move in the direction of steepest descent, i.e. opposite to the gradient:

$$x_{m+1} = x_{m} - \lambda \nabla C(x_m).$$

The parameter $\lambda$ from the above formula is called the **step size** or the **learning rate** of the gradient descent algorithm.
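
The update rule translates directly into code. This is a minimal sketch with a fixed number of steps and a constant learning rate; the function name `gradient_descent` and the quadratic example cost with minimum at $(3, -1)$ are assumptions made for this illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeat the update x <- x - learning_rate * grad(x) and return the final point."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x

# Assumed cost C(x1, x2) = (x1 - 3)^2 + 2 * (x2 + 1)^2 with minimum at (3, -1).
def grad_C(x):
    return [2 * (x[0] - 3), 4 * (x[1] + 1)]

minimum = gradient_descent(grad_C, x0=[0.0, 0.0], learning_rate=0.1, steps=100)
print(minimum)
```

The choice of the learning rate matters: for this particular cost any learning rate above $0.5$ makes the second coordinate overshoot further in every step, so the iteration diverges.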

To learn about the different types of gradient descent used in a machine learning context please visit [this link](https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a).


In [5]:
IFrame("https://www.geogebra.org/classic/xfa7y3wc", 1200, 800)


### 3.3.4. Backpropagation

To perform the gradient descent method, we need to calculate the Jacobian of the cost function w.r.t. all the parameters of the model. As we have seen in the introduction, a neural network can be very complex, but here as a starting point for backpropagation let us consider the following simple network

<center>
<img src="Images/Network4Backpropagation1.png" width="400"> 
</center>

Let us assume that the cost function is the squared error 

$$C = (y^{(1)} - y)^2$$

and let us assume that we have just one training example.

Backpropagation is nothing else but the repeated application of the chain rule in order to calculate the Jacobian of $C$ w.r.t. the model parameters $w^{(1)}$ and $b^{(1)}$. Let us write down what leads us from the input values $y^{(0)} = x$ all the way to the cost function $C$:

$$
\begin{align*}
&z^{(1)} = b^{(1)} + w^{(1)}y^{(0)}\\
\\
&y^{(1)} = g\left(z^{(1)}\right)\\
\\
&C = \left(y^{(1)} - y\right)^2
\end{align*}
$$

From the above the following formula follows directly by the chain rule

$$\mathbf{\frac{\partial C}{\partial w^{(1)}}} = \frac{\partial C}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial w^{(1)}} = \mathbf{\frac{\partial C}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}}$$

We obtain similarly that 

$$\frac{\partial C}{\partial b^{(1)}} = \frac{\partial C}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial b^{(1)}}$$

-----
**Remark**<br>
For a given activation function $g$ in the above formulae we can calculate every Jacobian

$$\begin{align*}
\frac{\partial C}{\partial y^{(1)}} = 2 \left(y^{(1)} - y\right) \quad \quad
\frac{\partial y^{(1)}}{\partial z^{(1)}} = g'\left(z^{(1)}\right) \quad \quad
\frac{\partial z^{(1)}}{\partial w^{(1)}} = y^{(0)} \quad \quad
\frac{\partial z^{(1)}}{\partial b^{(1)}} = 1
\end{align*}$$

-----
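
The four factors can be combined in code and compared with a finite difference. The sketch below assumes the sigmoid as activation function $g$ (the text leaves $g$ unspecified) and hypothetical values for $w^{(1)}$, $b^{(1)}$ and the single training example $(x, y)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, x, y):
    """Forward pass: z = b + w * x, y1 = g(z), C = (y1 - y)^2."""
    return (sigmoid(b + w * x) - y) ** 2

def backprop(w, b, x, y):
    """Return (dC/dw, dC/db) as products of the factors of the chain rule."""
    z = b + w * x
    y1 = sigmoid(z)
    dC_dy1 = 2 * (y1 - y)
    dy1_dz = sigmoid_prime(z)
    dz_dw = x
    dz_db = 1.0
    return dC_dy1 * dy1_dz * dz_dw, dC_dy1 * dy1_dz * dz_db

# Assumed parameter values and a single training example (x, y).
w, b, x, y = 0.5, -0.3, 2.0, 1.0
dC_dw, dC_db = backprop(w, b, x, y)

# Finite difference check of both partial derivatives.
h = 1e-6
dC_dw_numeric = (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2 * h)
dC_db_numeric = (cost(w, b + h, x, y) - cost(w, b - h, x, y)) / (2 * h)
print(dC_dw, dC_dw_numeric)
print(dC_db, dC_db_numeric)
```

Comparing analytic gradients with finite differences in this way is a common sanity check when implementing backpropagation by hand.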

Let us consider the setting of the more complex neural network below


<center>
<img src="Images/Network4Backpropagation2.png" width="400"> 
</center>

For this setting the formulae leading from the input to the output can be summarised similarly in matrix form

$$
\begin{align*}
&z^{(1)} = b^{(1)} + w^{(1)}y^{(0)}\\
\\
&y^{(1)} = g_1\left(z^{(1)}\right)\\
\\
&C = ||y^{(1)} - y||^2 = (y^{(1)}_1 - y_1)^2 + (y^{(1)}_2 - y_2)^2 + (y^{(1)}_3 - y_3)^2 = (y^{(1)}-y)^T \cdot (y^{(1)}-y),
\end{align*}
$$

where 

$$
\begin{align*}
y^{(0)} = \left(
\begin{array}{c}
x_1\\
x_2\\
\vdots\\
x_d
\end{array}
\right), \quad 
w^{(1)} = \left(
\begin{array}{cccc}
w_{1,1}^{(1)} & w_{1,2}^{(1)} & \ldots & w_{1,d}^{(1)}\\
w_{2,1}^{(1)} & w_{2,2}^{(1)} & \ldots & w_{2,d}^{(1)}\\
w_{3,1}^{(1)} & w_{3,2}^{(1)} & \ldots & w_{3,d}^{(1)}
\end{array}
\right), \quad
b^{(1)} = \left(
\begin{array}{c}
b_1^{(1)}\\
b_2^{(1)}\\
b_3^{(1)}
\end{array}
\right), \quad
z^{(1)} = \left(
\begin{array}{c}
z_1^{(1)}\\
z_2^{(1)}\\
z_3^{(1)}
\end{array}
\right), \quad
y^{(1)} = \left(
\begin{array}{c}
g_1\left(z_1^{(1)}\right)\\
g_1\left(z_2^{(1)}\right)\\
g_1\left(z_3^{(1)}\right)
\end{array}
\right), \quad
y = \left(
\begin{array}{c}
y_1\\
y_2\\
y_3
\end{array}
\right).
\end{align*}
$$

As a consequence, for this more complex neural network the desired Jacobians can be expressed by similar relations as before, namely

$$
\begin{align*}\frac{\partial C}{\partial w^{(1)}} = \frac{\partial C}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}\\
\frac{\partial C}{\partial b^{(1)}} = \frac{\partial C}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial b^{(1)}}
\end{align*}
$$

-----
**Remark**<br>
For a given activation function $g_1$ in the above formulae we can calculate every Jacobian

$$\begin{align*}
&\frac{\partial C}{\partial y^{(1)}} = 2 \left(y^{(1)} - y\right)^T = 2 \left(y^{(1)}_1 - y_1, y^{(1)}_2 - y_2, y^{(1)}_3 - y_3 \right)\\
\\
&\frac{\partial y^{(1)}}{\partial z^{(1)}} = 
\left(\begin{array}{ccc}
g_1'\left(z_1^{(1)}\right) & 0 & 0\\
0 & g_1'\left(z_2^{(1)}\right) & 0\\
0 & 0 & g_1'\left(z_3^{(1)}\right)
\end{array}
\right)\\
\\
&\frac{\partial z^{(1)}}{\partial w^{(1)}} = 
\left(
\begin{array}{ccc}
\left(y^{(0)}\right)^T & 0_{1\times d} & 0_{1 \times d}\\
0_{1\times d} & \left(y^{(0)}\right)^T & 0_{1\times d}\\
0_{1\times d} & 0_{1\times d} & \left(y^{(0)}\right)^T
\end{array}
\right)\\
\\
&\frac{\partial z^{(1)}}{\partial b^{(1)}} = I_3
\end{align*}$$

-----
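
The matrix version can be sketched with plain lists. The example assumes the sigmoid as activation $g_1$, an input dimension $d = 2$ and hypothetical values for all parameters. Multiplying the three Jacobians gives a row vector of length $3d$; the code stores it reshaped as a $3 \times d$ matrix whose entry $(i, j)$ is the derivative of $C$ w.r.t. $w_{i,j}^{(1)}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, x):
    """Return (z, y1) with z = b + W x and y1 = g_1(z) applied componentwise."""
    z = [bi + sum(wij * xj for wij, xj in zip(row, x)) for row, bi in zip(W, b)]
    return z, [sigmoid(zi) for zi in z]

def cost(W, b, x, y):
    _, y1 = forward(W, b, x)
    return sum((y1i - yi) ** 2 for y1i, yi in zip(y1, y))

def backprop(W, b, x, y):
    """Return (dC/dW as a 3 x d matrix, dC/db as a list of length 3)."""
    _, y1 = forward(W, b, x)
    # Row vector dC/dy1 times the diagonal matrix dy1/dz (sigmoid' = y1 * (1 - y1)).
    delta = [2 * (y1i - yi) * y1i * (1 - y1i) for y1i, yi in zip(y1, y)]
    # dz/dW holds y0^T in block i of row i, so entry (i, j) of dC/dW is delta_i * x_j.
    dC_dW = [[di * xj for xj in x] for di in delta]
    # dz/db is the identity matrix, so dC/db equals delta.
    return dC_dW, delta

# Assumed parameter values and one training example.
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b = [0.0, 0.1, -0.1]
x = [1.0, 2.0]
y = [1.0, 0.0, 0.5]
dC_dW, dC_db = backprop(W, b, x, y)

# Finite difference check for the weight in row 3, column 2 (indices [2][1]).
h = 1e-6
W_shift = [row[:] for row in W]
W_shift[2][1] += h
cost_plus = cost(W_shift, b, x, y)
W_shift[2][1] -= 2 * h
numeric = (cost_plus - cost(W_shift, b, x, y)) / (2 * h)
print(dC_dW[2][1], numeric)
```

Because the Jacobian of $y^{(1)}$ w.r.t. $z^{(1)}$ is diagonal and the Jacobian of $z^{(1)}$ w.r.t. $w^{(1)}$ has a block structure, the full matrices never need to be built; componentwise products are enough.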

<!--
Speed vs time, tangent -> acceleration
Is the speed as function the derivative of some other function? Related to the integral, antiderivative. Distance vs time.

Geometrical def. of the derivative: "rise over run" gradient - for two points on the graph of the function

the gradient of the tangent line

Formal definition of the derivative with $\Delta x$, $f$ and $\lim$.

Ex. Calculate the derivative of a linear function.
Ex. Calculate the derivative of a parabolic function / polynom of grade 2.

Sum rule, power rule.

Special functions and their derivatives: 1/x, e^x (the only function being equal to its derivative), 

Product rule. Quotient rule can be derived also from the product rule and whenever the quotient rule should be used, one can use equivalently the product rule as well.

## Chain law / rule

## Multivariate differentiation
dependent and independent variables, how do we select them? speed can be depicted as the function of time, but not the other way around.

partial differentiation - fix all the variables except one constant and calc. the derivative w.r.t. the remaining variable

Chain rule for the multivariate setup.

## The Jacobian 
The Jacobian vector of a mutivariate function. - the vector pointing to the steepest slope. Contour lines with Jacobian directions.

The Jacobian for vector valued functions.

## Looking for extremal values of a function. 
Context: y = f(x), z = f(x,y)

Sandpit. Find the deepest point by Jacobians, by the depth of the pit.

## The Hessian - the Jacobian of the Jacobian
Hessian - shows whether the found stationary point is a min or max point.

## In reality we don't know the function
We estimate also the Jacobians. What should be the stepsize? We calculate the approx. of the Jac. for different step sizes and we average out.-->

<!--## Total derivative
When our function depends on n variables and the variables depend on the same parameter t. 
$$\frac{{\rm d}f}{{\rm d}t} = \frac{\partial f}{\partial x} \cdot \frac{{\rm d} x}{{\rm d} t} = J_f \frac{{\rm d} x}{{\rm d} t}$$-->

