In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.4 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

# Derivatives

Let $U \subset \mathbb{R}$ be an open subset and $f:U \subset \mathbb{R} \to \mathbb{R}$.

The function $f$ has a derivative at $x_0 \in U$ if the following limit exists: $$\lim_{h \to 0}\frac{f(x_0 + h) - f(x_0)}{h}$$
<br>
This limit is called the derivative of $f$ at the point $x_0$ and is denoted by $f'(x_0)$

![SegmentLocal](images/calc2/derivat1.gif "segment")

![SegmentLocal](images/calc2/derivat2.gif "segment")

Using the properties of limits we can write: 
$$\lim_{h\to 0}\frac{f(a+h) - f(a)}{h} = f'(a)$$ 
then 
$$\lim_{h\to 0}\frac{f(a+h) - f(a)}{h} - f'(a) = 0$$
$$\lim_{h\to 0}\frac{f(a+h) - f(a)}{h} - \lim_{h\to 0} f'(a) = 0$$
which implies: $$\lim_{h\to 0}\frac{f(a+h) - f(a) - h \cdot f'(a)}{h} = 0$$
and 
$$
\lim_{h\to 0}\frac{f(a+h) - (f(a) + f'(a)\cdot h)}{h} = 0
$$
So for small $h$ we can consider that:
$$
f(a+h) \approx f(a) + f'(a)h
$$
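
The approximation above can be checked numerically. The following is a minimal sketch, assuming $f = \exp$ and $a = 1$ as an illustrative choice; the helper names `derivative` and `linear_approx` are hypothetical and not part of any library. It uses the difference quotient from the definition with a small fixed $h$.

```python
import math

def derivative(f, x0, h=1e-6):
    """Approximate f'(x0) with the difference quotient (f(x0 + h) - f(x0)) / h."""
    return (f(x0 + h) - f(x0)) / h

def linear_approx(f, a, h):
    """First-order estimate f(a + h) ~ f(a) + f'(a) * h."""
    return f(a) + derivative(f, a) * h

a = 1.0
for step in (0.5, 0.1, 0.01):
    exact = math.exp(a + step)
    approx = linear_approx(math.exp, a, step)
    # The error shrinks faster than the step itself as the step gets smaller.
    print(f"h={step}: exact={exact:.6f} approx={approx:.6f} error={abs(exact - approx):.2e}")
```

The smaller the step $h$, the closer the linear estimate is to the true value.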

If the function is differentiable at each point of $U$ then we can define the derivative function $f':U \subset \mathbb{R} \to \mathbb{R}$ as $f':x \mapsto f'(x)$ 

Example: If $f(x) = x^a$ then $f'(x) = ax^{a-1}$ for any $a \in \mathbb{R}$ (for $x > 0$)
<br>
Example: If $f(x) = a$ is a constant function then $f'(x) = 0$

Properties of derivatives:
- $(f + g)'(x) = f'(x) + g'(x)$
- $(af(x))' = af'(x)$
- $(fg)'(x) = f'(x)g(x) + f(x)g'(x)$

The chain rule: Let $f:\mathbb{R} \to \mathbb{R}$ and $g:\mathbb{R} \to \mathbb{R}$ be differentiable. Then the composition $g \circ f:\mathbb{R} \to \mathbb{R}$ has derivative $(g \circ f)'(x) = g'(f(x)) f'(x)$, and for the derivative function we have: $(g \circ f)' = (g' \circ f) \cdot f'$

Example: $(g \circ f)(x) = (ax - b)^2$. Here $f(x) = ax - b$ and $g(y) = y^2$. For the composition we can treat $ax - b$ as a single variable, which gives $g'(f(x)) = 2(ax - b)$; multiplying by $f'(x) = a$ we get $(g \circ f)'(x) = 2a(ax - b)$
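
As a quick sanity check of the chain rule on this example, here is a small sketch that compares $g'(f(x)) f'(x) = 2a(ax - b)$ with a difference quotient of the composition. The values $a = 3$, $b = 2$ and the point $x_0 = 1.5$ are arbitrary assumptions chosen for illustration.

```python
A, B = 3.0, 2.0  # illustrative constants a and b

def f(x):
    return A * x - B

def g(y):
    return y ** 2

def composed(x):
    return g(f(x))

def chain_rule_derivative(x):
    # g'(f(x)) * f'(x) = 2 * (a*x - b) * a
    return 2 * (A * x - B) * A

def difference_quotient(func, x, h=1e-6):
    return (func(x + h) - func(x)) / h

x0 = 1.5
print("chain rule:         ", chain_rule_derivative(x0))
print("difference quotient:", difference_quotient(composed, x0))
```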

## Partial derivatives

Let now $f:\mathbb{R}^n \to \mathbb{R}^m$ be a function between two normed vector spaces.

The partial derivative of a function $f(x_1, \dots, x_n)$ in the direction $x_i$ at the point $(a_1, \dots, a_n)$ is defined to be:

$$
\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) = \lim_{h \to 0}\frac{f(a_1, \ldots, a_i+h,\ldots,a_n) - f(a_1,\ldots, a_i, \dots,a_n)}{h}
$$

All the variables are fixed except $x_i$. That choice of fixed values determines a function of one variable

$$f_{a_1,\ldots,a_{i-1},a_{i+1},\ldots,a_n}(x_i) = f(a_1,\ldots,a_{i-1},x_i,a_{i+1},\ldots,a_n)$$

and by definition,

$$\frac{df_{a_1,\ldots,a_{i-1},a_{i+1},\ldots,a_n}}{dx_i}(a_i) = \frac{\partial f}{\partial x_i}(a_1,\ldots,a_n)$$

For a function $f:\mathbb{R}^n \to \mathbb{R}^m$ and a point $x\in \mathbb{R}^n$ the value $f(x) \in \mathbb{R}^m$ is a vector, $f(x) = (f_{1}(x), f_{2}(x), \dots, f_{m}(x))$, so the partial derivative can be considered as:
<br>
$$\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) = \frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0}\frac{1}{h}(f(a_1, \ldots, a_i+h,\ldots,a_n) - f(a_1,\ldots, a_i, \dots,a_n))$$
<br>
Since the limit of a vector is taken coordinate by coordinate:
$$\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) = (\lim_{h \to 0}\frac{f_{1}(a_1, \ldots, a_i+h,\ldots,a_n) - f_{1}(a_1,\ldots, a_i, \dots,a_n)}{h}, \lim_{h \to 0}\frac{f_{2}(a_1, \ldots, a_i+h,\ldots,a_n) - f_{2}(a_1,\ldots, a_i, \dots,a_n)}{h}, \dots, \lim_{h \to 0}\frac{f_{m}(a_1, \ldots, a_i+h,\ldots,a_n) - f_{m}(a_1,\ldots, a_i, \dots,a_n)}{h})
$$
<br>

Define the Jacobian matrix, whose columns are the partial derivatives: $$\mathbf{J} = \begin{pmatrix}
    \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_n} \end{pmatrix}$$

So, writing out each coordinate of the function values, we get the matrix:
<br>
$$\mathbf{J} = 
\begin{pmatrix}
    \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n}\\
    \vdots & \ddots & \vdots\\
    \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{pmatrix}
$$
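
The matrix above can be approximated column by column with difference quotients. The sketch below is one possible way to do it in plain Python; the function `example` ($\mathbb{R}^2 \to \mathbb{R}^3$) and the helper name `jacobian` are assumptions made up for illustration.

```python
import math

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: entry [j][i] approximates d f_j / d x_i at x."""
    fx = f(x)
    columns = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h          # move only the i-th variable
        f_shifted = f(shifted)
        columns.append([(f_shifted[j] - fx[j]) / h for j in range(len(fx))])
    # columns[i] is the vector d f / d x_i; transpose to get an m x n matrix
    return [[columns[i][j] for i in range(len(x))] for j in range(len(fx))]

def example(x):
    """Hypothetical function from R^2 to R^3."""
    x1, x2 = x
    return [x1 * x2, math.sin(x1), x2 ** 2]

J = jacobian(example, [1.0, 2.0])
for row in J:
    print([round(v, 4) for v in row])
```

For this `example` the exact Jacobian is $\begin{pmatrix} x_2 & x_1 \\ \cos x_1 & 0 \\ 0 & 2x_2 \end{pmatrix}$, which the printed rows should match up to rounding.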

Example: 
$$f(x_1, x_2, x_3) = (2x_{1}^{2} + x_2, 0.5x_{3} + 4)$$ then we have a 
$$f : \mathbb{R}^{?} \to \mathbb{R}^{?}$$
and
$$\mathbf{J} = 
\begin{pmatrix}
    ?\\
    \vdots & \ddots & \vdots\\
    ?
\end{pmatrix}
$$

Example: 
$$f(x_1, x_2, x_3) = (2x_{1}^{2} + x_2, 0.5x_{3} + 4)$$ 
and
$$g(x_1, x_2) = (x_{1}^{4}, 10x_{1}^{3} + 8x_{2}^{2})$$
then we have a 
$$(g \circ f) : \mathbb{R}^{?} \to \mathbb{R}^{?}$$
and
$$\mathbf{J} = 
\begin{pmatrix}
    ?\\
    \vdots & \ddots & \vdots\\
    ?
\end{pmatrix}
$$

Jacobian of composition:
For $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$ the chain rule generalizes to Jacobian matrices, 
$$\mathbf{J}_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \mathbf{J}_{\mathbf{f}}(\mathbf{x})
$$
for $x \in \mathbb{R}^n$
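
The following sketch checks the matrix form of the chain rule on a pair of hypothetical functions (different from the exercise above, so as not to give its answer away). It multiplies the hand-computed Jacobians and compares the product with a finite-difference Jacobian of the composition. All function names here are illustrative assumptions.

```python
def f(x):
    x1, x2 = x
    return [x1 + x2, x1 * x2]

def g(y):
    y1, y2 = y
    return [y1 * y2, y1 ** 2]

def jacobian_f(x):
    x1, x2 = x
    return [[1.0, 1.0], [x2, x1]]

def jacobian_g(y):
    y1, y2 = y
    return [[y2, y1], [2 * y1, 0.0]]

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def numerical_jacobian(func, x, h=1e-6):
    fx = func(x)
    J = [[0.0] * len(x) for _ in fx]
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        f_shifted = func(shifted)
        for j in range(len(fx)):
            J[j][i] = (f_shifted[j] - fx[j]) / h
    return J

x = [1.0, 2.0]
product = matmul(jacobian_g(f(x)), jacobian_f(x))       # J_g(f(x)) J_f(x)
numeric = numerical_jacobian(lambda p: g(f(p)), x)      # J of the composition
print(product)
print([[round(v, 3) for v in row] for row in numeric])
```

Note that $\mathbf{J}_{\mathbf{g}}$ is evaluated at $\mathbf{f}(\mathbf{x})$, not at $\mathbf{x}$.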

## Directional derivative

Let $U$ be an open subset of $\mathbb{R}^n$ and $f: U \subset \mathbb{R}^n \to \mathbb{R}$ a function ($f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$). For any vector $\mathbf{v} \in \mathbb{R}^n$ define the directional derivative:

$$\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \lim_{h \rightarrow 0}{\frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h}}$$
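
A minimal numerical sketch of this definition, assuming the illustrative function $f(x, y) = x^2 + 3xy$ (not from the text) and a small fixed $h$ in place of the limit:

```python
def directional_derivative(f, x, v, h=1e-6):
    """Approximate the limit (f(x + h v) - f(x)) / h with a small fixed h."""
    shifted = [xi + h * vi for xi, vi in zip(x, v)]
    return (f(shifted) - f(x)) / h

def f(p):
    x, y = p
    return x ** 2 + 3 * x * y

point = [1.0, 2.0]
print(directional_derivative(f, point, [1.0, 0.0]))  # along x_1: the partial derivative in x
print(directional_derivative(f, point, [0.0, 1.0]))  # along x_2: the partial derivative in y
print(directional_derivative(f, point, [1.0, 1.0]))  # along the diagonal
```

When $\mathbf{v}$ is a coordinate unit vector, the directional derivative is exactly the corresponding partial derivative.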

## Total derivative

Define a generalization of the derivative for a function $f:U \subset \mathbb{R}^n \to \mathbb{R}^m$ on an open set $U$ in the following manner:
The function $f$ is differentiable at $a \in U$ if there exists a linear transformation
$$df_a:\mathbb{R}^n \rightarrow \mathbb{R}^m$$ such that:
$$
\lim_{x\rightarrow a}\frac{\|f(x)-f(a)-df_a(x-a)\|}{\|x-a\|}=0
$$

As in the one-variable case, for small $h$ we can consider that:
$$f(a + h) \approx f(a) + df_a(h)$$

Property: The function $f$ is differentiable if and only if each component $f_i \colon U \to \mathbb{R}$ is differentiable

Property: If the function is differentiable at a point then it has all partial derivatives $\partial f/\partial x_i$, $i \in \{1, \dots , n\}$, at that point

If the partial derivatives $\partial f/\partial x_i$ exist for each $i \in \{1, \dots , n\}$ and are continuous in some neighborhood of the point $a$, then the total derivative at $a$ exists and corresponds to the Jacobian matrix.

#### Example:
$$f(x,y) = \begin{cases}x & \text{if }y \ne x^2 \\ 0 & \text{if }y = x^2\end{cases}$$

#### Example:
$$f(x,y) = \begin{cases}y^3/(x^2+y^2) & \text{if }(x,y) \ne (0,0) \\ 0 & \text{if }(x,y) = (0,0)\end{cases}$$
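
The text does not say what these examples are meant to show, so the following is only one possible reading of the second one: at the origin every directional derivative exists, yet they do not depend linearly on the direction, so no linear map $df_0$ can satisfy the definition above. The sketch estimates the quotients numerically; the helper name `directional_at_origin` is hypothetical.

```python
def f(x, y):
    if (x, y) == (0.0, 0.0):
        return 0.0
    return y ** 3 / (x ** 2 + y ** 2)

def directional_at_origin(v, h=1e-6):
    """Difference quotient (f(h v) - f(0)) / h at the origin."""
    return (f(h * v[0], h * v[1]) - f(0.0, 0.0)) / h

partial_x = directional_at_origin((1.0, 0.0))       # f(h, 0) = 0, so the quotient is 0
partial_y = directional_at_origin((0.0, 1.0))       # f(0, h) = h, so the quotient is 1
along_diagonal = directional_at_origin((1.0, 1.0))  # f(h, h) = h / 2, so the quotient is 1/2

# If f were differentiable at the origin, the derivative along (1, 1)
# would have to equal partial_x * 1 + partial_y * 1.
linear_prediction = partial_x * 1.0 + partial_y * 1.0

print(partial_x, partial_y, along_diagonal, linear_prediction)
```

The mismatch between `along_diagonal` and `linear_prediction` is what rules out differentiability at $(0,0)$, even though both partial derivatives exist there.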

## Gradient of the function

Consider the function $f:\mathbb{R}^n \to \mathbb{R}$ and its Jacobian matrix of partial derivatives:
$$
\mathbf{J} = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n} \end{pmatrix}
$$

We call this vector the gradient of the function and denote it by $\nabla{f}$ or by $(\nabla f)_{x_0}$ for the gradient at the point $x_0 \in \mathbb{R}^n$

The gradient at $x_0$ gives the linear approximation of the function near the point $x_0$
$$f(x) \approx f(x_0) + (\nabla f)_{x_0}\cdot(x-x_0)$$

Gradient and directional derivative (for $f$ differentiable at $x$):
$$\big(\nabla f(x)\big)\cdot \mathbf{v} = \nabla_{\mathbf v}f(x)$$

Gradient and differential, if the function is differentiable at $x$:
$$(\nabla f)_x\cdot v = df_x(v)$$

Properties:
- $\nabla\left(\alpha f+\beta g\right)(a) = \alpha \nabla f(a) + \beta\nabla g (a)$
- $\nabla (fg)(a) = f(a)\nabla g(a) + g(a)\nabla f(a)$

Chain rule, for $g:\mathbb{R} \to \mathbb{R}^n$ and $a = g(c)$:
- $(f\circ g)'(c) = \nabla f(a)\cdot g'(c)$
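
The relations in this section can be tried out numerically. The sketch below estimates the gradient with difference quotients, then uses it for the linear approximation and for the slope along a direction. The function $f(x, y) = x^2 + 3xy$, the point and the direction are illustrative assumptions, and `gradient` is a hypothetical helper.

```python
def gradient(f, x, h=1e-6):
    """Finite-difference estimate of the gradient of f at x."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - fx) / h)
    return grad

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def f(p):
    x, y = p
    return x ** 2 + 3 * x * y

x0 = [1.0, 2.0]
grad = gradient(f, x0)                     # exact gradient is (2x + 3y, 3x) = (8, 3)

# Slope of f along the direction v, computed as gradient . v
v = [0.6, 0.8]
slope_along_v = dot(grad, v)

# Linear approximation f(x) ~ f(x0) + gradient . (x - x0) for x close to x0
x = [1.1, 1.9]
approx = f(x0) + dot(grad, [xi - x0i for xi, x0i in zip(x, x0)])

print("gradient:", [round(gi, 4) for gi in grad])
print("slope along v:", round(slope_along_v, 4))
print("f(x) =", f(x), " linear approximation =", round(approx, 4))
```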


#### Function surface and gradient vector

<img src="images/calc2/multivariate_function_gradient.jpg">

The gradient points in the direction of the steepest slope and its magnitude is the rate of increase in that direction

$$\big(\nabla f(x)\big)\cdot \mathbf{v} = \nabla_{\mathbf v}f(x)$$

## Local extrema

We say that a point $x_0 \in U$ is a local minimum of the function $f:U \subset \mathbb{R}^n \to \mathbb{R}$ if there exists $r \in \mathbb{R}_+$ such that $f(x) \ge f(x_0)$
for each $x \in \mathbf{B}(x_0, r)$, that is, for each $x$ with $\|x - x_0\| < r$

In the same manner we can define a local maximum

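For $f:U \subset \mathbb{R}^n \to \mathbb{R}$ fix all variables except $x_i$ and consider the one-variable function $f(x_i)|_{x_1, x_2, \dots, x_{i-1}, x_{i+1}, \dots, x_n}: U_i \subset \mathbb{R} \to \mathbb{R}$

If the point $x_0$ is a local extremum of $f$ then it is a local extremum of each of these one-variable restrictions: the same $r$ works for every coordinate

The converse does not hold in general: taking $r = \min \{r_i \mid i \in \{1, 2, \dots, n\}\}$ only controls $f$ along the coordinate axes through $x_0$. For example, $f(x, y) = x^2 - 3xy + y^2$ has a local minimum at $0$ along each axis, but $f(t, t) = -t^2 < 0$ for $t \ne 0$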

## Local minima example

<img src="images/calc2/extreme_points_local_minimas.jpg">

## Local maxima example

<img src="images/calc2/extreme_points_local_maximas.jpg">

## Gradient directions

#### Gradient always points in the direction of steepest ascent of the function

If we consider the projection of the gradient on an arbitrary unit vector $u$, we have $\nabla{f} \cdot u = ||\nabla{f}|| \cdot ||u|| \cdot \cos{\alpha}$, where $\alpha$ is the angle between them. For a unit vector $||u|| = 1$ and thus $\nabla{f} \cdot u = ||\nabla{f}|| \cdot \cos{\alpha}$. This is maximal when $\cos{\alpha} = 1$, that is, when the unit vector and the gradient have the same direction
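
This argument can be illustrated by brute force: sample unit vectors around the circle, compute $\nabla f \cdot u$ for each, and compare the best direction with the direction of the gradient. The sketch assumes the illustrative function $f(x, y) = x^2 + 3xy$ at the point $(1, 2)$, with its gradient written out by hand.

```python
import math

def grad_f(p):
    """Gradient of the illustrative function f(x, y) = x^2 + 3xy."""
    x, y = p
    return [2 * x + 3 * y, 3 * x]

g = grad_f([1.0, 2.0])
norm = math.hypot(g[0], g[1])

best_angle_deg, best_slope = None, -math.inf
for k in range(360):                       # one unit vector per degree
    angle = math.radians(k)
    u = (math.cos(angle), math.sin(angle))
    slope = g[0] * u[0] + g[1] * u[1]      # gradient . u
    if slope > best_slope:
        best_angle_deg, best_slope = k, slope

gradient_angle_deg = math.degrees(math.atan2(g[1], g[0]))
print("best sampled direction (degrees):", best_angle_deg)
print("gradient direction (degrees):    ", round(gradient_angle_deg, 2))
print("best slope:", round(best_slope, 4), " gradient norm:", round(norm, 4))
```

The best sampled direction agrees with the gradient direction up to the one-degree sampling grid, and the best slope approaches $||\nabla f||$ from below.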

## Fermat's theorem

If a function $f\colon (a,b) \rightarrow \mathbb{R}$ has a local extremum at some point $x_0 \in (a,b)$ and $f$ is differentiable at $x_0$ then $f'(x_0) = 0$

For gradients the same holds: if $f\colon U \subset \mathbb{R}^n \to \mathbb{R}$ has a local extremum at some point $x_0 \in U$ and $f$ is differentiable at $x_0$ then $\nabla{f}(x_0) = 0$

## Saddle points

<img src="images/calc2/saddle_point.jpg">

At these points the gradient is also zero, although the point is neither a local minimum nor a local maximum
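
A standard example of this behaviour is $f(x, y) = x^2 - y^2$ at the origin (chosen here for illustration; it is not necessarily the surface in the picture). The sketch shows that the gradient vanishes there while $f$ increases along one axis and decreases along the other.

```python
def f(x, y):
    return x ** 2 - y ** 2

def grad_f(x, y):
    return (2 * x, -2 * y)

print("gradient at origin:", grad_f(0.0, 0.0))
print("along x axis:", f(0.1, 0.0), "> f(0, 0) =", f(0.0, 0.0))
print("along y axis:", f(0.0, 0.1), "< f(0, 0) =", f(0.0, 0.0))
```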

## Gradient descent

The question is how we can find a local minimum of a function with hundreds of millions of parameters

We first calculate the gradient of this function. We know that it points in the direction of steepest ascent, so we can take the negative of the gradient and change the variables step by step in this direction 

To control the speed of our descent we can multiply the negative of the gradient by some positive real number $\alpha \in \mathbb{R}_+$ (the step size)

In the one dimensional case

$x_1 = x_0 -\alpha \cdot f'(x_0)$
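
Repeating this update gives the one-dimensional descent loop. Below is a minimal sketch, assuming the illustrative function $f(x) = (x - 3)^2$, a starting point of $0$ and a step size $\alpha = 0.1$; none of these values come from the text.

```python
def df(x):
    """Derivative of the illustrative function f(x) = (x - 3)^2."""
    return 2 * (x - 3)

def gradient_descent_1d(df, x0, alpha=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - alpha * df(x)   # move against the slope
    return x

x_min = gradient_descent_1d(df, x0=0.0)
print(x_min)   # close to 3, the minimum of f
```

With this choice each step shrinks the distance to the minimum by a constant factor $1 - 2\alpha$, so a step size that is too large (here $\alpha > 1$) would make the iterates diverge instead.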

<img src="images/calc2/single_variable_function_gradient_directions.jpg">

In the multidimensional case we calculate $X_1 = X_0 -\alpha \cdot \nabla{f}(X_0)$

<img src="images/calc2/multivariate_function_gradient_directions.jpg">

Gradient descent

![SegmentLocal](images/calc2/gradient_descent_first_order.gif "gradient descent")

The speed of gradient descent depends on the "smoothness" of the function's surface

![SegmentLocal](images/calc2/gradient_descent_first_order_2.gif "gradient descent")
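
One way to see this effect (an interpretation, since the text does not define "smoothness" precisely) is to compare a round bowl $x^2 + y^2$ with an elongated one $x^2 + 10y^2$, where the curvature differs strongly between directions. The functions, starting point, step sizes and the helper `gradient_descent` below are all illustrative assumptions.

```python
import math

def gradient_descent(grad, start, alpha, tol=1e-6, max_steps=10_000):
    """Repeat X <- X - alpha * grad(X) until the gradient norm drops below tol."""
    point = list(start)
    for step in range(max_steps):
        g = grad(point)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return point, step
        point = [pi - alpha * gi for pi, gi in zip(point, g)]
    return point, max_steps

def grad_round(p):       # gradient of f(x, y) = x^2 + y^2
    return [2 * p[0], 2 * p[1]]

def grad_elongated(p):   # gradient of f(x, y) = x^2 + 10 y^2
    return [2 * p[0], 20 * p[1]]

start = [5.0, 5.0]
min_round, steps_round = gradient_descent(grad_round, start, alpha=0.4)
min_long, steps_long = gradient_descent(grad_elongated, start, alpha=0.09)

print("round bowl:    ", steps_round, "steps")
print("elongated bowl:", steps_long, "steps")
```

For the elongated bowl the step size has to stay below $0.1$, otherwise the iterates diverge along the steep $y$ direction. That small step then makes progress along the flat $x$ direction slow, which is why it needs more steps.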

For
$$f: U \subset \mathbb{R}^n \to \mathbb{R}$$
Define
$$\nabla{f}: U \subset \mathbb{R}^n \to \mathbb{R}^n$$

<img src="images/calc2/vectorf1.jpg"/>

## Saddle point

<img src="images/calc2/saddle_point.jpg">

## Hessian matrix of twice differentiable functions

$$
\mathbf H = \begin{bmatrix}
  \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex]
  \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex]
  \vdots & \vdots & \ddots & \vdots \\[2.2ex]
  \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
$$
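
The entries of $\mathbf H$ can be estimated with second-order difference quotients. The sketch below uses a central-difference formula (a numerical technique chosen here for illustration, not taken from the text) on the hypothetical function $f(x, y) = x^2 y + y^3$, whose exact Hessian is $\begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}$.

```python
def hessian(f, x, h=1e-4):
    """Central-difference estimate of the matrix of second partial derivatives."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Evaluate f with x_i moved by si*h and x_j moved by sj*h
                p = list(x)
                p[i] += si * h
                p[j] += sj * h
                return f(p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

def f(p):
    x, y = p
    return x ** 2 * y + y ** 3

H = hessian(f, [1.0, 2.0])
for row in H:
    print([round(v, 3) for v in row])
```

The off-diagonal entries agree with each other, as expected for a function with continuous second partial derivatives.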

## Analysis of Hessian matrix, Schwarz’s theorem

## Taylor series

$$f(a)+\frac {f'(a)}{1!} (x-a)+ \frac{f''(a)}{2!} (x-a)^2+\frac{f'''(a)}{3!}(x-a)^3+ \cdots$$

$$\sum_{n=0} ^ {\infty} \frac {f^{(n)}(a)}{n!} (x-a)^{n}.$$

<img src="images/calc2/taylor_series_sin_approximation.jpeg">
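
The partial sums of the series can be computed directly. The sketch below does this for $\sin$, using the fact that its $n$-th derivative is $\sin(x + n\pi/2)$; the function name `taylor_sin` and the chosen evaluation point are illustrative assumptions.

```python
import math

def taylor_sin(x, a=0.0, terms=8):
    """Sum of the first `terms` terms of the Taylor series of sin around a."""
    total = 0.0
    for n in range(terms):
        nth_derivative = math.sin(a + n * math.pi / 2)   # n-th derivative of sin at a
        total += nth_derivative / math.factorial(n) * (x - a) ** n
    return total

x = 1.0
for terms in (2, 4, 6, 12):
    approx = taylor_sin(x, terms=terms)
    print(f"terms={terms}: approx={approx:.10f} error={abs(approx - math.sin(x)):.2e}")
```

Adding more terms widens the interval around $a$ on which the polynomial follows the function closely.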

## Newton’s method

![SegmentLocal](images/calc2/newton_method.gif "gradient descent")

![SegmentLocal](images/calc2/newton_method_2.gif "gradient descent")
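
The animations are not described in the text, so the following is only a sketch of one common form of the method: Newton's method for minimization in one dimension, $x_{k+1} = x_k - f'(x_k)/f''(x_k)$, which is the root-finding iteration applied to $f'$. The function $f(x) = x^2 + e^x$ and the names below are illustrative assumptions.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-12, max_steps=50):
    """Newton iteration x <- x - f'(x) / f''(x) for a one-variable function."""
    x = x0
    for _ in range(max_steps):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Illustrative convex function f(x) = x^2 + exp(x)
def df(x):
    return 2 * x + math.exp(x)

def d2f(x):
    return 2 + math.exp(x)

x_min = newton_minimize(df, d2f, x0=0.0)
print("stationary point:", x_min, " f'(x) there:", df(x_min))
```

Each step minimizes the second-order Taylor polynomial of $f$ around the current point, which is why the second derivative appears in the update.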

## Reverse mode differentiation

For instance $f:U\subset \mathbb{R}^n \to \mathbb{R}$
<br>
$$f = f_{L} \circ f_{L-1} \circ \dots \circ f_{1}$$
where
$$f_{i}(X^{i}) = \sigma(W^{i}(X^{i})^{T} + b)$$
$W^{i} \in \mathbb{R}^{n_i \times n_{i+1}}$, $X^{i} \in \mathbb{R}^{1  \times n_{i+1}}$ and $b \in \mathbb{R}$

And we want to calculate the gradient with respect to $$W = (W^{1}, W^{2}, \dots , W^{L})$$

# Reverse Mode Differentiation

$$
\frac{\partial y}{\partial x}
= \frac{\partial y}{\partial w_{n-1}} \frac{\partial w_{n-1}}{\partial x}
= \frac{\partial y}{\partial w_{n-1}} \left(\frac{\partial w_{n-1}}{\partial w_{n-2}} \frac{\partial w_{n-2}}{\partial x}\right)
= \frac{\partial y}{\partial w_{n-1}} \left(\frac{\partial w_{n-1}}{\partial w_{n-2}} \left(\frac{\partial w_{n-2}}{\partial w_{n-3}} \frac{\partial w_{n-3}}{\partial x}\right)\right)
= \cdots
$$

<img src="images/calc2/reverse_mode_differentiation.png">

Differentiate in reverse order and cache each intermediate result once

<img src="images/calc2/reverse1.png">

Then for concrete values we have

<img src="images/calc2/reverse2.png">

$$
\frac{\partial}{\partial a}(a+b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1
$$
<br>
$$
\frac{\partial}{\partial u}uv = u\frac{\partial v}{\partial u} + v\frac{\partial u}{\partial u} = v
$$

<img src="images/calc2/reverse3.png">
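
The idea can be sketched in a few lines of Python. The class `Var` and the function `backward` below are hypothetical and built only for illustration, not a real library API. The expression $e = (a + b)(b + 1)$ with $a = 2$, $b = 1$ is also an assumption, since the figures are not reproduced in text. Each node stores the local derivatives from the sum and product rules above, and one backward pass reuses them for every input.

```python
class Var:
    """A node of the computation graph with its value and accumulated derivative."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent node, local derivative d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(u + v)/du = 1 and d(u + v)/dv = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(u v)/du = v and d(u v)/dv = u
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def backward(output):
    """Propagate d output / d node from the output back to every input."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)       # parents are listed before the nodes that use them
    visit(output)
    output.grad = 1.0
    for node in reversed(order): # each node is processed once, after all its consumers
        for parent, local in node.parents:
            parent.grad += node.grad * local

a, b = Var(2.0), Var(1.0)
c = a + b
d = b + 1.0
e = c * d
backward(e)
print("e =", e.value, " de/da =", a.grad, " de/db =", b.grad)
```

A single backward pass yields the derivatives with respect to all inputs at once, which is what makes this order attractive when there are many parameters and one scalar output.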