In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Deep Neural Networks

#### Linear Regression
<br>
$$
F(x) = Wx + b
$$
<br>
where:
$$
W = (W_{1}, W_{2}, \dots, W_{m}) \in \mathbb{R}^{1 \times m}
\text{, }
x = \begin{pmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{pmatrix} \in \mathbb{R}^{m \times 1}
\text{ and }
b \in \mathbb{R}
$$

#### Logistic Regression

$$
F(x) = \sigma(Wx + b)
$$

<div>
<img src="images/dnn/logistic_classifier.png"  height="600" width="800" />
</div>
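The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the numeric values of `W`, `x`, and `b` are made up, and it assumes `W` is stored as a $1 \times m$ row and `x` as an $m \times 1$ column.

```python
import numpy as np

def linear(W, x, b):
    """Linear regression: F(x) = Wx + b."""
    return W @ x + b

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic(W, x, b):
    """Logistic regression: F(x) = sigma(Wx + b)."""
    return sigmoid(linear(W, x, b))

# Hypothetical example values with m = 3 features
W = np.array([[1.0, -2.0, 0.5]])        # shape (1, m)
x = np.array([[2.0], [1.0], [4.0]])     # shape (m, 1)
b = 0.5

y_lin = linear(W, x, b)     # 1*2 - 2*1 + 0.5*4 + 0.5 = 2.5
y_log = logistic(W, x, b)   # the same score squashed into (0, 1)
```

The only difference between the two models is the final squashing step, which turns an unbounded score into a value that can be read as a probability.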

Neural networks can be considered a cascade, or pipeline, of linear classifiers or regressors. For instance:
let $X \subseteq \mathbb{R}^m$ be our data and $Y \subseteq \mathbb{R}^n$ be the classes. Define $H \subseteq \mathbb{R}^n$ and let each $\phi_{i}:X \to H_i$ be a linear function:
<br>
$$z_{i} = \sum_{j=1}^m W_{i,j}x_j + b_i$$ or 
<br>
$$z_i = W_ix + b_i$$
<br>
where 
$$
x = \begin{pmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{pmatrix} \in \mathbb{R}^{m \times 1}
\text{, }
b = \begin{pmatrix}
       b_{1} \\
       b_{2} \\
       \vdots \\
       b_{n}
     \end{pmatrix} \in \mathbb{R}^{n \times 1}
$$
and $W_i = (W_{i,1}, W_{i,2}, \dots, W_{i,m}) \in \mathbb{R}^{1 \times m}$
<br>

<img src="images/dnn/dnn_logistic.png" alt="Logistic Classifier Deep Neural Network" height="600" width="800" />

$$
f(x) = Wx + b
$$
<br>
here 
$$
x = \begin{pmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{pmatrix} \in \mathbb{R}^{m \times 1}
\text{, }
b = \begin{pmatrix}
       b_{1} \\
       b_{2} \\
       \vdots \\
       b_{n}
     \end{pmatrix} \in \mathbb{R}^{n \times 1}
\text{ and }
W = \begin{pmatrix}
       W_{1,1} & W_{1,2} & \dots & W_{1,m} \\
       W_{2,1} & W_{2,2} & \dots & W_{2,m} \\
       \vdots & \vdots & \ddots & \vdots \\
       W_{n,1} & W_{n,2} & \dots & W_{n,m}
     \end{pmatrix} \in \mathbb{R}^{n \times m}
$$

Now consider another mapping $\sigma:H \to A$, where $A \subseteq \mathbb{R}^n$, which applies a scalar function to each component:
<br>
$$
\sigma \colon \begin{pmatrix}
       z_{1} \\
       z_{2} \\
       \vdots \\
       z_{n}
     \end{pmatrix}
\mapsto \begin{pmatrix}
       \sigma(z_{1}) \\
       \sigma(z_{2}) \\
       \vdots \\
       \sigma(z_{n})
     \end{pmatrix}
= \begin{pmatrix}
       a_{1} \\
       a_{2} \\
       \vdots \\
       a_{n}
     \end{pmatrix}
$$

<img src="images/dnn/deep-neural-network-1.jpg" alt="Deep Neural Network" height="600" width="800" />

$$\sigma(x) = \frac{1}{1+e^{-x}}$$
Sigmoid

<img src="images/dnn/sigmoid.png" alt="Sigmoid" height="600" width="800" />

$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$
Tanh

<img src="images/dnn/tanh.png" alt="Tanh" height="600" width="800" />

$$\sigma(x) = \max(0, x)$$
ReLU

<img src="images/dnn/relu.png" alt="Relu" height="600" width="800" />

$$
f(x) = \begin{cases}
    x & \text{if } x > 0, \\
    0.01x & \text{otherwise}.
\end{cases}
$$
Leaky ReLU

<img src="images/dnn/leaky_relu.png" alt="Leaky ReLU" height="600" width="800" />
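The four activation functions above can be written as short elementwise NumPy functions. This is a minimal sketch; the function names and the default `slope=0.01` argument are illustrative choices, with the slope taken from the Leaky ReLU formula above.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x)), with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, with values in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """ReLU: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

xs = np.array([-2.0, 0.0, 3.0])
relu_out = relu(xs)          # negative inputs are clipped to 0
leaky_out = leaky_relu(xs)   # negative inputs are scaled by 0.01

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
tanh_matches = np.allclose(tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0)
```

All four act on each component independently, so they can be applied directly to a whole layer's vector of pre-activations.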

## Why DNN

Why deep neural networks?
- Dimensionality reduction
- Multi-model (ensemble)
- Feature extractor

Still, why should they work? They come with costs:
- They need more data
- Training and inference are computationally expensive
- They are a black box

For kernel methods, linear regression, random forests, or gradient boosting, there exist methods for analyzing why the model should work. For DNNs we do not have such a clear picture.

So why do we choose DNNs anyway?

## Universal Approximation Theorems

#### Theorem (The Universal Approximation Theorem):
<br>
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a nonconstant, bounded, and continuous function (called the activation function). Let $I_m$ denote the $m$-dimensional unit hypercube $[0,1]^m$. The space of real-valued continuous functions on 
$I_{m}$ is denoted by 
$C(I_{m})$. Then, given any $\varepsilon >0$ and any function $f\in C(I_{m})$, there exist an integer $N$, real constants $v_{i},b_{i}\in \mathbb {R}$ and real vectors $w_{i}\in \mathbb {R} ^{m}$ for $i=1,\ldots ,N$, such that we may define:
<br>
$$
F( x ) = \sum_{i=1}^{N} v_i \sigma \left( w_i^T x + b_i\right)
$$
<br>
as an approximate realization of the function $f$; that is,
<br>
$$
|F(x)-f(x)|<\varepsilon
$$
<br>
for all $x\in I_{m}$. In other words, functions of the form $F(x)$ are dense in $\displaystyle C(I_{m})$.
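The theorem only states that suitable $N$, $v_i$, $w_i$, $b_i$ exist; it does not say how to find them. As a rough numerical illustration for $m = 1$, the sketch below draws $w_i$ and $b_i$ at random and solves for $v_i$ by least squares. The target function, the sampling ranges, and the helper name `fit_one_hidden_layer` are all assumptions made for this example, and the error is measured only on a finite grid, so this is a sanity check rather than a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_hidden_layer(f, n_units, rng, n_grid=200):
    """Fit F(x) = sum_i v_i * sigmoid(w_i * x + b_i) to f on a grid in [0, 1].

    w_i and b_i are drawn at random (an assumption of this sketch);
    only the output weights v_i are fitted, by linear least squares.
    Returns the maximum absolute error on the grid.
    """
    x = np.linspace(0.0, 1.0, n_grid)
    w = rng.normal(0.0, 10.0, size=n_units)
    b = rng.uniform(-10.0, 10.0, size=n_units)
    features = sigmoid(np.outer(x, w) + b)      # shape (n_grid, n_units)
    v, *_ = np.linalg.lstsq(features, f(x), rcond=None)
    return float(np.max(np.abs(features @ v - f(x))))

def target(x):
    # Hypothetical continuous target function on [0, 1]
    return np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
errors = {n: fit_one_hidden_layer(target, n, rng) for n in (2, 20, 200)}
```

Comparing `errors[2]` with `errors[200]` shows the grid error shrinking as the number of hidden units $N$ grows, which is the behavior the theorem leads us to expect.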

#### Theorem (The Universal Approximation Theorem for any Compact)
<br>
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a nonconstant, bounded, and continuous function (called the activation function). Let $K \subset \mathbb{R}^m$ denote any compact subset of $\mathbb{R}^m$. The space of real-valued continuous functions on 
$K$ is denoted by 
$C(K)$. Then, given any $\varepsilon >0$ and any function $f\in C(K)$, there exist an integer $N$, real constants $v_{i},b_{i}\in \mathbb {R}$ and real vectors $w_{i}\in \mathbb {R} ^{m}$ for $i=1,\ldots ,N$, such that we may define:
<br>
$$
F( x ) = \sum_{i=1}^{N} v_i \sigma \left( w_i^T x + b_i\right)
$$
<br>
as an approximate realization of the function $f$; that is,
<br>
$$
|F(x)-f(x)|<\varepsilon
$$
<br>
for all $x\in K$. In other words, functions of the form $F(x)$ are dense in $\displaystyle C(K)$.

#### Theorem (Bounded case)
<br>
The universal approximation theorem for width-bounded networks can be expressed mathematically as follows:

For any Lebesgue-integrable function 
$f:\mathbb {R} ^{n}\rightarrow \mathbb {R}$ and any $\epsilon >0$, there exists a fully-connected ReLU network 
$\mathcal {A}$ with width $d_{m}\leq {n+4}$, such that the function 
$F_{\mathcal {A}}$ represented by this network satisfies
<br>
$$ 
\int _{\mathbb {R} ^{n}}\left|f(x)-F_{\mathcal {A}}(x)\right|\mathrm {d} x<\epsilon
$$

## Definitions and Notions

Let's define the weights of layer $l$ as $W^l$:
<br>
$$
W^l = \begin{pmatrix}
       W_{1,1}^l & W_{1,2}^l & \dots & W_{1,m^l}^l \\
       W_{2,1}^l & W_{2,2}^l & \dots & W_{2,m^l}^l \\
       \vdots & \vdots & \ddots & \vdots \\
       W_{n^l,1}^l & W_{n^l,2}^l & \dots & W_{n^l,m^l}^l
     \end{pmatrix} \in \mathbb{R}^{n^l \times m^l}
$$
<br>

$$
F(x) = \sigma(W^{L-1}(\dots \sigma(W^2\sigma(W^1x + b^1) + b^2)\dots) + b^{L-1})
$$

We denote 
$$a^l = \sigma(W^la^{l-1} + b^l)$$

and
<br>
$$
z^l = W^la^{l-1} + b^l
$$

Or
<br>
$$
f^l(a^{l-1}) = W^la^{l-1} +b^l
$$
<br>
which is the linear (affine) function of layer $l$.

So the network output is an $n^{L-1}$-dimensional vector (hyperparameter alarm: the width $n^l$ of every layer is our choice).
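The recursion $z^l = W^l a^{l-1} + b^l$, $a^l = \sigma(z^l)$ translates directly into a loop. The sketch below assumes a sigmoid activation in every layer and uses made-up layer widths (`sizes`) and random parameters purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feed x through the network, keeping every z^l and a^l.

    weights[i] has shape (n_out, n_in) and biases[i] has shape (n_out, 1).
    """
    a = x
    zs, activations = [], [a]          # activations[0] is the input itself
    for W, b in zip(weights, biases):
        z = W @ a + b                  # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)                 # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical layer widths: 4 inputs, 5 hidden units, 3 outputs
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]

x = rng.normal(size=(sizes[0], 1))
zs, activations = forward(x, weights, biases)
output = activations[-1]               # shape (3, 1)
```

Storing every $z^l$ and $a^l$ costs extra memory during the forward pass, but these are the quantities that gradient computation needs later.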

## Different Architectures

Weight sharing:
We can constrain some weights between layers to stay equal during training (they receive the same updates)

<img src="images/dnn/weight_sharing_1.jpg" alt="Weights sharing" height="600" width="800" />

<img src="images/dnn/weight_sharing_2.png" alt="Weights sharing in details" height="600" width="800" />

An example of weight sharing is the convolutional (CNN) layer
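To make the link between weight sharing and convolution concrete, the sketch below builds a dense weight matrix in which every row reuses the same three values, and checks that it gives the same result as a sliding-window operation. The kernel and input values are made up for the example.

```python
import numpy as np

kernel = np.array([1.0, 0.0, -1.0])            # the only 3 free weights
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])

# Dense layer whose rows all share the kernel weights, shifted by one position
n_out = len(x) - len(kernel) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(kernel)] = kernel

y_dense = W @ x                                  # ordinary matrix-vector product
y_conv = np.correlate(x, kernel, mode="valid")   # sliding window, same weights
```

The matrix `W` has 15 entries but only 3 distinct free parameters, which is why convolutional layers need far fewer weights than fully connected layers of the same size.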

Residual connections: add skip connections between (or among) layers

<img src="images/dnn/residual_1.jpeg" alt="Residual" height="400" width="600" />

<img src="images/dnn/residual_2.png" alt="Residual in details" height="600" width="800" />

<img src="images/dnn/residual_3.png" alt="Residual multi connections" height="600" width="800" />

<img src="images/dnn/residual_4.png" alt="Dense blocks" height="600" width="800" />
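A skip connection can be sketched as "output = input + what the layers compute". The block below is one simple variant, assuming a two-layer sub-network with a ReLU in between; real architectures differ in where they place activations and normalization, and the names and shapes here are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, b1, W2, b2):
    """Return x + F(x), where F is a small two-layer sub-network."""
    fx = W2 @ relu(W1 @ x + b1) + b2
    return x + fx                      # the skip connection adds the input back

x = np.array([[1.0], [-2.0], [3.0]])

# With all-zero parameters F(x) = 0, so the block is exactly the identity
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((3, 4)), np.zeros((3, 1))
y = residual_block(x, W1, b1, W2, b2)
```

Because the block reduces to the identity when its weights are zero, the layers only have to learn a correction to the input, and the added path gives gradients a direct route backwards.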

Other operations over layers:
UNet and Feature Pyramid Network

<img src="images/dnn/unet.png" alt="UNet" height="600" width="800" />

<img src="images/dnn/fpn.png" alt="FPN" height="600" width="800" />

Recurrent neural networks (RNN):

<img src="images/dnn/rnn.png" alt="RNN" height="600" width="800" />

LSTM, GRU Gates:

<img src="images/dnn/LSTM.png" alt="LSTM" height="600" width="800" />

<img src="images/dnn/gru.png" alt="GRU" height="600" width="800" />

## Loss landscape

On the other hand, the loss can have very different surfaces. Because there are so many parameters, the chance that the surface curves in the same direction along many of them at once is low:

<div>
<img src="images/opts/loss_ld_1.jpeg"  height="600" width="800" />
</div>

## Loss landscape and optimization

Non smooth surface:

<div>
<img src="images/opts/ls_1.jpeg"  height="600" width="800" />
</div>

Smooth surface:

<div>
<img src="images/opts/ls_2.jpeg"  height="600" width="800" />
</div>

Optimizing the step (moving average):

<div>
<img src="images/opts/lsgd_1.png"  height="600" width="800" />
</div>

Optimizing the surface (landscape):

<div>
<img src="images/opts/lsgd_2.gif"  height="600" width="800" />
</div>

## Local Extrema vs Saddle Points

Because of the nature of a neural network as a function, the probability of ending up in a local extremum is very low:
$$
p = 2^{-L}
$$

For modern architectures, where $L$ is on the order of the parameter count:
$$
p = 2^{-160000000}
$$

which is almost $0$, whereas the probability of ending up at a saddle point is very high.
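Whether a critical point is an extremum or a saddle is decided by the signs of the Hessian eigenvalues: all positive gives a local minimum, all negative a local maximum, and mixed signs a saddle. The sketch below checks this for two toy functions of two variables; the helper name `classify_critical_point` and the example functions are assumptions made for illustration.

```python
import numpy as np

def classify_critical_point(hessian):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "degenerate"

# f(x, y) = x^2 + y^2 has Hessian diag(2, 2) at the origin
bowl = classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]]))

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin
saddle = classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]]))
```

With only two directions the signs agree fairly often; with many directions every eigenvalue must have the same sign at once, which is the intuition behind the tiny probability above.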

So our task is to reach a good enough (wide) plateau where the loss satisfies our needs.

A wide plateau means that the model will be more robust and stable under changes.

Because of the mini-batch approach, think of a wide plateau as one that is stable across mini-batches and under other changes.

With full-batch training the loss function is the same at every step, but with mini-batches each step sees a different, though similar, function.

## Training Deep Neural Networks

#### Reverse mode differentiation

Problem with computing $\frac{\partial L}{\partial W_{i, j}^{l}}$ (or $\frac{\partial L}{\partial b_{i}^{l}}$):
modern neural networks have more than 100,000,000, or even more than 200,000,000, parameters and a hierarchical structure.

We can consider deep neural network as composition
<br>
$$
F = \sigma \circ f^L \circ \sigma \circ f^{L-1} \circ \dots \circ \sigma \circ f^2 \circ \sigma \circ f^1
$$
<br>

Because of this compositional structure (and the chain rule), a naive computation repeats the same multiplications multiple times

<img src="images/dnn/composition1.jpg" alt="Composition of functions"  height="600" width="800" />

Reverse mode differentiation

<img src="images/dnn/bp1.png" alt="Composition of functions" height="600" width="800" />

The same problem with DNN

<img src="images/dnn/bp2.gif" alt="Composition of functions" height="600" width="800" />
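Reverse mode differentiation can be illustrated on a small scalar composition. The sketch below uses a made-up function $y = \sin(e^{x^2})$: the forward pass stores every intermediate value, and the backward pass multiplies local derivatives from the output back to the input, reusing the stored values. The result is compared with a finite-difference estimate.

```python
import math

def forward_and_backward(x):
    """Return y = sin(exp(x^2)) and dy/dx computed in reverse mode."""
    # Forward pass: keep every intermediate value
    u = x * x
    v = math.exp(u)
    y = math.sin(v)

    # Backward pass: start at the output and walk back to the input
    dy_dv = math.cos(v)          # d sin(v) / dv
    dy_du = dy_dv * v            # d exp(u) / du = exp(u) = v, reused from forward
    dy_dx = dy_du * 2.0 * x      # d (x^2) / dx = 2x
    return y, dy_dx

x0 = 0.7
y0, grad = forward_and_backward(x0)

# Central finite difference as an independent check
h = 1e-6
numeric = (forward_and_backward(x0 + h)[0]
           - forward_and_backward(x0 - h)[0]) / (2.0 * h)
```

Each partial product such as `dy_du` is computed once and passed on. In a network with many parameters, the same accumulated factor is shared by every weight that feeds into that node.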

## Backpropagation

<a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a>

#### Hadamard Product

The elementwise product of two matrices $A, B \in \mathbb{R}^{n \times m}$ is
<br>
$$
C = A \odot B
$$
<br>
where $C_{i, j} = A_{i, j}B_{i, j}$ and $C \in \mathbb{R}^{n \times m}$

$$
A = 
\begin{pmatrix}
           A_{1,1} & A_{1,2} & \dots & A_{1,m} \\
           A_{2,1} & A_{2,2} & \dots & A_{2,m} \\
           \vdots & \vdots & \ddots & \vdots \\
           A_{n,1} & A_{n,2} & \dots & A_{n,m}
\end{pmatrix}
\text{, } 
B = 
\begin{pmatrix}
           B_{1,1} & B_{1,2} & \dots & B_{1,m} \\
           B_{2,1} & B_{2,2} & \dots & B_{2,m} \\
           \vdots & \vdots & \ddots & \vdots \\
           B_{n,1} & B_{n,2} & \dots & B_{n,m}
\end{pmatrix}
$$

$$
A \odot B = 
\begin{pmatrix}
           A_{1,1}B_{1,1} & A_{1,2}B_{1,2} & \dots & A_{1,m}B_{1,m} \\
           A_{2,1}B_{2,1} & A_{2,2}B_{2,2} & \dots & A_{2,m}B_{2,m} \\
           \vdots & \vdots & \ddots & \vdots \\
           A_{n,1}B_{n,1} & A_{n,2}B_{n,2} & \dots & A_{n,m}B_{n,m}
\end{pmatrix}
$$
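In NumPy the elementwise product is the `*` operator on arrays of the same shape, while `@` is the ordinary matrix product. A small check with made-up matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

hadamard = A * B    # elementwise: [[1*5, 2*6], [3*7, 4*8]]
matmul = A @ B      # matrix product: rows of A times columns of B
```

Mixing up the two operators is an easy mistake when implementing backpropagation, because both run without error on square matrices.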

#### Major Formulas

For the formulas below, assume that we fix the input variable $x$ and treat our model and cost function as functions of the variables $W$ and $b$.

In practice we compute the error (gradient) for each $x$ and then average the results.

Later we will try to implement everything from scratch using only the NumPy library

Let's denote
$$
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}
$$

For $l = L$ we'll get
$$
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
$$

Consider our $\delta^L$ as a vector:
$$
\delta^L = \nabla_a C \odot \sigma'(z^L).
$$

Now, each $\delta^l$ depends on the $\delta^{l+1}$ computed just before it (moving backwards through the layers):

$$
\delta^l_j = \frac{\partial C}{\partial z^l_j}
$$

$$
\delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j}
$$

$$
\delta^l_j = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k
$$

$$
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k
$$

Differentiate this
<br>
$$
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)
$$

Putting everything together, we get
<br>
$$
\delta^l_j = \sum_k w^{l+1}_{kj}  \delta^{l+1}_k \sigma'(z^l_j)
$$

So, in vectorized form
<br>
$$
\delta^l = (\sum_k w^{l+1}_{k1}  \delta^{l+1}_k \sigma'(z^l_1), \sum_k w^{l+1}_{k2}  \delta^{l+1}_k \sigma'(z^l_2), \dots, \sum_k w^{l+1}_{kp}  \delta^{l+1}_k \sigma'(z^l_p))
$$

Or, equivalently:
$$
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)
$$

This is similar to the feed-forward pass, but in the opposite direction

Let's now turn to our main point:
how can we calculate $\frac{\partial C}{\partial W^l_{i,j}}$ and $\frac{\partial C}{\partial b^l_i}$?

Turns out:
$$
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
$$

or vectorized form:
$$
\frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out}
$$

and:
$$
\frac{\partial C}{\partial b^l_j} = \delta^l_j
$$

or vectorized as well
$$
\frac{\partial C}{\partial b} = \delta
$$

$$
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_{j}}\frac{\partial z^l_j}{\partial w^l_{jk}}
$$

$$
\frac{\partial C}{\partial z^l_{j}} = \delta^l_j
$$

#### Proof for Weights

$$
\frac{\partial z^l_j}{\partial w^l_{jk}} = \frac{\partial \left(\sum_{i=1}^{m^l}{w^l_{ji}a_i^{l-1}} + b^l_j\right)}{\partial w^l_{jk}} = a_k^{l-1}
$$

So we have $\frac{\partial C}{\partial z^l_{j}} = \delta^l_j$ and $\frac{\partial z^l_j}{\partial w^l_{jk}} = a_k^{l-1}$, from which we conclude:
$$
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
$$

$$
\frac{\partial C}{\partial b^l_j} = \delta^l_j
$$
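The formulas above are enough for a from-scratch implementation. The sketch below assumes sigmoid activations in every layer and the quadratic cost $C = \frac{1}{2}\|a^L - y\|^2$ for a single input $x$, so that $\nabla_a C = a^L - y$; the layer widths and the function names are illustrative. One entry of the gradient is then checked against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(x, y, weights, biases):
    """Quadratic cost 0.5 * ||a^L - y||^2 for one example (an assumption here)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * float(np.sum((a - y) ** 2))

def backprop(x, y, weights, biases):
    """Return lists with dC/dW^l and dC/db^l for every layer."""
    # Forward pass: store all z^l and a^l
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))

    # Output layer: delta^L = grad_a C (elementwise *) sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_W[l] = delta @ activations[l].T   # dC/dw_jk = a_k^{l-1} delta_j^l
        grads_b[l] = delta                      # dC/db_j = delta_j^l
        if l > 0:
            # delta^{l} = ((W^{l+1})^T delta^{l+1}) (elementwise *) sigma'(z^l)
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b

# Hypothetical small network: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

grads_W, grads_b = backprop(x, y, weights, biases)

# Finite-difference check of a single weight in the first layer
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][1, 2] += eps
w_minus[0][1, 2] -= eps
numeric = (cost(x, y, w_plus, biases) - cost(x, y, w_minus, biases)) / (2.0 * eps)
```

The finite-difference estimate needs two forward passes per parameter, whereas `backprop` returns every partial derivative from one forward and one backward pass. That difference is what makes training networks with many parameters feasible.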

#### Proof for Biases

Try it yourself!

## Questions

#### Thank You