%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Deep Neural Networks

Neural networks can be viewed as a cascade (pipeline) of linear classifiers or regressors. For instance,
let $X \subseteq \mathbb{R}^m$ be our data and $Y \subseteq \mathbb{R}^n$ be the classes. Define $H \subseteq \mathbb{R}^n$ and let $f_{i}:X \to H_i$ be a linear function:
<br>
$$f_{i}(x) = \sum_{j=1}^m W_{i,j}x_j + b_i$$ or 
<br>
$$f_i(x) = W_ix + b_i$$
<br>
where 
$x = \begin{pmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{pmatrix} \in \mathbb{R}^{m \times 1}$, 
$b = \begin{pmatrix}
       b_{1} \\
       b_{2} \\
       \vdots \\
       b_{n}
     \end{pmatrix} \in \mathbb{R}^{n \times 1}$ 
and $W_i = (W_{i,1}, W_{i,2}, \dots, W_{i,m}) \in \mathbb{R}^{1 \times m}$ is the $i$-th row of weights. Stacking all $n$ rows gives the matrix form:
<br>
$$
f(x) = Wx + b
$$
<br>
here 
$x = \begin{pmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{pmatrix} \in \mathbb{R}^{m \times 1}$, 
$b = \begin{pmatrix}
       b_{1} \\
       b_{2} \\
       \vdots \\
       b_{n}
     \end{pmatrix} \in \mathbb{R}^{n \times 1}$
and
$W = \begin{pmatrix}
       W_{1,1} & W_{1,2} & \dots & W_{1,m} \\
       W_{2,1} & W_{2,2} & \dots & W_{2,m} \\
       \vdots & \vdots & \ddots & \vdots \\
       W_{n,1} & W_{n,2} & \dots & W_{n,m}
     \end{pmatrix} \in \mathbb{R}^{n \times m}$
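
To make the matrix form concrete, here is a minimal sketch of $f(x) = Wx + b$ in plain Python (no dependencies). The function name `linear` and the numbers below are illustrative assumptions, not part of any library.

```python
def linear(W, x, b):
    """Compute f(x) = W x + b for W with n rows of length m, x of length m, b of length n."""
    if len(W) != len(b):
        raise ValueError("W must have one row per bias entry")
    out = []
    for W_i, b_i in zip(W, b):
        if len(W_i) != len(x):
            raise ValueError("each row of W must have len(x) entries")
        # f_i = sum_j W_ij * x_j + b_i
        out.append(sum(w_ij * x_j for w_ij, x_j in zip(W_i, x)) + b_i)
    return out


# Hypothetical example with m = 3 inputs and n = 2 outputs.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]
x = [2.0, 4.0, 6.0]
y = linear(W, x, b)  # one output per row of W
```

Each output entry is one linear classifier or regressor applied to the same input, which is why a single layer can be read as $n$ linear models evaluated in parallel.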

Now consider another mapping $\sigma:H \to A$, where $A \subseteq \mathbb{R}^n$, applied elementwise (the activation function). Common choices are:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$
Sigmoid

$$\sigma(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$
Tanh

$$\sigma(x) = \max(0, x)$$
ReLU
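
The three activations can be sketched in plain Python as follows. The helper name `apply_elementwise` is an assumption introduced for illustration. The sigmoid is written in a numerically stable form so that large negative inputs do not overflow `math.exp`.

```python
import math


def sigmoid(x):
    # Stable evaluation of 1 / (1 + exp(-x)) for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    # Hyperbolic tangent, (e^x - e^-x) / (e^x + e^-x).
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def apply_elementwise(sigma, h):
    """Apply a scalar activation sigma to every entry of the vector h."""
    return [sigma(h_i) for h_i in h]


activated = apply_elementwise(relu, [-1.0, 0.0, 2.5])
```

Sigmoid and tanh are related by $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, so they differ only by a rescaling of input and output.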

## Why DNN

Why deep neural networks?
- Dimensionality
- Multi-model (ensemble)
- Feature extractor

Still, why should they work?
- Need more data
- Computationally expensive training and inference
- Black box

For kernel methods, linear regression, random forests, and gradient boosting, there is analysis explaining why the model should work. For DNNs we do not have such a clear picture.

## Universal Approximation Theorems

#### Theorem (The Universal Approximation Theorem):
<br>
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a nonconstant, bounded, and continuous function (called the activation function). Let $I_m$ denote the $m$-dimensional unit hypercube $[0,1]^m$. The space of real-valued continuous functions on 
$I_{m}$ is denoted by 
$C(I_{m})$. Then, given any $\varepsilon >0$ and any function $f\in C(I_{m})$, there exist an integer $N$, real constants $v_{i},b_{i}\in \mathbb {R}$ and real vectors $w_{i}\in \mathbb {R} ^{m}$ for $i=1,\ldots ,N$, such that we may define:
<br>
$$
F( x ) = \sum_{i=1}^{N} v_i \sigma \left( w_i^T x + b_i\right)
$$
<br>
as an approximate realization of the function $f$; that is,
<br>
$$
|F(x)-f(x)|<\varepsilon
$$
<br>
for all $x\in I_{m}$. In other words, functions of the form $F(x)$ are dense in $\displaystyle C(I_{m})$.
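
As a small illustration of the form $F(x) = \sum_{i=1}^{N} v_i \sigma(w_i^T x + b_i)$, the sketch below evaluates $F$ for hand-picked parameters. All values ($N = 2$, $m = 1$, the steepness `k`) are assumptions chosen for illustration: the difference of two steep sigmoids forms a smooth bump that is close to 1 on $(0.3, 0.7)$ and close to 0 elsewhere. The theorem only guarantees that suitable parameters exist; it does not say how to find them.

```python
import math


def sigmoid(t):
    # Stable evaluation of 1 / (1 + exp(-t)).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def F(x, v, w, b, sigma=sigmoid):
    """Evaluate F(x) = sum_i v_i * sigma(w_i^T x + b_i) for a vector x."""
    total = 0.0
    for v_i, w_i, b_i in zip(v, w, b):
        pre = sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i
        total += v_i * sigma(pre)
    return total


# Hypothetical parameters: one hidden layer with N = 2 units on inputs in [0, 1].
k = 100.0                  # steepness of each sigmoid (illustrative choice)
v = [1.0, -1.0]            # output weights v_i
w = [[k], [k]]             # one weight vector w_i per hidden unit
b = [-0.3 * k, -0.7 * k]   # biases place the two steps near x = 0.3 and x = 0.7

inside = F([0.5], v, w, b)   # point inside the bump
outside = F([0.1], v, w, b)  # point outside the bump
```

Summing many such bumps with different heights and positions is one intuitive way to see why sums of this form can approximate continuous functions on $I_m$.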

#### Theorem (The Universal Approximation Theorem for any Compact Set)
<br>
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a nonconstant, bounded, and continuous function (called the activation function). Let $K \subset \mathbb{R}^m$ denote any compact subset of $\mathbb{R}^m$. The space of real-valued continuous functions on 
$K$ is denoted by 
$C(K)$. Then, given any $\varepsilon >0$ and any function $f\in C(K)$, there exist an integer $N$, real constants $v_{i},b_{i}\in \mathbb {R}$ and real vectors $w_{i}\in \mathbb {R} ^{m}$ for $i=1,\ldots ,N$, such that we may define:
<br>
$$
F( x ) = \sum_{i=1}^{N} v_i \sigma \left( w_i^T x + b_i\right)
$$
<br>
as an approximate realization of the function $f$; that is,
<br>
$$
|F(x)-f(x)|<\varepsilon
$$
<br>
for all $x\in K$. In other words, functions of the form $F(x)$ are dense in $\displaystyle C(K)$.

#### Theorem (Bounded case)
<br>
The universal approximation theorem for width-bounded networks can be expressed mathematically as follows:

For any Lebesgue-integrable function 
$f:\mathbb {R} ^{n}\rightarrow \mathbb {R}$ and any $\epsilon >0$, there exists a fully-connected ReLU network 
$\mathcal {A}$ with width $d_{m}\leq {n+4}$, such that the function 
$F_{\mathcal {A}}$ represented by this network satisfies
<br>
$$ 
\int _{\mathbb {R} ^{n}}\left|f(x)-F_{\mathcal {A}}(x)\right|\mathrm {d} x<\epsilon
$$

## Definitions and Notions

Let's define the weights of layer $l$ as $W^l$:
<br>
$$
W^l = \begin{pmatrix}
        W_{1,1}^l & W_{1,2}^l & \dots & W_{1,m^l}^l \\
        W_{2,1}^l & W_{2,2}^l & \dots & W_{2,m^l}^l \\
        \vdots & \vdots & \ddots & \vdots \\
        W_{n^l,1}^l & W_{n^l,2}^l & \dots & W_{n^l,m^l}^l
      \end{pmatrix} \in \mathbb{R}^{n^l \times m^l}
$$
<br>

$$
F(x) = \sigma(W^{L-1}\sigma(\dots \sigma(W^2\sigma(W^1x + b^1) + b^2)\dots) + b^{L-1})
$$

We denote 
$$a^l = \sigma(W^la^{l-1} + b^l)$$
with $a^0 = x$.

So the output $a^{L-1}$ is an $n^{L-1}$-dimensional vector (hyperparameter alarm: the layer widths $n^l$ are ours to choose).
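
The recursion $a^l = \sigma(W^l a^{l-1} + b^l)$ can be sketched as a loop over layers. This is a minimal plain-Python illustration, assuming $a^0 = x$; the function name `forward` and the layer sizes and weights below are hypothetical.

```python
import math


def sigmoid(t):
    # Stable evaluation of 1 / (1 + exp(-t)).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def forward(x, weights, biases, sigma=sigmoid):
    """Return the last activation, applying a^l = sigma(W^l a^{l-1} + b^l) per layer."""
    a = list(x)  # a^0 = x (assumed starting point of the recursion)
    for W, b in zip(weights, biases):
        a = [
            sigma(sum(w_ij * a_j for w_ij, a_j in zip(W_i, a)) + b_i)
            for W_i, b_i in zip(W, b)
        ]
    return a


# Hypothetical network: 3 inputs -> 4 hidden units -> 2 outputs.
weights = [
    [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5], [1.0, 1.0, 1.0], [-0.3, 0.2, 0.1]],
    [[0.5, -0.5, 0.25, 0.0], [-1.0, 1.0, 0.0, 0.5]],
]
biases = [[0.0, 0.1, -0.1, 0.2], [0.0, 0.0]]
out = forward([1.0, 2.0, 3.0], weights, biases)  # length equals the last layer width
```

The number of rows of each $W^l$ fixes the width $n^l$ of that layer, and the number of columns must match the width of the previous layer.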

## Different Architectures

Weights sharing:


Residual connections:

Recurrent neural networks (RNN):

LSTM, GRU Gates:

## Training Deep Neural Networks

Problem with $\frac{\partial L}{\partial W_{i, j}^{l}}$