# Introduction to deep learning
Benny Avelin
<p><a href="https://commons.wikimedia.org/wiki/File:Colored_neural_network.svg#/media/File:Colored_neural_network.svg"><center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/1200px-Colored_neural_network.svg.png" width=500px alt="Colored neural network.svg"></center></a><br><font size="1">By <a href="//commons.wikimedia.org/wiki/User_talk:Glosser.ca" title="User talk:Glosser.ca">Glosser.ca</a> - <span class="int-own-work" lang="en">Own work</span>, Derivative of <a href="//commons.wikimedia.org/wiki/File:Artificial_neural_network.svg" title="File:Artificial neural network.svg">File:Artificial neural network.svg</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=24913461">Link</a></font></p>

# Definitions (Skip)
$\newcommand{\R}{\mathbb{R}}$
$\newcommand{\E}{\mathbb{E}}$
$\newcommand{\H}{\mathcal{H}}$
$\newcommand{\VCdim}{\text{VC-dim}(\H)}$

# Overview (session 4)

* CNNs
* ImageNet challenge and different winners
* Autoencoders
* Reinforcement learning

# There are many different problems that NNs attempt to solve
* Image classification

* Object localization

* Natural language processing

* Image restoration

* Fingerprinting

# CNNs
* When images or audio are involved, convolutional neural networks (CNNs) are almost exclusively used.

* Convolutional neural networks are an extension of the neural networks that we have defined.

* We replace the vector dot product with a discrete convolution, which is again a linear operator.

# CNNs
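* Consider an $a \times b$ convolutional kernel (matrix) $K$.

* Consider an input $X$ that is an $n \times m$ matrix (think of an image). The convolution is
$$
    (K \ast X)_{ij} = \sum_{k=1}^{a} \sum_{l=1}^{b} K_{kl} X_{i-k+1,j-l+1}
$$
for $i = a,\ldots, n$ and $j=b, \ldots, m$, so every index of $X$ stays within bounds and the output has shape $(n-a+1) \times (m-b+1)$.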

* This is called `padding=valid` convolution.

* There are also dimension-preserving convolutions (`padding=same`), obtained by extending $X$ with zeros outside its boundary.

* In one dimension (usually time series) there is also `padding=causal`, which pads only on the past side so that the output at time $t$ never depends on future inputs.
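
The `valid` and `same` variants can be sketched in plain Python. This is a minimal illustration, not how a deep learning library implements it: the helper names `conv2d_valid`, `zero_pad` and `conv2d_same` are hypothetical, the `valid` output is taken to have shape $(n-a+1) \times (m-b+1)$, and the way `same` splits the zero padding between the two sides for even kernel sizes is an assumption. Note also that most deep learning libraries compute the unflipped cross-correlation under the name convolution.

```python
def conv2d_valid(K, X):
    """Kernel-flipped 2D convolution of X with K, without padding."""
    a, b = len(K), len(K[0])
    n, m = len(X), len(X[0])
    out = []
    for i in range(n - a + 1):
        row = []
        for j in range(m - b + 1):
            # Reversed indices into X give a convolution, not a correlation.
            row.append(sum(K[k][l] * X[i + a - 1 - k][j + b - 1 - l]
                           for k in range(a) for l in range(b)))
        out.append(row)
    return out


def zero_pad(X, top, bottom, left, right):
    """Extend X by zeros on each side."""
    width = len(X[0]) + left + right
    padded = [[0] * width for _ in range(top)]
    padded += [[0] * left + list(row) + [0] * right for row in X]
    padded += [[0] * width for _ in range(bottom)]
    return padded


def conv2d_same(K, X):
    """Zero-pad X so that the output has the same shape as X."""
    a, b = len(K), len(K[0])
    top, left = (a - 1) // 2, (b - 1) // 2  # assumed split of the padding
    padded = zero_pad(X, top, a - 1 - top, left, b - 1 - left)
    return conv2d_valid(K, padded)


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]

valid = conv2d_valid(K, X)  # [[4, 4], [4, 4]]
same = conv2d_same(K, X)    # 3 x 3, same shape as X
```

The `valid` result shrinks from $3 \times 3$ to $2 \times 2$, while `same` keeps the input shape at the cost of boundary values that are influenced by the zero padding.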

# How does this become a network?
* A single artificial neuron can then be represented as
$$
    h(x) = \sigma(w \cdot x + b)
$$

* We replace the dot product $w \cdot x$ with a convolution, and instead of $x \in \mathbb{R}^d$ we consider $X \in \mathbb{R}^{n \times m}$.
* $\sigma$ is applied componentwise as before.

# CNNs

* A single CNN neuron can then be represented as
$$
    h(X) = \sigma(K \ast X + b)
$$
where $b$ is a scalar added to each component.

* But this handles only a single-channel input!

* We describe the input dimensions by also specifying how many `channels` the input has.

* Thus a real input has shape $X \in \mathbb{R}^{n \times m \times d}$.

# Image data
* Image data has this form: the first two coordinates index the pixels.

* The third coordinate is the `channel`, which for images is normally the color: `RGB` (red, green, blue).

* Thus the full definition of a single CNN neuron with $d$ input channels becomes
$$
    h(X) = \sigma\left (\sum_{i=1}^d K_i \ast X_i + b \right )
$$
where $X_i$ is the $i$-th channel of the input and $K_i$ is the kernel for the $i$-th channel.
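
A minimal sketch of this multi-channel neuron in plain Python, assuming `valid` padding and ReLU as $\sigma$. The names `conv2d_valid`, `relu` and `cnn_neuron`, as well as the toy channels and kernels, are made up for illustration.

```python
def conv2d_valid(K, X):
    """Kernel-flipped 2D convolution of X with K, without padding."""
    a, b = len(K), len(K[0])
    n, m = len(X), len(X[0])
    return [[sum(K[k][l] * X[i + a - 1 - k][j + b - 1 - l]
                 for k in range(a) for l in range(b))
             for j in range(m - b + 1)]
            for i in range(n - a + 1)]


def relu(z):
    return max(0.0, z)


def cnn_neuron(kernels, channels, bias, activation=relu):
    """h(X) = sigma(sum_i K_i * X_i + b) for an input with d channels."""
    assert len(kernels) == len(channels), "one kernel per input channel"
    maps = [conv2d_valid(K, X) for K, X in zip(kernels, channels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    # Sum the per-channel maps, add the scalar bias, apply sigma componentwise.
    return [[activation(sum(fm[i][j] for fm in maps) + bias)
             for j in range(cols)]
            for i in range(rows)]


# A toy 3 x 3 "image" with three channels (assumed values).
red = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
green = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
blue = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
kernels = [
    [[1, 0], [0, -1]],  # K_1, applied to the red channel
    [[1, 1], [1, 1]],   # K_2, applied to the green channel
    [[0, 0], [0, 0]],   # K_3, ignores the blue channel
]

feature_map = cnn_neuron(kernels, [red, green, blue], bias=-5.0)
```

The neuron has one kernel per input channel but produces a single output map. A convolutional layer stacks several such neurons to produce several output channels.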

# What do the kernels look like?
* AlexNet (Krizhevsky, Sutskever, Hinton, 2012)
<img src="AlexNet.jpeg" width=200%>

# AlexNet 2012
#### Krizhevsky, Sutskever, Hinton
<img src="AlexNetTopology.png" width=900px>

* Widely considered one of the most influential papers on using GPUs to train deep CNNs.

# ImageNet Large Scale Visual Recognition Challenge
## (ILSVRC or ImageNet)
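* The ImageNet dataset contains over 14 million hand-annotated images in more than 20,000 categories; the challenge uses a subset with 1,000 categories.
* The challenge ran annually from 2010 to 2017.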
<p><a href="https://commons.wikimedia.org/wiki/File:ImageNet_error_rate_history_(just_systems).svg#/media/File:ImageNet_error_rate_history_(just_systems).svg"><center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ImageNet_error_rate_history_%28just_systems%29.svg/1200px-ImageNet_error_rate_history_%28just_systems%29.svg.png" width=500px alt="ImageNet error rate history (just systems).svg"></center></a><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Gkrusze&amp;action=edit&amp;redlink=1" class="new" title="User:Gkrusze (page does not exist)">Gkrusze</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=69750373">Link</a></p>

# 2014 winner, GoogLeNet
#### Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich
![GoogLeNet](googlenet_diagram.png)
* VGG16 was the runner-up.

# 2015 winner, ResNet 
#### He, Zhang, Ren, Sun
<center>
    <img src="ResNetBlock.png">
</center>

# ResNet
<img src="ResNet.png" width=900px>

# ResNet
<center>
    <img src="ResNetBlock.png" width=400px>
</center>

* Let $y_{n-1}$ denote the input
* Let $\mathcal{F}$ denote the `weight+relu+weight` part of the block
* Let $\sigma$ denote the ReLU
$$
    y_n = \sigma (y_{n-1} + \mathcal{F}(y_{n-1}))
$$

# ResNet
* This is almost a discretized ODE. Let's remove the outer ReLU:
$$
    y_n-y_{n-1} = \mathcal{F}(y_{n-1})
$$

* This is the forward Euler discretization, with step size $1$, of the ODE
$$
    \dot{y}_t = \mathcal{F}(y_t)
$$
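
The correspondence can be checked numerically with a small sketch. Here the residual update is written with an explicit step size `h` (the ResNet block corresponds to `h = 1`), and $\mathcal{F}(y) = -y$ is a toy stand-in for the learned block, chosen only because its ODE has the known solution $y_t = y_0 e^{-t}$. The function names are hypothetical.

```python
import math


def residual_step(y, F, h=1.0):
    """One residual update y_n = y_{n-1} + h * F(y_{n-1}) (forward Euler)."""
    return y + h * F(y)


def integrate(y0, F, t_end, n_steps):
    """Stack n_steps residual updates to go from t = 0 to t = t_end."""
    h = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        y = residual_step(y, F, h)
    return y


def F(y):
    # Toy dynamics (an assumption for illustration, not a trained network).
    return -y


coarse = integrate(1.0, F, t_end=1.0, n_steps=4)    # few "layers"
fine = integrate(1.0, F, t_end=1.0, n_steps=1000)   # many "layers"
exact = math.exp(-1.0)
```

More steps of smaller size bring the stacked residual updates closer to the ODE solution, which is the intuition behind viewing a deep ResNet as a discretized flow.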

# NeuralODEs (2018)
#### Chen, Rubanova, Bettencourt, Duvenaud
$$
    \dot{y_t} = f(y_t,t,\theta)
$$
* $f$ is the network
* $y$ is the ODE solution
* $\theta$ are the parameters

* The problem is that if we discretize this and backpropagate through the solver to compute the gradient, the memory requirement scales with the number of time steps!
* Their idea: use an adjoint ODE to compute the gradient of the loss, which needs only constant memory.
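
A sketch of the adjoint equations in the notation of this slide, assuming a scalar loss $L$ that depends on the final state $y_{t_1}$ (the symbols $a$, $L$, $t_0$ and $t_1$ are introduced here for illustration). Define the adjoint state $a(t) = \partial L / \partial y_t$. It solves an ODE that is integrated backward in time from $t_1$ to $t_0$,
$$
    \dot{a}(t) = -a(t)^\top \frac{\partial f(y_t,t,\theta)}{\partial y}, \quad a(t_1) = \frac{\partial L}{\partial y_{t_1}}
$$
and the gradient with respect to the parameters is
$$
    \frac{dL}{d\theta} = \int_{t_0}^{t_1} a(t)^\top \frac{\partial f(y_t,t,\theta)}{\partial \theta} \, dt
$$
Since $y_t$ can be recomputed by solving the original ODE backward together with $a(t)$, no intermediate states of the forward pass have to be stored.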

# Autoencoders
<p><a href="https://commons.wikimedia.org/wiki/File:Autoencoder_structure.png#/media/File:Autoencoder_structure.png"><img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Autoencoder_structure.png" alt="Autoencoder structure.png"></a><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Chervinskii&amp;action=edit&amp;redlink=1" class="new" title="User:Chervinskii (page does not exist)">Chervinskii</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=45555552">Link</a></p>

# Autoencoders
* Nonlinear "compression" or "projection" or "sparse representation".

* With linear activations, or with sigmoid activations and one hidden layer, there is a strong connection to PCA (Plaut, 2018):

> "The weights of an autoencoder with a single hidden layer of size $p$ (where $p$ is less than the size of the input) span the same vector subspace as the one spanned by the first $p$ principal components, and the output of the autoencoder is an orthogonal projection onto this subspace."

# Autoencoders
* Denoising Autoencoder (can be used for fingerprinting)

* Sparse Autoencoder (large hidden layer with a sparsity penalty)

* Variational Autoencoder (Variational Bayesian approach, approximates the posterior, strong assumptions)

# Reinforcement learning
* In its basic form it is modeled as a *Markov decision process*

<div class="row">
  <div class="col-md-8" markdown="1">
      <ul>
          <li>A set $S$ of environment and agent states</li>
          <li>A set $A$ of actions available to the agent</li>
          <li>$P_a(s,s') = P(s_{t+1}=s' \mid s_t = s, a_t = a)>0$, the transition probabilities</li>
          <li>$R_a(s,s')$, the immediate reward for taking action $a$ in state $s$ and ending up in state $s'$</li>
          <li>$\pi: S \to A$, the policy</li>
      </ul>
  </div>
  <div class="col-md-4" markdown="1">
      <p><a href="https://commons.wikimedia.org/wiki/File:Reinforcement_learning_diagram.svg#/media/File:Reinforcement_learning_diagram.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/1200px-Reinforcement_learning_diagram.svg.png" alt="Reinforcement learning diagram.svg" width=200px class='center-block'></a><br><small>By <a href="//commons.wikimedia.org/w/index.php?title=User:Megajuice&amp;action=edit&amp;redlink=1" class="new" title="User:Megajuice (page does not exist)">Megajuice</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="http://creativecommons.org/publicdomain/zero/1.0/deed.en" title="Creative Commons Zero, Public Domain Dedication">CC0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=57895741">Link</a></small></p>
  </div>
</div>

# Q-Learning (Watkins 1989)

* In Q-Learning we do not know the transition probabilities, so we have to learn from experience gathered by interacting with the environment.

* The value of a policy $\pi$ starting from state $s$ is the expected discounted reward
$$
    V^\pi(s) = \mathbb{E} \left [ \sum_{t=0}^{\infty} \gamma^t R_{\pi(s_t)}(s_t,s_{t+1}) \right ], \quad s_0=s
$$
where, given $s_t$, the next state $s_{t+1}$ is a random variable with distribution $P_{\pi(s_t)}(s_t,\cdot)$, and $0 \le \gamma < 1$ is a discount factor.
* For a given policy the sequence of states $s_t$ starting in $s$ is a Markov process.
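
The quantity inside the expectation is the discounted return of a single trajectory, and $V^\pi(s)$ is its average over trajectories started in $s$. A minimal sketch of the return for a finite list of observed rewards, where the function name `discounted_return` and the sample rewards are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Work backward: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion used here has the same structure as the Bellman equation.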

# Q-Learning
$$
    V^\ast(s) = \max_{\pi} V^{\pi}(s)
$$

We define for a given $s$ and $a$, 
$$
    Q^\ast(s,a) = \max_\pi Q^\pi(s,a)
$$
where $Q^\pi(s,a)$ is the expected discounted reward when starting in $s$, taking action $a$, and then following $\pi$.

$$
    V^\ast(s) = \max_a Q^\ast(s,a)
$$

The Bellman equation for MDPs states that
$$
    Q^\ast(s,a) = \mathbb{E}_{s' \sim P_a(s,\cdot)}[R_a(s,s') + \gamma \max_{a'} Q^\ast(s',a')]
$$
where $s'$ is the state reached after taking action $a$ in state $s$.

* Note that the maximizing policy could a priori depend on the starting state $s$, but our assumption that the transition probabilities are positive removes this dependency.
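
When the transition probabilities are known, the Bellman equation can be solved by repeatedly applying its right-hand side as an update. The sketch below does this for a made-up two-state, two-action MDP. The function name `q_value_iteration`, the toy `P` and `R`, and the indexing `P[a][s][s2]`, `Q[s][a]` are assumptions for illustration. This is not Q-learning itself, which has to work without access to `P`.

```python
def q_value_iteration(P, R, gamma, n_iter=200):
    """Iterate Q(s,a) <- E_{s'}[R_a(s,s') + gamma * max_a' Q(s',a')]."""
    n_actions, n_states = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iter):
        V = [max(Q[s]) for s in range(n_states)]  # V(s) = max_a Q(s, a)
        Q = [[sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                  for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q


# Toy MDP: action 0 mostly stays, action 1 mostly switches state.
# All transition probabilities are positive, as assumed on the MDP slide.
P = [
    [[0.9, 0.1], [0.1, 0.9]],  # P_0(s, s')
    [[0.1, 0.9], [0.9, 0.1]],  # P_1(s, s')
]
# Reward 1 for ending up in state 1, otherwise 0.
R = [[[0.0, 1.0], [0.0, 1.0]] for _ in range(2)]

Q_star = q_value_iteration(P, R, gamma=0.5)
greedy_policy = [max(range(2), key=lambda a: Q_star[s][a]) for s in range(2)]
```

For this toy problem the greedy policy switches in state 0 and stays in state 1, and $V^\ast(s) = \max_a Q^\ast(s,a) = 0.9/(1-\gamma) = 1.8$ in both states.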

# Q-Learning
* Start with an initial guess for the function $Q: S \times A \to \mathbb{R}$
* Iterate as follows
$$\tiny{\displaystyle Q^{new}(s_{t},a_{t})\leftarrow (1-\alpha )\cdot \underbrace {Q(s_{t},a_{t})} _{\text{old value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot \overbrace {{\bigg (}\underbrace {r_{t}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\text{estimate of optimal future value}}{\bigg )}} ^{\text{learned value}}}$$
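
A minimal tabular sketch of this update, assuming an $\epsilon$-greedy behavior policy for exploration (the slide does not prescribe one). The environment `toy_step`, the function name `q_learning` and all hyperparameter values are made up for illustration. The learner only sees sampled transitions and rewards, never the transition probabilities.

```python
import random


def q_learning(step, n_states, n_actions, gamma=0.5, alpha=0.1,
               epsilon=0.2, n_steps=20000, seed=0):
    """Tabular Q-learning from sampled transitions (s, a, r, s_next)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy: explore with probability epsilon, else act greedily.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda act: Q[s][act])
        s_next, r = step(s, a, rng)
        # Learned value: reward plus discounted estimate of the future value.
        target = r + gamma * max(Q[s_next])
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
        s = s_next
    return Q


def toy_step(s, a, rng):
    """Hypothetical 2-state environment: action 0 tries to stay, action 1
    tries to switch; the intended move succeeds with probability 0.9."""
    intended = s if a == 0 else 1 - s
    s_next = intended if rng.random() < 0.9 else 1 - intended
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward


Q = q_learning(toy_step, n_states=2, n_actions=2)
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

With a constant learning rate the table keeps fluctuating around the optimal values, but the greedy policy read off from it settles on switching in state 0 and staying in state 1.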

# Q-Learning
* Q-Learning requires us to keep track of the matrix $Q$, which can be very large if the state and action spaces are large.

## Neural Q-Learning
* Approximate the $Q$ matrix with a deep neural network
* Works, but consecutive training examples are too strongly correlated.

## Deep Q-Learning
* DeepMind used a method called `experience replay` to decorrelate the training examples.
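
The idea of experience replay can be sketched as a fixed-size buffer of past transitions from which training batches are drawn uniformly at random, so that one batch mixes experience from many different time steps. The class name `ReplayBuffer`, its methods and the dummy transitions are hypothetical; this is not DeepMind's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions (state, action, reward, next_state)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    # Dummy transitions; a real agent would store what it observed.
    buffer.push(t, t % 2, 1.0, t + 1)

batch = buffer.sample(8)
```

Each gradient step of the $Q$ network would then use such a batch instead of the most recent transition alone.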

# Notable achievements
* AlphaGo, learned from expert players and then trained against itself.
* AlphaGo Zero, trained entirely by playing against itself.
* AlphaZero, similar to AlphaGo Zero but stronger and more general: it can play several games.
* DeepMind, Playing Atari with Deep Reinforcement Learning (2013)

# Next session

* ?