# Neural ODEs as the deep limit of ResNets
## Part II: ResNets, Neural ODEs and limits
### Benny Avelin (joint work with Kaj Nyström)

# Overview

* What is a ResNet?
* What is a Neural ODE?
* What is the limit in the title?
* What are our results?

# ImageNet Large Scale Visual Recognition Challenge
## (ILSVRC or ImageNet)
* Over 14 million hand-annotated images in more than 20,000 categories.
* Ran from 2010 to 2017.
<p><a href="https://commons.wikimedia.org/wiki/File:ImageNet_error_rate_history_(just_systems).svg#/media/File:ImageNet_error_rate_history_(just_systems).svg"><center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ImageNet_error_rate_history_%28just_systems%29.svg/1200px-ImageNet_error_rate_history_%28just_systems%29.svg.png" width=500px alt="ImageNet error rate history (just systems).svg"></center></a><br>By Gkrusze - Own work, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=69750373">Link</a></p>

# 2015 winner, ResNet 
#### He, Zhang, Ren, Sun
<center>
    <img src="ResNetBlock.png">
</center>

# ResNet
<img src="ResNet.png" width=900px>

* Call $y_{n-1}$ the input.
* Call $\mathcal{F}$ the `weight+relu+weight` map with weights $\theta_{n-1}$.
* Call $\sigma$ the ReLU.
$$
    y_n = \sigma \bigl(y_{n-1} + \mathcal{F}(y_{n-1},\theta_{n-1})\bigr)
$$

# ResNet
* This is almost a discretized ODE; let's remove the outer ReLU:
$$
    y_n-y_{n-1} = \mathcal{F}(y_{n-1},\theta_{n-1})
$$

* This is the forward Euler discretization (step size 1) of the ODE
$$
    \dot{y}_t = \mathcal{F}(y_t,\theta_t)
$$
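
A minimal sketch of this correspondence, assuming a scalar state and a toy stand-in for $\mathcal{F}$ (the function names and parameter values are illustrative, not taken from the ResNet paper): stacking residual updates gives the same result as forward Euler with step size 1.

```python
def residual_branch(y, theta):
    """Toy stand-in for F: weight -> ReLU -> weight, on a scalar state."""
    w1, w2 = theta
    return w2 * max(0.0, w1 * y)

def resnet_forward(y0, thetas):
    """Stack of residual updates y_n = y_{n-1} + F(y_{n-1}, theta_{n-1})."""
    y = y0
    for theta in thetas:
        y = y + residual_branch(y, theta)
    return y

def euler_forward(y0, thetas, h=1.0):
    """Forward Euler for dy/dt = F(y, theta_t), one parameter set per step of size h."""
    y = y0
    for theta in thetas:
        y = y + h * residual_branch(y, theta)
    return y

# Hypothetical per-layer weights (w1, w2) for three layers.
thetas = [(0.5, 0.1), (1.0, -0.2), (0.3, 0.4)]
assert resnet_forward(1.0, thetas) == euler_forward(1.0, thetas, h=1.0)
```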

# NeuralODEs (2018)
#### Chen, Rubanova, Bettencourt, Duvenaud
$$
    \dot{y}_t = f(y_t,t,\theta)
$$
* $f$ is the network
* $y$ is the ODE solution
* $\theta$ are the parameters

* Problem: if we discretize this and backpropagate to compute the gradient, the memory requirement scales with the number of time steps!
* Their idea: use an adjoint ODE to compute the gradient of the loss, with constant memory requirement.
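
For reference, a sketch of the adjoint system. The notation is introduced here and is not from the slides: $L$ is a loss depending on the final state $y_{t_1}$, the ODE is solved on $[t_0,t_1]$, and $a(t) = \partial L/\partial y_t$ is the adjoint state, solved backward in time from $a(t_1) = \partial L/\partial y_{t_1}$.

$$
    \dot{a}(t) = -a(t)^{\top}\frac{\partial f}{\partial y}(y_t,t,\theta), \qquad
    \frac{dL}{d\theta} = \int_{t_0}^{t_1} a(t)^{\top}\frac{\partial f}{\partial \theta}(y_t,t,\theta)\,dt
$$

Since $y_t$ can be recomputed backward alongside $a(t)$, no intermediate states need to be stored.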

# Deep limit

> Q: Does the SGD for ResNet converge to the SGD for the Neural ODE as the number of layers tends to infinity?

Finite layer version
$$
\begin{align}
	\label{diffeq}
	x^{(N)}_{i+1}(x,\theta)&= x^{(N)}_i(x,\theta) + \frac{1}{N} f_\theta(x^{(N)}_i(x,\theta)),\ i = 0,\ldots, N-1,\notag\\
x^{(N)}_0(x,\theta) &= x.
\end{align}
$$

Limit problem (Neural ODE)
$$
\begin{align}
	\label{ODE}
	\dot{x}(t)&=f_\theta(x(t)),\ t\in (0,1],\ x(0)=x,
\end{align}
$$
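
A small numerical illustration of how the finite-layer output approaches the ODE solution for fixed $\theta$. This is a sketch under a hypothetical choice, $f_\theta(x) = \theta x$ on scalars, picked only because the ODE then has the closed-form solution $x(1) = x e^{\theta}$; it is not the model used in the experiments.

```python
import math

def f(x, theta):
    # Hypothetical f_theta: linear, so the limit ODE can be solved exactly.
    return theta * x

def resnet_output(x, theta, N):
    """x_N^{(N)}: N residual steps of size 1/N, same theta in every layer."""
    for _ in range(N):
        x = x + f(x, theta) / N
    return x

def ode_output(x, theta):
    """x(1) for dx/dt = theta * x with x(0) = x."""
    return x * math.exp(theta)

x0, theta = 1.0, 0.7
errors = {N: abs(resnet_output(x0, theta, N) - ode_output(x0, theta))
          for N in (10, 20, 40, 80)}
for N, err in errors.items():
    print(N, err)
```

In this toy case the error roughly halves each time $N$ doubles, i.e. it behaves like $1/N$.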

Corresponding risks

$$
\begin{align*}
	\mathcal{R}^{(N)}(\theta) := \mathbb{E}_{(x,y) \sim \mu} \left [ \|y-x^{(N)}_N(x,\theta)\|^2\right ]+\gamma H(\theta)
\end{align*}
$$

$$
\begin{align*}
	\mathcal{R}(\theta) := \mathbb{E}_{(x,y) \sim \mu} \left [ \|y-x(1,x,\theta)\|^2\right ]+\gamma H(\theta).
\end{align*}
$$

$H$ is a convex penalization.

The corresponding SGD approximation

$$
\begin{align}\label{sgdsys}
	d\theta^{(N)}_t &= - \nabla \mathcal{R}^{(N)}(\theta^{(N)}_t)\,dt + \Sigma\, dW_t,\notag \\
	d\theta_t &= - \nabla \mathcal{R}(\theta_t)\,dt + \Sigma\, dW_t,
\end{align}
$$
for $t \in [0,T]$.
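
To make the SDE $d\theta_t = -\nabla \mathcal{R}(\theta_t)\,dt + \Sigma\,dW_t$ concrete, here is a minimal Euler–Maruyama sketch. The scalar parameter, the toy risk $\mathcal{R}(\theta) = (\theta-1)^2/2$, and all numerical values are assumptions for illustration only.

```python
import math
import random

def grad_risk(theta):
    # Hypothetical toy risk R(theta) = (theta - 1)^2 / 2, so grad R = theta - 1.
    return theta - 1.0

def euler_maruyama(theta0, sigma, T, steps, rng):
    """Simulate d(theta) = -grad R(theta) dt + sigma dW on [0, T]."""
    dt = T / steps
    theta = theta0
    path = [theta]
    for _ in range(steps):
        dW = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        theta = theta - grad_risk(theta) * dt + sigma * dW
        path.append(theta)
    return path

rng = random.Random(0)
noisy = euler_maruyama(theta0=3.0, sigma=0.1, T=5.0, steps=500, rng=rng)
# With sigma = 0 the scheme reduces to plain gradient descent with step dt.
noiseless = euler_maruyama(theta0=3.0, sigma=0.0, T=5.0, steps=500, rng=rng)
```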

> Q: In what sense does $\theta_t^{(N)} \to \theta_t$?

### Thm (A, Nyström):
There exists a penalization $H$ such that
$$
    \begin{align*}
		\sup_{t\in[0,T]} \|\theta_t-\theta^{(N)}_t\| \to 0 \quad \text{in probability as $N \to \infty$}
	\end{align*}
$$

$$
\begin{align*}
		\mathbb{E} [\mathcal{R}(\theta_T)] < \infty, \quad \mathbb{E} [\mathcal{R}^{(N)}(\theta^{(N)}_T)] < \infty.
	\end{align*}
$$

> Q: Can we infer anything about $\mathcal{R}^{(N)}(\theta_T^{(N)}) \to \mathcal{R}(\theta_T)$?

* The obstacle is the rapid growth of $\mathcal{R}$, because of which we seem to lose control of the difference.

* One could argue that this problem is purely academic: in practice we work with numbers that have a maximum representable value.

### Additional assumption (Capped model)
$$
\begin{align}\label{eq:modified_risk-}
\tilde {\mathcal{R}}^{(N)}(\theta)&:= \mathbb{E}_{(x,y) \sim \mu} \left [ \|y-x^{(N)}_N(x,T_{\Lambda}(\theta))\|^2\right ]+\gamma H(\theta),\notag\\
\tilde {\mathcal{R}}(\theta)&:= \mathbb{E}_{(x,y) \sim \mu} \left [ \|y-x(1,x,T_{\Lambda}(\theta))\|^2\right ]+\gamma H(\theta),
\end{align}
$$
$H$ is a convex potential with quadratic growth.

### Thm (A, Nyström): (Capped Model)
Assume that the initial density $p_0$ of $\theta_t$ and $\theta^{(N)}_t$ is compactly supported. Then
$$
\begin{align}
		\label{thm2.4a}
		\sup_{t\in[0,T]}\bigl \|\mathbb{E}[\theta_t-\theta^{(N)}_t]\bigr \|&\leq   c N^{-1}\|p_0\|_2,\\
		\label{thm2.4b}
		\sup_{t\in[0,T]}\bigl |\mathbb{E}[\tilde {\mathcal{R}}(\theta_t)-\tilde {\mathcal{R}}^{(N)}(\theta^{(N)}_t) ]\bigr |&\leq  c N^{-1}\|p_0\|_2.
	\end{align}
$$

Let $p(\theta,t)$ and $p^{(N)}(\theta,t)$ be the probability densities of $\theta_t$ and $\theta_t^{(N)}$, respectively. Then
$$
\begin{align} \label{esta1}
	\int\limits_{\mathbb R^m}e^{\gamma H(\theta)/4}|p(\theta,T)-p^{(N)}(\theta,T)|\, d\theta&\leq  c N^{-1}\|p_0\|_2,\\
\int\limits_{B(0, 2^{k+1}\tilde R_0)\setminus B(0, 2^{k}\tilde R_0)}e^{\gamma H(\theta)/4}|p(\theta,T)-p^{(N)}(\theta,T)|\, d\theta&\leq  c e^{-2^k\tilde R_0^2/T} N^{-1}\|p_0\|_2, \notag
\end{align}
$$

# Cifar10
60000 32x32 colour images in 10 classes, with 6000 images per class

<center>
<img src="cifar10_plot.png">
</center>

# Numerical experiment
![CIFAR10_ACC.PNG](CIFAR10_ACC.PNG)