# Machine Learning for Dynamical Systems 

How can we learn dynamical systems from data? 
How can we use machine learning methods such as artificial neural networks for dynamical systems? 


## Methods for this Workshop 

* Neural Differential Equations 
* Reservoir Computing 
* Symbolic Methods (SINDy) 

### Approximate a Dynamical System 


* Suppose we have a dynamical system $$\frac{d\mathbf{x}}{dt} = f(\mathbf{x},t;\theta)$$ from which we observe data $\mathbf{X}=\{\mathbf{x}(t_i)\}$ for $t_i\in[t_0, t_f]$. To begin with, we restrict ourselves to evenly sampled observations $t_i = i\cdot\Delta t + t_0$

* The corresponding discretized dynamical system is $$\mathbf{x}_{n+1} = g(\mathbf{x}_n, t_n; \theta)$$ where $g$ is one iteration of some numerical DE solver

* With **Neural Differential Equations** we replace $f$ with an ANN and use it as a universal function approximator to learn the right-hand side of the dynamical system from the observations $\mathbf{X}$

* With **Recurrent Neural Networks** (RNNs) such as **Reservoir Computing**, we try to replace $g$ with an ANN and learn it from data 

* When we want to recover the actual analytical expression of $f$, we have to rely on symbolic methods such as **Sparse Identification of Nonlinear Dynamics (SINDy)**
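
As a minimal sketch of the discretized map $g$ described above, the block below implements one classical Runge-Kutta (RK4) step for a scalar ODE and iterates it on an evenly spaced time grid. The concrete right-hand side $f = -\theta x$, the step size and all function names are illustrative assumptions, not part of the workshop material.

```python
import math

def f(x, t, theta):
    """Right-hand side of the example ODE dx/dt = -theta * x (illustrative choice)."""
    return -theta * x

def rk4_step(f, x, t, dt, theta):
    """One RK4 iteration, i.e. the discrete map x_{n+1} = g(x_n, t_n; theta)."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * dt * k1, t + 0.5 * dt, theta)
    k3 = f(x + 0.5 * dt * k2, t + 0.5 * dt, theta)
    k4 = f(x + dt * k3, t + dt, theta)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def sample_trajectory(x0, t0, dt, n_steps, theta):
    """Iterate g to obtain evenly sampled observations at t_i = i * dt + t0."""
    xs = [x0]
    for n in range(n_steps):
        xs.append(rk4_step(f, xs[-1], t0 + n * dt, dt, theta))
    return xs

# Observations X = {x(t_i)} on [0, 1] with dt = 0.1
X = sample_trajectory(x0=1.0, t0=0.0, dt=0.1, n_steps=10, theta=2.0)
```

For this linear test problem the last sample can be compared against the exact solution $e^{-\theta t}$, which is a quick way to check that the solver step is implemented correctly.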

### Prior Knowledge 

* Aside from MLPs, there are many different types of ANNs, e.g.: ![image-2.png](projects/NeuralDifferentialEquations/notebook-assets/typesofanns.png)


## Artificial Neural Networks: A quick summary

Artificial neural networks (ANNs) in the form of multilayer perceptrons (MLPs) are:

* Networks made up of a chain of several layers: $$f(\mathbf x) = f^{(3)}(f^{(2)}(f^{(1)}(\mathbf x; \theta^{(1)}); \theta^{(2)}); \theta^{(3)}),$$ where each layer is a parametrized nonlinear transformation: $$\sigma(\mathbf z^{\mathrm T} \mathbf W + \mathbf b),$$ where $\theta=\{\mathbf{W},\mathbf{b}\}$ are the learnable parameters and $\sigma$ is a nonlinear activation function, such as $\tanh$
* They are *universal function approximators* with many parameters $\theta$

* Given training data, we seek the best set of parameters by minimizing a loss function $L(\theta)$, e.g. a mean squared error, on that training data by means of gradient descent optimization, for which we need $$\nabla_\theta L(\mathbf \theta).$$ 
* The gradients are computed via *backpropagation*, i.e. the chain rule of derivatives
* ML libraries compute this via an automatic differentiation system that systematically tracks all elementary operations performed
* The parameters are then updated with some form of gradient descent 
$$\theta_{new} = \theta - \eta \nabla_\theta L(\mathbf \theta),$$ 
with some learning rate $\eta$
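
The update rule above can be sketched for the simplest possible model, a line $y = wx + b$ fitted with a mean squared error. This is only an illustration: the gradients are written out by hand instead of being obtained by backpropagation or automatic differentiation, and the data, learning rate and function names are assumptions made for the example.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error L(theta) of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, b, xs, ys):
    """Hand-derived gradient of the loss with respect to theta = (w, b)."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, eta=0.1, n_steps=2000):
    """Repeat theta_new = theta - eta * grad L(theta), starting from zero."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        dw, db = mse_gradient(w, b, xs, ys)
        w, b = w - eta * dw, b - eta * db
    return w, b

# Noise-free training data generated from y = 3x - 1
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [3.0 * x - 1.0 for x in xs]
w, b = gradient_descent(xs, ys)
```

In an ANN the same loop is used, only with many more parameters and with the gradient supplied by the automatic differentiation system of the ML library.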

### The Key Ingredients 

* Training data 
* A parametrized model 
* A loss function 
* The ability to take gradients of the loss function, to update the parameters of the model

## Neural Differential Equations 

* With Neural Differential Equations (Neural DEs) we try to combine the knowledge that we have of systems, in the form of their governing equations, with data-driven approximators such as ANNs

* In fact, there is an analogy between differential equations and ANNs:

* A Residual Network (ResNet) block is defined by $$\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t;\theta_t),$$ where $f$ can be any combination of other neural network layers with parameters $\theta_t$

* Through their shortcut connection (see the image), ResNets learn a residual (hence the name). They have proved to be an effective architecture for a wide variety of problems

* Compare that to the Euler solver to discretize and solve an ODE: 
$$\begin{aligned}
\frac{d\mathbf{h}(t)}{dt} &= f(\mathbf{h}(t),t;\theta)\\
\mathbf{h}_{t+1} &= \mathbf{h}_t + \Delta t\, f(\mathbf{h}_t, t;\theta)
\end{aligned}$$

* Differential equations can be seen as a continuous-time limit of ResNets
* Several papers use this analogy to evaluate ResNets with ODE solvers, but this is not our primary interest
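
The analogy can be made concrete with a small sketch: a stack of residual blocks that share their parameters produces exactly the same numbers as an Euler solver with $\Delta t = 1$. The single-unit `layer` function and the parameter values are hypothetical choices for illustration; a real ResNet block contains full layers and usually has different parameters $\theta_t$ per block.

```python
import math

def layer(h, theta):
    """Hypothetical residual function f(h; theta): a single tanh unit."""
    w, b = theta
    return math.tanh(w * h + b)

def resnet_forward(h0, thetas):
    """Stack of residual blocks: h_{t+1} = h_t + f(h_t; theta_t)."""
    h = h0
    for theta in thetas:
        h = h + layer(h, theta)
    return h

def euler_solve(h0, theta, dt, n_steps):
    """Euler discretization of dh/dt = f(h; theta): h_{t+1} = h_t + dt * f(h_t; theta)."""
    h = h0
    for _ in range(n_steps):
        h = h + dt * layer(h, theta)
    return h

theta = (-0.8, 0.3)
h_resnet = resnet_forward(0.5, [theta] * 4)          # four blocks, shared weights
h_euler = euler_solve(0.5, theta, dt=1.0, n_steps=4)  # four Euler steps of size 1
h_fine = euler_solve(0.5, theta, dt=0.01, n_steps=400)  # same time span, finer steps
```

Here `h_resnet` and `h_euler` coincide, while `h_fine` approximates the continuous-time solution over the same time span more closely.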

### The Universal Differential Equations Framework

* If we can treat differential equations and ANNs so similarly, we can just combine them directly:

$$\frac{du}{dt} = f(u,t,U_\theta(u,t)),$$ 
where $U_{\theta}$ is some data-driven function approximator (such as an ANN)
  
![image-2.png](projects/NeuralDifferentialEquations/notebook-assets/overview2.png)

* We can integrate these Neural Differential Equations numerically, like any other differential equation we've seen before, resulting in a trajectory $$\hat{\mathbf{u}}(\mathbf{x},t;\theta)$$

* `DiffEqFlux.jl`/`DiffEqSensitivity.jl` (the latter since renamed to `SciMLSensitivity.jl`) are by far the most comprehensive implementations of this approach across all programming languages
* `torchdiffeq` offers some of the functionality for PyTorch
* `diffrax` offers some of the functionality for JAX
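
To illustrate the structure $\frac{du}{dt} = f(u,t,U_\theta(u,t))$ without relying on any of these libraries, here is a minimal NumPy sketch. It assumes that the known physics is a linear decay term and that the unknown part is a small MLP $U_\theta$; it integrates with a plain Euler scheme. The decay rate, network size and all names are assumptions for the example, and the block does not reproduce the API of `DiffEqFlux.jl`, `torchdiffeq` or `diffrax`.

```python
import numpy as np

def mlp(u, theta):
    """Data-driven part U_theta(u): one hidden tanh layer."""
    W1, b1, W2, b2 = theta
    return W2 @ np.tanh(W1 @ u + b1) + b2

def ude_rhs(u, t, theta, decay=0.5):
    """Right-hand side f(u, t, U_theta(u, t)): known decay plus learned correction."""
    known_physics = -decay * u
    return known_physics + mlp(u, theta)

def integrate(u0, t0, dt, n_steps, theta):
    """Euler integration of the universal differential equation."""
    traj = [np.asarray(u0, dtype=float)]
    for n in range(n_steps):
        u = traj[-1]
        traj.append(u + dt * ude_rhs(u, t0 + n * dt, theta))
    return np.stack(traj)

rng = np.random.default_rng(0)
dim, hidden = 2, 8
theta = (0.1 * rng.standard_normal((hidden, dim)), np.zeros(hidden),
         0.1 * rng.standard_normal((dim, hidden)), np.zeros(dim))

# Trajectory u_hat(t; theta) of the (still untrained) model
u_hat = integrate([1.0, 0.0], t0=0.0, dt=0.01, n_steps=100, theta=theta)
```

With all network weights set to zero the model falls back to the known physics alone, which is a convenient sanity check before training.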

### How to train them? 

* We train them by minimizing a loss function $L(\theta)$, e.g.: 
    * Given some example trajectories $\mathbf{u}$ as training data, we can define a loss function $$L(\theta)= \sum_{i_t,\mathbf{x}} ( \mathbf{u}(\mathbf{x},i_t) - \hat{\mathbf{u}}(\mathbf{x},i_t;\theta) )^2$$

* For this approach to work we need the ability to take the derivatives of the trajectories with respect to the parameters $\theta$
* We can in fact do this by combining adjoint sensitivity analysis with automatic differentiation (AD) techniques
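
The idea of differentiating through a trajectory can be sketched on a one-parameter toy problem. Note the swap: instead of adjoint sensitivity analysis with AD, the block uses the simpler forward sensitivity equation $\frac{ds}{dt} = \frac{\partial f}{\partial u}s + \frac{\partial f}{\partial \theta}$ with $s = \partial u/\partial\theta$, derived by hand for the assumed model $\frac{du}{dt} = -\theta u$. The model, the Euler solver, the learning rate and all names are illustrative assumptions.

```python
def simulate(theta, u0, dt, n_steps):
    """Euler-integrate du/dt = -theta * u together with its sensitivity s = du/dtheta."""
    u, s = u0, 0.0
    us, ss = [u], [s]
    for _ in range(n_steps):
        # Both updates use the old values of u and s on the right-hand side
        u, s = u + dt * (-theta * u), s + dt * (-theta * s - u)
        us.append(u)
        ss.append(s)
    return us, ss

def loss_and_grad(theta, data, u0, dt):
    """Sum-of-squares trajectory loss L(theta) and its derivative dL/dtheta."""
    us, ss = simulate(theta, u0, dt, len(data) - 1)
    loss = sum((d - u) ** 2 for d, u in zip(data, us))
    grad = sum(-2.0 * (d - u) * s for d, u, s in zip(data, us, ss))
    return loss, grad

def train(data, u0, dt, theta=0.5, eta=0.05, n_iter=1000):
    """Plain gradient descent on the trajectory loss."""
    for _ in range(n_iter):
        _, grad = loss_and_grad(theta, data, u0, dt)
        theta -= eta * grad
    return theta

# Training trajectory generated with the "true" parameter theta = 1.5
data, _ = simulate(1.5, u0=1.0, dt=0.1, n_steps=20)
theta_fit = train(data, u0=1.0, dt=0.1)
```

Forward sensitivities scale poorly with the number of parameters, which is why adjoint methods are preferred once $f$ contains an ANN with many weights.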

## Reservoir Computing 

* Reservoir computing is a type of **recurrent neural network (RNN)**, a family of neural networks for processing sequential data.

* Consider a sequence of the form

$$\mathbf s^{(t)} = f(\mathbf s^{(t-1)}; \mathbf \theta),$$

where $\mathbf s^{(t)}$ is a vector describing the value of the sequence at the discrete index $t$, and $\mathbf \theta$ are some parameters of the function $f$.

![rnn1](notebook-assets/rnn-1.png) [Source](https://www.deeplearningbook.org/)

* We call such a sequence **recurrent** because the definition of $\mathbf s$ at time $t$ refers back to the same definition at time $t-1$.

* More generally, we can also consider an RNN that exhibits memory (e.g. of an external driver of the dynamical system)

* This is usually realised by including a hidden state $\mathbf{h}$

$$\mathbf h^{(t)} = f(\mathbf h^{(t-1)}, \mathbf x^{(t)}; \mathbf \theta)$$

* We update the hidden state at each iteration, and the output at each step is usually given as a function of the hidden state:

$$\mathbf{s}^{(t)} = g(\mathbf{h}^{(t)}, \mathbf{s}^{(t-1)}; \theta)$$

This will be the **general form of an RNN** that we will consider. 

* An RNN can develop self-sustained dynamics due to its recurrent connections, even in the absence of input. Indeed, it can be shown that, under fairly mild and general assumptions, RNNs are **universal approximators of dynamical systems**.
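
A minimal NumPy sketch of this general form is given below. It assumes a $\tanh$ update for the hidden state and a linear readout that, for simplicity, does not use the previous output $\mathbf{s}^{(t-1)}$. The weight names `W`, `U`, `V`, the sizes and the random initialization are assumptions for illustration.

```python
import numpy as np

def rnn_step(h_prev, x, params):
    """Hidden-state update h_t = f(h_{t-1}, x_t; theta)."""
    return np.tanh(params["W"] @ h_prev + params["U"] @ x + params["b"])

def readout(h, params):
    """Output s_t = g(h_t; theta), here a linear map of the hidden state."""
    return params["V"] @ h + params["c"]

def run_rnn(xs, h0, params):
    """Apply the same update and readout at every step of the input sequence."""
    h, outputs = h0, []
    for x in xs:
        h = rnn_step(h, x, params)
        outputs.append(readout(h, params))
    return np.stack(outputs), h

rng = np.random.default_rng(0)
n_hidden, n_in, n_out = 16, 3, 2
params = {
    "W": 0.3 * rng.standard_normal((n_hidden, n_hidden)),  # recurrent weights
    "U": 0.3 * rng.standard_normal((n_hidden, n_in)),      # input weights
    "b": np.zeros(n_hidden),
    "V": 0.3 * rng.standard_normal((n_out, n_hidden)),     # readout weights
    "c": np.zeros(n_out),
}

xs = rng.standard_normal((50, n_in))  # input sequence x_1, ..., x_50
outputs, h_final = run_rnn(xs, np.zeros(n_hidden), params)
```

Because the same parameters are reused at every step, the hidden state carries information about earlier inputs forward in time.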


* RNNs suffer from the **vanishing and exploding gradient problem**
* When computing gradients through a long chain of RNN iterations, the gradients are repeatedly scaled by the eigenvalues of the Jacobian of the iteration map
* This can lead to the gradients vanishing (eigenvalue magnitudes below one) or exploding (magnitudes above one)
* There are several ways to mitigate this problem
* One particularly simple approach is **Reservoir Computing**
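
The scaling argument can be seen in a scalar toy case. For the assumed linear recurrence $h_t = w\,h_{t-1}$ the single eigenvalue is $w$, and the chain rule gives $\partial h_T/\partial h_0 = w^T$. The recurrence and the values of $w$ are chosen only for illustration.

```python
def gradient_through_time(w, n_steps):
    """Derivative dh_T/dh_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= w  # chain rule: each step contributes the factor dh_t/dh_{t-1} = w
    return grad

vanishing = gradient_through_time(0.9, 100)  # roughly 2.7e-5
exploding = gradient_through_time(1.1, 100)  # roughly 1.4e4
```

Even a modest deviation of the eigenvalue from one changes the gradient by many orders of magnitude after 100 steps.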

## Reservoir Computing 

* Reservoir computing is one implementation of RNNs 

![rnn1](notebook-assets/reservoir.png)

* They are ANNs with 
    * an input layer $W_{in}$
    * a large hidden layer, the reservoir, that is usually a sparse random network $W$
    * an output layer $W_{out}$
    
* The key trick of reservoir computing is that **only the output layer is trained**; the input layer and the reservoir are randomly initialized and then kept fixed 

* This has the advantage that the training can be done via linear regression

* Reservoir computing has been applied successfully both to prototypical chaotic systems and to climate phenomena 
* See e.g.:
    1. [Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data (Pathak et al. 2017)](https://aip.scitation.org/doi/10.1063/1.5010300)
    2. [Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach (Pathak et al. 2018)](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.024102)
    3. [Seasonal prediction of Indian summer monsoon onset with echo state networks (Mitsui & Boers 2021)](https://iopscience.iop.org/article/10.1088/1748-9326/ac0acb)
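
The reservoir recipe described above (fixed random $W_{in}$ and $W$, trained $W_{out}$) can be sketched in a few lines of NumPy. This is a minimal echo state network under stated assumptions: the reservoir size, sparsity, spectral radius, ridge parameter and the sine-wave toy task are illustrative choices, not tuned values or the setup of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative hyperparameters (assumptions, not tuned values)
n_res, density, spectral_radius, ridge = 100, 0.1, 0.9, 1e-6

# Fixed random input layer W_in and sparse random reservoir W (never trained)
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
W *= rng.random((n_res, n_res)) < density                    # keep ~10% of the links
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale spectral radius

def run_reservoir(inputs):
    """Drive the reservoir with a scalar input sequence and record its states."""
    r = np.zeros(n_res)
    states = []
    for x in inputs:
        r = np.tanh(W @ r + W_in * x)
        states.append(r)
    return np.array(states)

# Toy task: predict the next value of a sine wave
signal = np.sin(0.1 * np.arange(1000))
states = run_reservoir(signal[:-1])  # reservoir state after seeing x_t
targets = signal[1:]                 # desired output x_{t+1}

washout, n_train = 100, 700          # discard the initial transient
R_train, y_train = states[washout:n_train], targets[washout:n_train]
R_test, y_test = states[n_train:], targets[n_train:]

# Training = ridge regression for the output layer only
W_out = np.linalg.solve(R_train.T @ R_train + ridge * np.eye(n_res),
                        R_train.T @ y_train)

rmse = np.sqrt(np.mean((R_test @ W_out - y_test) ** 2))
```

The only learning step is the linear solve for `W_out`, so no gradients have to be propagated through the recurrent iterations.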
   

## Symbolic Methods (SINDy)

* Another approach is to estimate the equation of the dynamical system directly from data, so that we really reconstruct the symbolic expression of $f$ rather than just approximating $f$ numerically with a function approximator such as an ANN

* **Symbolic Regression** 

    * Instead of prescribing the functional form in advance, can't we let an algorithm find the functional form? 
    * Symbolic Regression tries to find the mathematical expression that best fits the data $\mathbf{X}$
    * Applied to dynamical systems, it usually tries to find the mathematical expression for the right-hand side $f$ of the system
    * Therefore we often also need derivative data (e.g. computed with finite differences) 
    * Symbolic regression usually provides a dictionary of possible expressions (e.g. polynomials up to a certain degree, trigonometric functions, etc.) and then performs a regression to determine the coefficients or parameters of these elementary functions
    * But there are infinitely many combinations of expressions: 
    
    ![Symbolic Regression](notebook-assets/slice2.jpg)
    
    * Most natural laws and equations involve just a handful of terms. A candidate model should be complex enough to replicate the behaviour of the system but also "simple" (see Occam's razor)
    * Therefore some form of sparsity constraint is often applied to the regression, and one chooses to consider only certain operations and expressions 
    * [AI Feynman by Udrescu and Tegmark](https://arxiv.org/abs/1905.11481) attracted some attention: they perform a symbolic regression with several pre- and post-processing steps and apply it successfully to equations from the Feynman Lectures on Physics
    

* All of these methods have limitations when the complexity of the problem increases or when the data become noisy or high-dimensional

* One reason: There are just too many possible combinations of expressions to be considered 
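
The dictionary-plus-sparsity idea can be sketched with a sequentially thresholded least-squares fit, the kind of sparse regression used in SINDy. The block is a minimal illustration under stated assumptions: the test system is the logistic equation $\frac{dx}{dt} = x - x^2$, the dictionary contains polynomials up to degree three, derivatives come from finite differences, and the threshold and function names are chosen for the example.

```python
import numpy as np

def build_library(x):
    """Dictionary of candidate terms evaluated on the data: 1, x, x^2, x^3."""
    names = ["1", "x", "x^2", "x^3"]
    Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
    return Theta, names

def stlsq(Theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0               # sparsity constraint: drop negligible terms
        active = ~small
        if not active.any():
            break
        # Refit only the coefficients of the remaining dictionary terms
        xi[active] = np.linalg.lstsq(Theta[:, active], dxdt, rcond=None)[0]
    return xi

# Synthetic data from the exact solution of dx/dt = x - x^2 with x(0) = 0.1
dt, x0 = 0.01, 0.1
t = np.arange(0.0, 10.0, dt)
x = 1.0 / (1.0 + (1.0 / x0 - 1.0) * np.exp(-t))

# Derivative data from finite differences
dxdt = np.gradient(x, dt, edge_order=2)

Theta, names = build_library(x)
xi = stlsq(Theta, dxdt)
equation = "dx/dt = " + " + ".join(
    f"{c:.3f}*{n}" for c, n in zip(xi, names) if c != 0.0)
```

On this clean, low-dimensional example the fit should keep only the $x$ and $x^2$ terms. With noisy data the finite-difference derivatives degrade quickly, which is one of the limitations mentioned above.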



## Projects

Now it's your turn! 

* For each of the methods we have prepared resources for you to get started and work on a small project. 

* You find everything at https://github.com/TUM-PIK-ESM/ML-DS-Workshop-23 

* The easiest way to get started is to clone the repository 

## Resources  

* A Julia cheat-sheet 
* [Projects with Neural Differential Equations](projects/NeuralDifferentialEquations/NeuralDifferentialEquations.ipynb)
* Reservoir Computing 
* SINDy