
We have a model and data. How do we fit that model to the data? This problem goes by many names: parameter estimation, inverse problems, training, etc.

## Shooting Method for Parameter Fitting

We have some model $u=f(p)$ where $p$ represents our parameters. We put in some parameters $p$ and receive our simulated data $u$. How should we choose $p$ such that $u$ best fits the data $y$?

The **shooting method** directly uses this model by putting a cost function $C(p)$ on the output. A common choice is the L2 loss:
$$
C(p)=\|f(p)-y\|
$$
where $C: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is a function that returns a scalar. The shooting method directly optimizes this cost function by having the optimizer generate data given new choices of $p$.
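As a minimal sketch of what such a cost function looks like in Julia: the exponential-decay model `f`, the sample times `ts`, and the parameters `p_true` are all made up for illustration, and the data `y` is synthetic.

```julia
using LinearAlgebra  # provides norm

# Hypothetical model: exponential decay sampled at fixed times.
# p = [amplitude, rate]; this model is an assumption, not from the notes.
ts = 0.0:0.5:2.0
f(p) = p[1] .* exp.(-p[2] .* ts)

# Synthetic "measurements" generated from known parameters.
p_true = [2.0, 0.7]
y = f(p_true)

# L2 cost: takes a parameter vector and returns a single scalar.
C(p) = norm(f(p) - y)

C(p_true)       # zero, because the data was generated from p_true
C([1.0, 1.0])   # positive for a different guess
```

Every call to `C` runs the model once, which is the "optimizer generates data" step of the shooting method.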

Julia has several nonlinear optimization packages. Here are three: JuMP.jl, Optim.jl, NLopt.jl.

In general there are two classes of methods:

Local Optimization - Attempt to find the best nearby extremum by finding a point where $\frac{d C}{d p}=0$.

Global Optimization - Attempt to explore the whole space and find the best of the extrema. Global methods are extremely computationally difficult.

We'll look at local optimization.

The simplest of these is gradient descent, where given the parameters $p_i$, the next step $p_{i+1}$ is given by:
$$
p_{i+1}=p_{i}-\alpha \frac{d C}{d p}
$$
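A small sketch of this update in Julia, assuming a made-up quadratic cost whose gradient is written out by hand. The names `gradient_descent`, `dCdp`, and the values of `α` and `iters` are illustrative choices.

```julia
# Hypothetical quadratic cost with its minimum at p = [1, -2].
C(p) = (p[1] - 1)^2 + 2 * (p[2] + 2)^2
dCdp(p) = [2 * (p[1] - 1), 4 * (p[2] + 2)]   # gradient, derived by hand

# Repeatedly step against the gradient with a fixed step size α.
function gradient_descent(dCdp, p0; α = 0.1, iters = 200)
    p = copy(p0)
    for _ in 1:iters
        p = p - α * dCdp(p)
    end
    return p
end

p_min = gradient_descent(dCdp, [0.0, 0.0])   # approaches [1, -2]
```

With a step size this small the iterates contract towards the minimum on every step; a much larger `α` would make them diverge.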

We could do this in a first-order fashion, slowly following the path of steepest descent. Or we could think of this problem using Newton's method. What if we just treat it like a rootfinding problem, solving for when $\frac{dC}{dp}=0$?

Newton's method would then look like:


$$
p_{i+1}=p_{i}- \frac{d C}{d p}\left(\frac{d}{d p} \frac{d C}{d p}\right)^{-1}
$$

$$
p_{i+1}=p_{i}-\frac{d C}{d p}\left( \frac{d^2 C}{d p^2}\right)^{-1} 
$$


*(Reminder): Newton's Method is formally:*
$$x_{n+1}=x_{n}-\frac{f\left(x_{n}\right)}{f^{\prime}\left(x_{n}\right)}$$


But $\frac{d^2 C}{d p^2}$ is the Hessian (the Jacobian of the gradient)! Thus we can rewrite the update as:
$$
p_{i+1}=p_{i}-H\left(p_{i}\right)^{-1} \frac{d C\left(p_{i}\right)}{d p}
$$
where $H(p)$ is the Hessian matrix $H_{i j}=\frac{\partial^{2} C}{\partial p_{i} \partial p_{j}}$.
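Here is a rough sketch of that Newton iteration in Julia. The cost is a made-up non-quadratic function with hand-derived gradient and Hessian, and `newton_minimize` is a hypothetical helper name. The linear system is solved with `\` rather than forming $H^{-1}$ explicitly.

```julia
using LinearAlgebra

# Hypothetical cost C(p) = (p1 - 1)^4 + (p1 - 1)^2 + (p2 + 2)^2, minimum at [1, -2].
grad(p) = [4 * (p[1] - 1)^3 + 2 * (p[1] - 1), 2 * (p[2] + 2)]

function hess(p)
    h11 = 12 * (p[1] - 1)^2 + 2
    return [h11 0.0; 0.0 2.0]
end

# Newton's method applied to the rootfinding problem grad(p) = 0.
function newton_minimize(grad, hess, p0; iters = 20)
    p = copy(p0)
    for _ in 1:iters
        p = p - hess(p) \ grad(p)   # solve H(p) * step = grad(p)
    end
    return p
end

p_star = newton_minimize(grad, hess, [3.0, 0.0])   # approaches [1, -2]
```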

Solving linear systems with the Hessian is expensive. Many techniques attempt to avoid the Hessian (as well as the Jacobian). A common technique is BFGS, a gradient-based optimization method that approximates the Hessian along the way to modify its stepping behavior. It uses the history of previously calculated points to build this Hessian approximation. If you keep only a constant length of history, you get the L-BFGS technique, which is one of the most common large-scale optimization techniques.

# Connecting optimization and differential equations...
Let's say we want to follow the gradient of the cost function towards a local minimum. That would mean that the flow we wish to follow is given by an ODE...

Specifically the ODE:
$$
p^{\prime}=-\frac{d C}{d p}
$$

OK, we can apply the Euler method with step size $\alpha$ to this and get:
$$
p_{n+1}=p_{n}-\alpha \frac{d C\left(p_{n}\right)}{d p}
$$
(We get gradient descent ^^). 

Now let's apply implicit Euler. We get:
$$
p_{n+1}=p_{n}-\alpha \frac{d C\left(p_{n+1}\right)}{d p}
$$
This is an implicit equation for $p_{n+1}$, which we can solve as a rootfinding problem with Newton's method, as before.
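To make the two discretizations concrete, here is a sketch for a made-up scalar cost $C(p)=p^4/4$. The helper names `explicit_step` and `implicit_step` and the inner Newton iteration count are assumptions for illustration.

```julia
# Hypothetical scalar cost C(p) = p^4 / 4, so dC/dp = p^3 and d²C/dp² = 3p^2.
dC(p) = p^3
d2C(p) = 3 * p^2

# Explicit Euler on p' = -dC/dp: exactly one gradient descent step.
explicit_step(p, α) = p - α * dC(p)

# Implicit Euler: find q with g(q) = q - p + α * dC(q) = 0 via Newton's method.
function implicit_step(p, α; newton_iters = 50)
    q = p                          # initial guess: the current iterate
    for _ in 1:newton_iters
        g  = q - p + α * dC(q)     # residual of the implicit equation
        dg = 1 + α * d2C(q)        # its derivative with respect to q
        q -= g / dg
    end
    return q
end

p0, α = 2.0, 1.0
p_explicit = explicit_step(p0, α)   # jumps far past the minimum at 0
p_implicit = implicit_step(p0, α)   # stays between 0 and p0
```

For this step size the explicit step overshoots badly while the implicit step does not, at the price of an inner Newton solve per step.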


# Neural Network Training as a shooting method for functions

A single dense layer is traditionally written as the function:
$$\text{layer}(x) = \sigma .(Wx+b)$$

<span style="color:red"> A "layer" multiplies the weight matrix $W$ by the input array $x$ (the previous layer's values) and adds the bias array $b$, which holds one bias per node. An activation function $\sigma$ is then applied element-wise, giving the array of the next layer's values. I'm currently working on a "neural nets from scratch" implementation that could be linked here that explains this a little bit better. </span>



where $x \in \mathbb{R}^{n}$, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, and $\sigma$ is some choice of $\mathbb{R} \rightarrow \mathbb{R}$ nonlinear function, where the `.` is the Julia dot to signify element-wise operation.
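In Julia this definition is nearly a one-liner. The sketch below uses made-up sizes ($n=3$, $m=2$), made-up weights, and `tanh` as an example activation.

```julia
# One dense layer: an affine map followed by an element-wise nonlinearity.
layer(x, W, b, σ) = σ.(W * x .+ b)

# Hypothetical sizes: n = 3 inputs, m = 2 outputs.
W = [1.0 0.0 -1.0; 0.5 0.5 0.5]   # m × n weight matrix
b = [0.0, 1.0]                     # length-m bias
x = [1.0, 2.0, 3.0]                # length-n input

out = layer(x, W, b, tanh)         # length-m output
```

The `σ.(...)` call is the Julia dot broadcast: it applies the scalar function `σ` to every entry of the array.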




A traditional neural network is a three-layer function:
$$
N N(x)=W_{3} \sigma_{2} \cdot\left(W_{2} \sigma_{1} \cdot\left(W_{1} x+b_{1}\right)+b_{2}\right)+b_{3}
$$
where the first layer is the input layer, second is the hidden layer, and the final is called the output layer. 

However, why do we need to restrict ourselves to only one hidden layer? This is where *deep neural networks (DNN)* come in.

$$
\begin{array}{c}v_{i+1}=\sigma_{i} \cdot\left(W_{i} v_{i}+b_{i}\right) \\ v_{1}=x \\ D N N(x)=v_{n}\end{array}
$$

for $n$ layers. This construction gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function $f$, one can define a cost function:

$$
C(p)=\|D N N(x ; p)-f(x)\|
$$
(In practice we only have discrete samples $x_k$, so we sum over them):
$$
C(p)=\sum_{k=1}^{N}\left\|D N N\left(x_{k} ; p\right)-f\left(x_{k}\right)\right\|
$$
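A sketch of the recursion and the sampled cost in Julia. Storing the parameters as a vector of `(W, b)` pairs is an assumed layout, and the tiny 1 → 2 → 1 network, the `relu` activation, and the target $f(x)=|x|$ are made up so that the fit happens to be exact.

```julia
using LinearAlgebra

relu(z) = max(z, 0.0)

# DNN as a loop over layers: v_{i+1} = σ_i.(W_i * v_i + b_i), starting from v_1 = x.
function DNN(x, p, σs)
    v = x
    for ((W, b), σ) in zip(p, σs)
        v = σ.(W * v .+ b)
    end
    return v
end

# Hand-picked weights: relu(x) + relu(-x) reproduces abs(x).
p  = [(reshape([1.0, -1.0], 2, 1), [0.0, 0.0]), ([1.0 1.0], [0.0])]
σs = (relu, identity)

f(x) = abs(x)                     # the "unknown" function we want to fit
xs = [-2.0, -1.0, 0.0, 0.5, 3.0]  # sample points x_k

# Discrete cost: sum of the errors at each sample.
C(p) = sum(norm(DNN([xk], p, σs) - [f(xk)]) for xk in xs)

C(p)   # zero for these weights

# Changing one weight breaks the fit, so the cost becomes positive.
p_bad = [(reshape([1.0, -1.0], 2, 1), [0.0, 0.0]), ([1.0 0.5], [0.0])]
C(p_bad)
```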

## <span style="color:red"> OK OK here's the cool part... </span>

## Recurrent Neural Networks (RNN)

A recurrent neural network is a network given by a recurrence relation:
$$
x_{k+1} = x_k + \text{DNN}(x_k, k ;p)
$$

But this is **exactly** equivalent to the Euler discretization with $\Delta t = 1$ of the *neural ordinary differential equation* defined by:

$$x' = \text{DNN}(x,t;p)$$

The neural net *is* the discrete ODE solver...
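A quick numerical check of this equivalence. The function `rhs` is a made-up stand-in for $\text{DNN}(x,t;p)$ (a single `tanh` layer with two scalar parameters), and `rnn` and `euler` are hypothetical helper names.

```julia
# Stand-in for DNN(x, t; p): a tiny hand-written tanh layer (an assumption).
rhs(x, t, p) = p[1] .* tanh.(p[2] .* x .+ t)

# The recurrence x_{k+1} = x_k + DNN(x_k, k; p).
function rnn(x0, p, steps)
    x = x0
    for k in 0:steps-1
        x = x + rhs(x, k, p)
    end
    return x
end

# Explicit Euler with step Δt on x' = DNN(x, t; p).
function euler(x0, p, Δt, tfinal)
    x = x0
    nsteps = round(Int, tfinal / Δt)
    for i in 0:nsteps-1
        t = i * Δt
        x = x + Δt * rhs(x, t, p)
    end
    return x
end

p  = (-0.5, 1.0)
x0 = [1.0, -2.0]

x_rnn   = rnn(x0, p, 5)
x_euler = euler(x0, p, 1.0, 5.0)   # Δt = 1 gives the same iterates as the recurrence
x_fine  = euler(x0, p, 0.01, 5.0)  # a smaller Δt approximates the ODE more closely
```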

# Computing Gradients
Many problems, from training neural networks to fitting differential equations, have the same mathematical structure: they require the gradient of a cost function given model evaluations. This reduces to computing the gradient of the model's output with respect to the parameters. To see this, take the squared L2 cost:
$$
C(p)=\sum_{i=1}^{N}\left\|f\left(x_{i} ; p\right)-y_{i}\right\|^{2}
$$
By the chain rule:
$$
\frac{d C}{d p}=\sum_{i=1}^{N} 2\left(f\left(x_{i} ; p\right)-y_{i}\right) \frac{d f\left(x_{i} ; p\right)}{d p}
$$
How to efficiently compute $\frac{df(x_i; p)}{dp}$ is the essential question for shooting-based parameter fitting.
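The chain-rule formula can be checked numerically. This sketch assumes the squared L2 cost (which is what the factor of 2 corresponds to), a made-up scalar model $f(x;p)=p_1 e^{p_2 x}$ with a hand-derived $\frac{df}{dp}$, and made-up data. The `finite_difference` helper is only there as an independent check.

```julia
# Hypothetical scalar model and its derivative with respect to the parameters.
f(x, p) = p[1] * exp(p[2] * x)
dfdp(x, p) = [exp(p[2] * x), p[1] * x * exp(p[2] * x)]

xs = [0.0, 0.5, 1.0, 1.5]
ys = [1.0, 1.2, 1.9, 2.4]   # made-up data

# Squared L2 cost and its gradient from the chain rule.
C(p) = sum((f(x, p) - y)^2 for (x, y) in zip(xs, ys))
dCdp(p) = sum(2 * (f(x, p) - y) * dfdp(x, p) for (x, y) in zip(xs, ys))

# Central finite differences, one parameter at a time.
function finite_difference(C, p; h = 1e-6)
    g = zeros(length(p))
    for i in eachindex(p)
        e = zeros(length(p))
        e[i] = h
        g[i] = (C(p + e) - C(p - e)) / (2 * h)
    end
    return g
end

p = [1.0, 0.5]
g_chain = dCdp(p)
g_fd    = finite_difference(C, p)   # should agree with g_chain to several digits
```

Note that the finite-difference check needs two cost evaluations per parameter, which is why it does not scale to models with many parameters.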

## Forward and Backward Propagation and their Pros and Cons are covered in NeuralNetsFromScratch. (Not that anyone is actually reading this)...

But OK, after long maths, backward propagation wins! With one traversal of the graph we get the derivatives of one output with respect to all of the inputs, which is exactly what a scalar cost function with many parameters needs.

**Sensitivity** The derivative. 

# Logistic Regression Example

In [None]:
using LinearAlgebra  # provides norm

function L2_loss(σ, sol, W)
    # σ is the model function
    # sol is the target output we compare against
    # W is the weights passed to the model
    norm(σ(W) - sol)
end