<a href="https://colab.research.google.com/github/GDS-Education-Community-of-Practice/DSECOP/blob/main/Automatic_Differentiation/01_parameter_estimation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Estimating parameters of a differential equation

## RLC Circuit

The RLC circuit is a simple model of a resistor, inductor, and
capacitor connected in series, as shown below:

![Example of an RLC circuit](https://raw.githubusercontent.com/GDS-Education-Community-of-Practice/DSECOP/main/Automatic_Differentiation/images/rlc_circuit.drawio.svg)

When the switch is closed, the battery/generator drives an electrical
current through the entire circuit. The voltage added by this
battery/generator can be written as $E(t)$. At any time $t$, the current
at any point in the circuit is given by $I(t)$: if $I(t) > 0$ the
flow is clockwise around the circuit, and if $I(t) < 0$ the flow is
counter-clockwise.

The first component in the circuit is the “resistor” which reduces DC or
AC current and releases the excess energy as heat. It causes a voltage
potential change from one end to the other. This voltage change caused
by the resistor can be written as:

<span id="eq-resistance">$$ V_R = RI(t)  \qquad(1)$$</span>

where $R>0$ is the resistance for the given resistor.

The next component is the “inductor” which also reduces AC current but
stores the excess energy in the form of a magnetic field. It also causes
a voltage potential change from one end to the other. This voltage
change can be written as:

<span id="eq-inductance">$$ V_L = L\frac{d}{dt}I(t)  \qquad(2)$$</span>

where $L>0$ is the inductance for the given inductor.

The final component is the “capacitor” which stores up electrical charge
$Q(t)$. This charge is related to the current $I(t)$ by the following
integral:

$$ Q(t) = Q(t_0) + \int_{t_0}^t I(s) ds. $$

This integral expresses that the charge at time $t$ is equal to the
charge at the starting time $t_0$ plus the current accumulated from the
start time until the present. Since the current is a continuous quantity,
an integral rather than a discrete sum is needed to accumulate it. This
relationship can be written more simply as $\frac{d}{dt}Q(t) = I(t)$.
Given this, the voltage change caused by the capacitor can be written as:

<span id="eq-capacitance">$$ V_C = \frac{Q(t)}{C}  \qquad(3)$$</span>

where $C>0$ is the capacitance for the given capacitor.

[Kirchhoff’s
law](https://en.wikipedia.org/wiki/Kirchhoff%27s_circuit_laws) states
that the directed sum of voltages across a closed loop is 0 or equal to
the “impressed voltage” or the voltage added from an energy source. In
this RLC case, this means that:

$$ V_R + V_L + V_C = E(t). $$

By substituting in our voltages from [Equation 1](#eq-resistance),
[Equation 2](#eq-inductance), [Equation 3](#eq-capacitance), we can
expand this equation to:

$$
\begin{align*}
V_R + V_L + V_C &= E(t) \\
RI(t) + L\frac{d}{dt}I(t) + \frac{Q(t)}{C} &= E(t) \\
R\frac{d}{dt}Q(t) + L\frac{d^2}{dt^2}Q(t) + \frac{Q(t)}{C} &= E(t)
\end{align*}
$$

If we write $Q'(t)$ for $\frac{d}{dt}Q(t)$ and rearrange, we obtain:

<span id="eq-rlc-diffeq">$$ L Q''(t) + RQ'(t) + \frac{Q(t)}{C} = E(t)  \qquad(4)$$</span>

[Reference](https://math.libretexts.org/Bookshelves/Differential_Equations/Elementary_Differential_Equations_with_Boundary_Value_Problems_(Trench)/06%3A_Applications_of_Linear_Second_Order_Equations/6.03%3A_The_RLC_Circuit)

### Simulating the RLC circuit

[Equation 4](#eq-rlc-diffeq) is actually simple enough to be solved with
pen and paper. But in order to practice techniques that can be applied
to more complicated equations that can’t be solved directly, let’s
consider simulating the equation.
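
For reference, here is a sketch of that pen-and-paper solution. It assumes a constant source voltage $E$, zero initial charge and current ($Q(0) = Q'(0) = 0$), and the underdamped case $R^2 < 4L/C$, which holds for the parameter values used later in this notebook. The function name `charge_underdamped` is our own choice for illustration.

```python
import numpy as np

def charge_underdamped(t, R, L, C, E):
    """Closed-form Q(t) for constant E with Q(0) = Q'(0) = 0 (underdamped case only)."""
    a = R / (2 * L)              # exponential decay rate
    w2 = 1 / (L * C) - a**2      # squared oscillation frequency
    if w2 <= 0:
        raise ValueError("not underdamped: this formula requires R**2 < 4*L/C")
    w = np.sqrt(w2)
    # Steady state C*E minus a decaying oscillation
    return C * E * (1 - np.exp(-a * t) * (np.cos(w * t) + (a / w) * np.sin(w * t)))

sample_times = np.linspace(0, 200, 5)
print(charge_underdamped(sample_times, R=1, L=10, C=2, E=60))
```

A closed form like this is a useful sanity check for a numerical simulation, but it only exists for simple equations, which is why the rest of this notebook relies on simulation.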

We can use the `scipy.integrate` Python module to simulate this
equation. First, let’s import the Python libraries we will need:

In [None]:
import numpy as np
import scipy.integrate as si
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

Now, we can set up the simulation parameters such as the starting time
and the equation parameters:

In [None]:
t0 = 0
R = 1
L = 10
C = 2
E = 60

In `scipy`, to solve for a quantity $y(t)$ it is assumed that the
equation is of the form $y'(t) = f(t,y)$, where $f$ is a function of
time $t$ and the quantity of interest $y(t)$. Because our equation is
second order, we need to introduce a “dummy variable” $D(t) = Q'(t)$ to
be able to write it in that form:

$$
\begin{align*}
L Q''(t) + RQ'(t) + \frac{Q(t)}{C} &= E(t) \\
L D'(t) + RD(t) + \frac{Q(t)}{C} &= E(t) \\
D'(t) + \frac{R}{L}D(t) + \frac{Q(t)}{LC} &= \frac{E(t)}{L}
\end{align*}
$$

This may seem like a small change but it allows us to write our equation
in the form:

$$
\frac{d}{dt}
\begin{bmatrix}
D(t) \\ Q(t)
\end{bmatrix}=
\begin{bmatrix}
-\frac{R}{L}D(t) - \frac{Q(t)}{LC} + \frac{E(t)}{L} \\ D(t)
\end{bmatrix}
$$

Now we have a form written as:

$$
\begin{align}
y(t) &= \begin{bmatrix}D(t) \\ Q(t)\end{bmatrix} \\
f(t,y) &= \begin{bmatrix}-\frac{R}{L}D(t) - \frac{Q(t)}{LC} + \frac{E(t)}{L} \\ D(t)\end{bmatrix}
.\end{align}
$$

This can be written in Python (in `scipy` form) as:

In [None]:
# Assuming y[0] = D(t) and y[1] = Q(t)
def f(t,y,R,L,C,E):
    return np.array([-R*y[0]/L - y[1]/(L*C) + E/L, y[0]])

We can now find the charge on the capacitor over time by
simulating the equation (i.e. numerically approximating the solution at
specific points in time) using `scipy` as follows:

In [None]:
tf = 200                         # Final time
y0 = np.zeros(2)                 # Starting conditions, Q(t0) = D(t0) = 0
ps = (R,L,C,E)                   # Parameters: R, L, C, E
times = np.linspace(t0, tf, 200) # Times to collect simulation at
approx_solution = si.solve_ivp(f, (t0, tf), y0, args=ps, t_eval=times)

We can then plot the approximate solution with:

In [None]:
plt.plot(approx_solution.t, approx_solution.y[1])
plt.show()

This demonstrates that after connecting the battery/generator, the
charge on the capacitor rises, oscillates for some time, then ultimately
settles to a constant value (the steady state $Q = CE$).

*Note: `t_eval=times` means that we collect information at `times`
(which we will use later). Plotting only `approx_solution.y[1]` means we
just take the values for $Q$ and not for the dummy variable $D$.*

## Estimating parameters

[Equation 4](#eq-rlc-diffeq) has a few key parameters that influence how
quickly the current stabilizes. Specifically, the resistance $R$, the
inductance $L$, and the capacitance $C$. These are often assumed to be
some values for the sake of simulation and analysis. However, in many
real world systems, these parameters may not be known exactly. For
example, if an RLC circuit comes out of a factory, there may be a range
of possible values for $R$, $L$, and $C$ due to manufacturing tolerances
and material properties.

We thus arrive at a key data science question: **“can we choose
parameters for a given model such that the output matches what we
observe in the real world”?** This problem is called “parameter
estimation” and is extremely important in a wide variety of real world
situations.

**Contextual problem:** Let’s consider that we have some data from an
RLC circuit with a capacitance that is well known to be $C=2$ but for
which $R$ and $L$ are uncertain (having only two unknown parameters
makes things easier to visualize). We’d like to estimate these two
uncertain parameters. The previously simulated data can act as “fake
data” for the measured charge $Q(t)$ in the circuit.

Let’s first define our “objective” (also often called a “loss” or
“error”). We’d like to choose $R$ and $L$ such that the simulated
charge $Q(t)$ matches the observed charge $\hat{Q}(t)$. We can use a
common metric called the “least-squares” or “mean squared error”
comparison for our objective:

$$
\mathcal{L}(R,L) = \frac{1}{N}\sum_{t=t_0}^{t_N} \left( Q(t,R,L) - \hat{Q}(t) \right)^2
$$

Note that because our data is only observed at $N$ specific points, the
objective function is a sum rather than an integral. We use the letter
$\mathcal{L}$ for the objective because it is commonly called the
“loss”. We’d like the loss to be as small as possible because that would
represent the best possible choice of $R$ and $L$. We’d thus like to
“minimize the loss.” We can write this simple loss as a Python function:

In [None]:
# First define how we calculate Q over time with simulation
def Q(R,L):
    return si.solve_ivp(f, (t0, tf), y0, args=(R,L,C,E), t_eval=times).y[1,:]

def loss(R,L,Qhat):
    Q_RL = Q(R,L)
    result = 0
    N = len(Q_RL)
    for i in range(len(Q_RL)):
        result += (Q_RL[i] - Qhat[i])**2
    return result / N

*Note: the calculation of `Q(R,L)` uses the variables
`t0`, `tf`, `y0`, `C`, `E`, and `times` from the previous simulation.*
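
The explicit loop above mirrors the formula term by term. As a sketch of an equivalent, more compact form, the same mean squared error can be computed with NumPy array operations. The helper name `mse` is hypothetical, and it takes the simulated and observed values directly rather than calling the simulation itself.

```python
import numpy as np

def mse(simulated, observed):
    """Mean squared error between two equally long sequences of values."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Square the pointwise differences, then average over all N points
    return np.mean((simulated - observed) ** 2)

# Small hand-checkable example: squared differences are 0, 0, and 4
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Under that assumption, `loss(R,L,Qhat)` would correspond to `mse(Q(R,L), Qhat)`.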

Given this objective, we can try to find the optimal value of $R$ using
simple calculus: for fixed $L$, a minimum of $\mathcal{L}$ with respect
to $R$ at $R=R^*$ requires $\frac{d}{dR}\mathcal{L}(R^*,L) = 0$. We can
write this out as:

<span id="eq-grad-loss">$$
\begin{align}
\frac{d}{dR}\mathcal{L}(R^*,L) &= \frac{d}{dR}\frac{1}{N}\sum_{t=t_0}^{t_N} \left( Q(t, R^*, L) - \hat{Q}(t) \right)^2 \\
 &= \frac{1}{N}\sum_{t=t_0}^{t_N} 2\left( Q(t, R^*, L) - \hat{Q}(t) \right)\frac{d}{dR}Q(t,R^*,L)
\end{align}
 \qquad(5)$$</span>

The exact same formula would be valid if we were looking to find the
optimal value of $L$, so we end up with a system of equations:

$$
\begin{align}
\begin{bmatrix}0\\0\end{bmatrix} &=
\begin{bmatrix}
\frac{1}{N}\sum_{t=t_0}^{t_N} 2\left( Q(t, R^*, L^*) - \hat{Q}(t) \right)\frac{d}{dR}Q(t,R^*,L^*) \\
\frac{1}{N}\sum_{t=t_0}^{t_N} 2\left( Q(t, R^*, L^*) - \hat{Q}(t) \right)\frac{d}{dL}Q(t,R^*,L^*)
\end{bmatrix}
\end{align}
$$

Well, that’s a nice simple formula to find our optimal values! But there
are two outstanding questions that remain:

1.  How can we find the derivative of our simulation values $Q(t,R,L)$
    with respect to $R$ or $L$?
2.  Once we have those derivatives, how can we actually solve the
    equation above?

There are several answers to both questions, but the following two
sections will cover a simple option for each.

### Finite differences

The definition of the derivative $\frac{dQ}{dR}$ is:

$$
\lim_{h\to 0} \frac{Q(t,R+h,L) - Q(t,R,L)}{h}.
$$

On a computer, we can approximate this by choosing a small $h$ and
evaluating the difference quotient directly:

<span id="eq-finite-diff">$$
\frac{dQ}{dR} \approx \frac{Q(t,R+h,L) - Q(t,R,L)}{h}
 \qquad(6)$$</span>

This is called a “finite difference” approximation. Note that this
approximation is only valid for fixed values of $t$ and $L$.
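
To get a feel for how the choice of $h$ affects accuracy, here is a small sketch that applies the same forward difference to a function whose derivative is known exactly ($\sin$, with derivative $\cos$). The helper name `forward_difference` is our own and is not part of the notebook's later code.

```python
import numpy as np

def forward_difference(g, x, h):
    """Approximate g'(x) with the forward difference (g(x + h) - g(x)) / h."""
    return (g(x + h) - g(x)) / h

x = 1.0
exact = np.cos(x)  # exact derivative of sin at x
for h in [1e-1, 1e-2, 1e-3]:
    approx = forward_difference(np.sin, x, h)
    print(f"h = {h:g}: approximation = {approx:.6f}, error = {abs(approx - exact):.2e}")
```

For a smooth function evaluated exactly, the error shrinks roughly in proportion to $h$. When the function is itself the output of a numerical simulation, making $h$ extremely small can backfire, because the difference in the numerator becomes comparable to the simulation's own error.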

We can write a simple function in Python to compute this value for our
simulation:

In [None]:
# Finite-difference approximation of dQ/dR at every time in `times`:
# dQ/dR = [Q(R+h,L) - Q(R,L)] / h
# Note: L is read from the global variable defined earlier.
def dQR(R,h):
    Q_RL = Q(R,L)
    dQdR = (Q(R+h,L) - Q_RL) / h
    return np.array([dQdR])

----
#### Exercise 1
The above function `dQR` only computes the derivative $\frac{dQ}{dR}$. Write a function `dQ(R,L,h)` that computes both $\frac{dQ}{dR}$ and $\frac{dQ}{dL}$ and returns them in a Numpy array `np.array([...])`.

*Solution:*

----
We can then write a function to compute the gradient of our loss
from [Equation 5](#eq-grad-loss) as:

In [None]:
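def dloss(Qhat,R,L,h):
    Q_RL = Q(R,L)        # simulated charge at each comparison time, shape (N,)
    dQ_RL = dQ(R,L,h)    # derivatives from Exercise 1, assumed shape (2,N)
    N = len(Q_RL)
    # Equation 5: average 2*(Q - Qhat)*dQ over time, once per parameter
    return 2*np.sum((Q_RL - Qhat) * dQ_RL, axis=1) / N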

### Gradient descent

Given the derivative information, how can we solve the system of
equations

$$
\begin{align}
\begin{bmatrix}0\\0\end{bmatrix} &=
\begin{bmatrix}
\frac{1}{N}\sum_{t=t_0}^{t_N} 2\left( Q(t, R^*, L^*) - \hat{Q}(t) \right)\frac{d}{dR}Q(t,R^*,L^*) \\
\frac{1}{N}\sum_{t=t_0}^{t_N} 2\left( Q(t, R^*, L^*) - \hat{Q}(t) \right)\frac{d}{dL}Q(t,R^*,L^*)
\end{bmatrix}
?\end{align}
$$

There is a simple technique called “gradient descent” that can be used
to solve this problem. The premise is that if we start from an initial
estimate of the optimal values $R^*,L^*$, then we can follow the negative
gradient (derivatives) $\frac{d\mathcal{L}}{dR},\frac{d\mathcal{L}}{dL}$
downhill toward the optimal values step by step. This procedure can be
written as:

$$
\begin{bmatrix}
R_{n+1} \\ L_{n+1}
\end{bmatrix} =
\begin{bmatrix}
R_{n} - \alpha \frac{d\mathcal{L}}{dR}(R_n, L_n) \\
L_{n} - \alpha \frac{d\mathcal{L}}{dL}(R_n, L_n)
\end{bmatrix}
$$

where $\alpha$ is the size of the step to take at each iteration
(commonly called the “learning rate”). We can start at some guess of
parameters $[R_0, L_0]$ and then iteratively improve them to better
match our observed data.

We can write a simple version of this in Python as:

In [None]:
def gradient_descent(p0, df, alpha=0.1, max_iter=100):
    pstar = p0
    all_pstars = []
    for n in range(max_iter):
        pstar = pstar - alpha*df(pstar)
        all_pstars.append(pstar)
    return pstar, np.array(all_pstars)
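
As a quick check that this kind of loop behaves as expected, the sketch below runs the same update on a two-parameter function whose minimum is known in advance. It is self-contained, so it defines its own small loop (named `descend` here, a hypothetical name chosen to avoid clashing with `gradient_descent`), and the test function $g$ is our own choice.

```python
import numpy as np

def descend(p0, grad, alpha, n_steps):
    """Plain gradient descent: repeatedly step against the gradient."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = p - alpha * grad(p)
    return p

# g(p) = (p[0] - 1)^2 + 2*(p[1] + 3)^2 has its only minimum at (1, -3)
grad_g = lambda p: np.array([2 * (p[0] - 1), 4 * (p[1] + 3)])

p_min = descend([5.0, 5.0], grad_g, alpha=0.1, n_steps=200)
print(p_min)  # should be very close to (1, -3)
```

Because both parameters are stored in one array, a single update line moves them together, which is how the loop will later be used for $R$ and $L$.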

To visualize the idea behind this, consider finding an $x^*$ such that
$\frac{dy}{dx}(x^*) = 0$ for the function $y(x) = \sin(x^3 + 0.1x)$ on
the interval $\frac{-\pi}{2} \leq x \leq \frac{\pi}{3}$. Taking 100
steps of gradient descent on this function can be visualized as:

In [None]:
# Define our function and its derivative
y = lambda x: np.sin(x**3 + 0.1*x)
dydx = lambda x: np.cos(x**3 + 0.1*x)*(3*x**2 + 0.1)

# Make a guess at where the zero is
x_0 = np.pi/3

# Use gradient descent to iteratively improve our guess
best_xs, descent_xs = gradient_descent(x_0, dydx, 0.1, 100)

# Make a video showing how the iterations improved
fig = plt.figure()
xs = np.linspace(-np.pi/2,np.pi/3)
plots = [
    plt.plot(xs,y(xs),label="$y(x)$")[0],
    plt.scatter(descent_xs[0],y(descent_xs[0]),c='r',s=100,zorder=2,label="$x^*$"),
]
plt.legend()
def anim_func(i):
    plots[1].set_offsets([descent_xs[i],y(descent_xs[i])])
    return plots

anim = FuncAnimation(fig, anim_func, frames=range(len(descent_xs)), interval=100, blit=True)
plt.close()
HTML(anim.to_jshtml())

In this animation, the red dot represents the estimated value of $x^*$.
You can see that step by step, it “rolls” down the landscape toward a
minimum. The direction it rolls is given by the negative of the
derivative $dy/dx$, which points toward smaller values of $y$.

We would like to do exactly this for the RLC circuit, where the loss is a
function of two variables (the parameters $R$ and $L$).

## Example of parameter estimation

We have some data from our previous simulation and we know that the true
values are $R=1$ and $L=10$. If we take $R=8$ and $L=5$ as an initial
guess, we can plug our previous functions into the `gradient_descent`
function to get an estimate of the true values of $R$ and $L$.

In [None]:
# Some initial guesses
RL_0 = np.array([8, 5])

# The data we collected in our previous simulation
data = approx_solution.y[1,:]

# We want to have a function that only takes in R and L as inputs
# so, we fix Qhat and h
wrapped_dloss = lambda RL: dloss(data,RL[0],RL[1],1e-4)

# Find the gradient descent values
best_RL, descent_RLs = gradient_descent(RL_0, wrapped_dloss, 3e-3, 100)

The answer can be visualized in a similar way to the example above.
However, this time we are varying the parameters $R$ and $L$, so we have
a two-dimensional plot:

In [None]:
def plot_RL_descent(descent_RLs):
    # Map out the landscape
    max_R = max(abs(descent_RLs[:,0].max()+.5),5); min_R = min(abs(descent_RLs[:,0].min()-.5), .5)
    max_L = max(abs(descent_RLs[:,1].max()+.5),10.5); min_L = min(abs(descent_RLs[:,1].min()-.5), .5)
    Rs = np.linspace(min_R, max_R, 30)
    Ls = np.linspace(min_L, max_L, 30)
    Qs = np.zeros((30,30))
    for i in range(30):
        for j in range(30):
            Qs[j,i] = loss(Rs[i],Ls[j],data)

    fig = plt.figure()
    contours = plt.contourf(Rs,Ls,Qs,levels=30)
    plots = [
        plt.scatter(1,10,c='orange',s=100,zorder=2,label="Correct"),
        plt.scatter(descent_RLs[0][0],descent_RLs[0][1],c='r',s=100,zorder=3,label="Gradient descent"),
    ]
    plt.legend()
    # Attach the colorbar to the loss contours rather than to the scatter points
    plt.colorbar(contours, label="Loss")
    plt.xlabel("$R$"); plt.ylabel("$L$")
    def anim_func(i):
        plots[1].set_offsets([descent_RLs[i][0],descent_RLs[i][1]])
        return plots

    anim = FuncAnimation(fig, anim_func, frames=range(len(descent_RLs)), interval=100, blit=True)
    plt.close()
    return HTML(anim.to_jshtml())

In [None]:
plot_RL_descent(descent_RLs)

As we can see, the gradient is nice and smooth and everything works
great! Our approximation starts far away but converges quickly to the
correct values of $R$ and $L$.

### Things to watch out for

The above parameter estimation works swimmingly. However, there are
several places where numbers were selected without any justification. In
the following sections, we will go through each of these dangers
individually to explore how they can make parameter estimation
difficult.

#### Number of comparison points

At the beginning of this notebook, the variable `times` was arbitrarily
set to contain 200 points. This defines the number of time points at
which we compare the “true” data with the simulation data for different
values of $R$ and $L$. Though this seems like a minor detail, it can
have a large impact on the success of a gradient descent. In particular,
if we use too few points, there might not be enough information to
compute a useful gradient.

Let’s explore this:

In [None]:
# Using only 5 points for comparison
times = np.linspace(t0, tf, 5)
data = si.solve_ivp(f, (t0, tf), y0, args=ps, t_eval=times).y[1,:]
wrapped_dloss = lambda RL: dloss(data,RL[0],RL[1],1e-4)
_, descent_RLs_5_point = gradient_descent(RL_0, wrapped_dloss, 3e-3, 100)

plot_RL_descent(descent_RLs_5_point)

As you can see, with only five points at which to compare the
simulation with the real data, we struggle to converge, indicating a
weak gradient. You can observe in the loss landscape that the gradient
is tiny almost everywhere! This makes our iterations very slow to
approach the correct value.

If we increase that number, we can get improved results:

----
#### Exercise 2
Explore using more comparison points. Determine how many comparison points are needed to converge to the correct answer in 100 iterations. Then determine how many iterations it takes to converge to the correct point when using all the comparison points.

*Solution:*

This is somewhat of a trick question. The student should observe that it requires an outrageous number of iterations due to the other errors described below.

----

#### Initial guess

Another number that was arbitrarily chosen for the earlier parameter
estimation is the initial guess `RL_0`. After having seen the loss
landscapes in the previous animations, we can note that the speed at
which our gradient iteration approaches the correct value depends
significantly on the starting point. Furthermore, there are regions in
which it doesn’t make physical sense to start (such as for $R<0$).

Let’s explore how this starting point can impact our results:

In [None]:
# Reset our number of points
times = np.linspace(t0, tf, 200)
data = si.solve_ivp(f, (t0, tf), y0, args=ps, t_eval=times).y[1,:]

# Set an initial condition very far away
RL_0 = np.array([50, 80])
_, descent_RLs_RL0far = gradient_descent(RL_0, wrapped_dloss, 3e-3, 100)

plot_RL_descent(descent_RLs_RL0far)

Note that in this case, the descent seems like it is on a successful
track, but is moving slowly and from very far away. It is hard to say
how many iterations it will take for it to arrive at the correct values.

In [None]:
# Set an initial condition in a "bad" location
RL_0 = np.array([0.1, 5])
_, descent_RLs_RL0bad = gradient_descent(RL_0, wrapped_dloss, 3e-3, 100)

plot_RL_descent(descent_RLs_RL0bad)

As you can see, beginning too close to a region which is physically
impossible for the system results in enormous gradients that push our
iteration far away from the true values. This ultimately hampers our
ability to get accurate estimates of the parameters. To avoid this, it
is important to take some time to consider reasonable ranges for the
parameters and to try one or more initial guesses within them.
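
One simple way to try several initial guesses is to run the descent from each of them and keep the result with the lowest objective value. The sketch below illustrates this on a one-dimensional toy function with two minima. The function $g$, the starting points, and the helper name `descend_1d` are all assumptions made for illustration and are not part of the RLC example.

```python
def descend_1d(x0, grad, alpha, n_steps):
    """Plain gradient descent for a single scalar parameter."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

# Toy objective with a shallow minimum near x = 1.13 and a deeper one near x = -1.30
g = lambda x: x**4 - 3 * x**2 + x
dg = lambda x: 4 * x**3 - 6 * x + 1

starts = [-2.0, 0.0, 2.0]
finals = [descend_1d(x0, dg, alpha=0.01, n_steps=500) for x0 in starts]

# Keep the end point whose objective value is lowest
best = min(finals, key=g)
print("end points:", finals)
print("best:", best, "with objective value", g(best))
```

The run started at $x = 2$ settles in the shallower minimum, so looking at a single starting point would have been misleading here.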

#### Choosing step sizes

Core components of successful parameter estimation are the
derivative step size `h` and the gradient descent step size `alpha`.
Each of these affects the size of the gradient step and is vital for a
successful approximation. Given that they both have similar effects on
the results, we will limit our exploration to the iteration step size
`alpha`. If this step size is too small, the iteration will converge too
slowly and possibly never reach the true values. On the other hand, if
this step is too large, our iterations may step right past or over the
true values, which could result in our approximations approaching
infinity or yielding physically impossible parameter values.

This can be seen in the following two examples.

In [None]:
# Reset the initial condition
RL_0 = np.array([8, 5])

# Use a small step size for our gradient descent
_, descent_RLs_small_step = gradient_descent(RL_0, wrapped_dloss, 30*1e-6, 100)

plot_RL_descent(descent_RLs_small_step)

Notice how slowly the iteration progresses with this small step size. As
observed previously, this ultimately may mean that the procedure only
arrives at the true parameter values after a lot of computational time.

In [None]:
# Use a large step size for our gradient descent
_, descent_RLs_large_step = gradient_descent(RL_0, wrapped_dloss, 1e-2, 100)

plot_RL_descent(descent_RLs_large_step)

In contrast, this large step size results in erratic and incorrect
iterations which never arrive at the correct parameter values.
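
Both failure modes can be reproduced exactly on a one-dimensional quadratic. For $g(x) = x^2$ the gradient is $2x$, so each step is $x_{n+1} = x_n - 2\alpha x_n = (1 - 2\alpha)x_n$. The iteration shrinks toward the minimum at $0$ only when $|1 - 2\alpha| < 1$, i.e. $0 < \alpha < 1$. The sketch below, using a hypothetical helper `descend_quadratic`, tries one step size that is too small, one that works well, and one that is too large.

```python
def descend_quadratic(x0, alpha, n_steps):
    """Gradient descent on g(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * 2 * x
    return x

for alpha in [0.001, 0.4, 1.1]:
    print(f"alpha = {alpha}: x after 50 steps = {descend_quadratic(1.0, alpha, 50):.3e}")
```

With the smallest step the iterate has barely moved after 50 steps, with the middle one it is essentially at the minimum, and with the largest one it grows in magnitude at every step. For the RLC loss the safe range of `alpha` is not known in advance, which is why it has to be explored experimentally.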

#### Dealing with noisy data

A final consideration to take into account when looking at parameter
estimation is the quality of the data used for comparison. In the
examples above, the “measured” data was actually just a simulation with
some known parameter values. This means that the data is smooth, well
behaved, and is guaranteed to have a solution with our model. In real
world scenarios, the data is often polluted by some sort of measurement
noise, with some missing or completely incorrect values, and is actually
represented by a model more complicated than our simple estimate from
[Equation 4](#eq-rlc-diffeq).

We can get a simple example of this by adding some Gaussian noise to our
measurement data:

In [None]:
# Add 10% of the maximum of our data in Gaussian noise
noisy_data = data + 0.1 * data.max() * np.random.normal(size=data.shape)

# Show the new data
plt.plot(times, noisy_data); plt.title("Data with 10% noise"); plt.show()

# Gradient descent and plot
wrapped_dloss = lambda RL: dloss(noisy_data,RL[0],RL[1],1e-4)
_, descent_RLs_small_noise = gradient_descent(RL_0, wrapped_dloss, 3e-3, 100)

plot_RL_descent(descent_RLs_small_noise)

Note that this is a significant amount of noise and the data looks
fairly different. Yet still, on average the data shares the same general
characteristics and so the mean squared error difference is still a good
measure. However, the descent ultimately converges not quite to the
correct parameter values but nearby.

Let’s try with even more noise:

----

#### Exercise 3
What is the behavior of the iteration as more and more noise is added? Make a plot of the data with a larger amount of noise that demonstrates the described behavior. Include an animation like those above of the iteration.

*Solution:*


----

## Improving the gradient descent

In each of the previous subsections, we observed slow, erratic, or incorrect convergence of the parameters as they navigate the loss space during the gradient descent. Although we explored several ways to carefully select descent parameters to ensure reasonable convergence, we can also consider an improved descent procedure. There are a wide variety of popular choices for gradient descent methods, but they all generally revolve around two core ideas:

1. Momentum
2. Adaptive step sizes `alpha`

Each of these tries to overcome issues such as slow convergence or settling into incorrect locations.

### With momentum
The first approach is to incorporate momentum. This is most commonly applied if there are several minima but we are trying to arrive at the lowest of them all. Momentum would allow the iteration to enter a minimum but have its momentum carry it out and on to another, possibly lower minimum.

Though the loss landscape plotted above appears to have only one minimum, momentum could still help with our convergence.

To incorporate momentum, we can keep a running measure of the previous gradients and combine them with our current gradient.
Thus, if the previous gradients were big, even if our gradient is small, we should keep moving.

This can be written as:

$$
\begin{bmatrix}
R_{n+1} \\ L_{n+1}
\end{bmatrix} =
\begin{bmatrix}
R_{n} - \alpha \gamma_{n+1} \\
L_{n} - \alpha \delta_{n+1}
\end{bmatrix}
$$
where
$$
\begin{align*}
\gamma_{n+1} &= \beta\gamma_{n} + \frac{d\mathcal{L}}{dR}(R_n, L_n) \\
\delta_{n+1} &= \beta\delta_{n} + \frac{d\mathcal{L}}{dL}(R_n, L_n)
\end{align*}
$$
and $\gamma_0 = \delta_0 = 0$.

Notice that this does require a new parameter $\beta$ to determine how much importance to give to previous gradients.

We can adjust the previous `gradient_descent` function to easily incorporate this as a new function `gradient_descent_with_momentum` as follows:

In [None]:
def gradient_descent_with_momentum(p0, df, alpha=0.1, beta=0.3, max_iter=100):
    pstar = p0
    all_pstars = []
    sum_grad = 0
    for n in range(max_iter):
        sum_grad = beta*sum_grad + df(pstar)
        pstar = pstar - alpha*sum_grad
        all_pstars.append(pstar)
    return pstar, np.array(all_pstars)

We can use it just like we used the `gradient_descent` function:

In [None]:
RL_0 = np.array([8, 5])
data = approx_solution.y[1,:]
wrapped_dloss = lambda RL: dloss(data,RL[0],RL[1],1e-4)
best_RL, descent_RLs = gradient_descent_with_momentum(RL_0, wrapped_dloss, 3e-3, .7, 100)

plot_RL_descent(descent_RLs)

Notice how the momentum causes the iteration to rock back and forth like a ball rolling down a slope. The momentum also helps carry the values through the flat spots of the loss landscape to land right on the correct value!
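
The benefit of momentum in flat directions can also be seen on a simple function with a known minimum. The sketch below compares plain gradient descent (`beta=0`) with momentum (`beta=0.7`) on a quadratic that is steep in one direction and shallow in the other. The test function and the helper name `descend` are our own assumptions for illustration, and the update rule is the same one used in `gradient_descent_with_momentum`.

```python
import numpy as np

def descend(p0, grad, alpha, beta, n_steps):
    """Gradient descent with momentum; beta = 0 recovers plain gradient descent."""
    p = np.asarray(p0, dtype=float)
    velocity = np.zeros_like(p)
    for _ in range(n_steps):
        velocity = beta * velocity + grad(p)  # running, decayed sum of gradients
        p = p - alpha * velocity
    return p

# Gradient of g(p) = 0.5*(p[0]**2 + 25*p[1]**2): shallow in p[0], steep in p[1]
grad_g = lambda p: np.array([1.0, 25.0]) * p

start = [8.0, 5.0]
plain = descend(start, grad_g, alpha=0.03, beta=0.0, n_steps=100)
with_momentum = descend(start, grad_g, alpha=0.03, beta=0.7, n_steps=100)

# The minimum is at the origin, so the norm is the remaining distance
print("distance without momentum:", np.linalg.norm(plain))
print("distance with momentum:   ", np.linalg.norm(with_momentum))
```

With the same `alpha` and the same number of iterations, the run with momentum ends much closer to the minimum, because the accumulated gradient speeds up progress along the shallow direction.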

### Adaptive step size
The second common alteration to gradient descent is to adjust the step size `alpha` during the iteration.
One such method is called the **adagrad** method (which is short for *adaptive gradient*).
This simple adjustment to gradient descent keeps a running measure of the size of the gradients $\frac{d\mathcal{L}}{dR}$ and $\frac{d\mathcal{L}}{dL}$ and slowly decays the size of `alpha` using it.
It can be written as:

$$
\begin{bmatrix}
R_{n+1} \\ L_{n+1}
\end{bmatrix} =
\begin{bmatrix}
R_{n} - \frac{\alpha}{\sqrt{\gamma_{n+1}}} \frac{d\mathcal{L}}{dR}(R_n, L_n) \\
L_{n} - \frac{\alpha}{\sqrt{\delta_{n+1}}} \frac{d\mathcal{L}}{dL}(R_n, L_n)
\end{bmatrix}
$$
where
$$
\begin{align*}
\gamma_{n+1} &= \gamma_{n} + \left(\frac{d\mathcal{L}}{dR}(R_n, L_n)\right)^2 \\
\delta_{n+1} &= \delta_{n} + \left(\frac{d\mathcal{L}}{dL}(R_n, L_n)\right)^2
\end{align*}
$$
starting from $\gamma_0 = \delta_0 = 0$.

We can adjust the previous `gradient_descent` function to easily incorporate this as a new function `adagrad` as follows:

----
#### Exercise 4
Write a function `adagrad(p0, df, alpha=0.1, max_iter=100)` that implements the adagrad method. Use it to find the optimal $R$ and $L$ values as is done above and make an animation to demonstrate the solution. How large can `alpha` be in this new method?

*Solution:*


----