# Levenberg Marquardt

### Introduction: 

The [Levenberg Marquardt](https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm) algorithm is a basic, practical algorithm for solving [nonlinear least-squares](https://en.wikipedia.org/wiki/Non-linear_least_squares) problems. A more fitting name for this algorithm would be damped Gauss-Newton, since it combines Gauss-Newton and Gradient Descent. If you have not already, please see my tutorials for [Gauss-Newton](GaussNewton.ipynb) and [Gradient Descent](GradientDescent.md), as this tutorial builds heavily on concepts introduced there.

### Notation:

For clarity, this tutorial uses the following notation. Do not be alarmed if you are unfamiliar with any of the terms below; they will be introduced in this tutorial.

$$ m \ = \ number \ of \ samples  $$ 
$$ n \ = \ number \ of \ parameters $$ 
$$ \theta \ = \ model \ parameters $$ 
$$ r_{m,1} \ = \ residual \ = \ y - \hat y $$ 
$$ J_{m,n} \ = \ jacobian \ =  [ \nabla r ] $$ 
$$ p \ = \ step \ \  update \ \ to \ \  \theta  $$ 

### Theory: 

While [Gauss-Newton](GaussNewton.ipynb) (GN) is a basic nonlinear least squares (NLSQ) algorithm, it often fails to converge unless the initial guess is very close to the correct solution. Likewise, [Gradient Descent](GradientDescent.md) (GD) can also solve NLSQ problems, but it often takes far too many iterations to converge. Consequently, this algorithm attempts to blend the two in order to achieve the following goals:

* Robustness against a poor initial guess
* Relatively quick convergence / low total iteration count

Practically speaking, Levenberg Marquardt is the most basic **practical** algorithm for solving NLSQ problems, since the drawbacks of GN and GD often rule them out in industry. As mentioned previously, this algorithm combines GN and GD. Recall the GD and GN steps:

$$ p_{GD} = -\alpha \frac{J^T r}{ || J^T r ||} $$
$$ p_{GN} = -(J^T J)^{-1} J^T r  $$ 

The idea behind Levenberg Marquardt (LM) is to switch between these two steps based on how the residual $r_k$ responds to each step $p_k$. If the GN step reduces the error ($||r||$), it is best to use it; if it increases the error, it is better to take a GD step. Rather than hard switching between the two algorithms, LM introduces a damping variable $\lambda$ that is heuristically updated based on performance. The LM step is as follows:

$$ p_{LM} = -(J^T J + \lambda I)^{-1} J^T r $$

The rationale is that if $\lambda$ is large, $\lambda I$ dominates $J^T J$, so $(J^T J + \lambda I)^{-1}$ is approximately $\frac{1}{\lambda} I$, the identity matrix scaled by a small number. The step then becomes $p_{LM} \approx -\frac{1}{\lambda} J^T r$, a short step along the negative gradient, which is equivalent to a GD step. Likewise, if $\lambda$ is 0, it is exactly the GN step. The next question is how to update $\lambda$. This is done by adjusting it according to whether the error was reduced. If the model error at a given iteration ($k$) has decreased, reduce $\lambda$ ($\lambda_{k+1} = 0.1 \cdot \lambda_k$) and apply the step ($\theta_{k+1} = \theta_k + p_k$). If the model error has increased, increase $\lambda$ ($\lambda_{k+1} = 10 \cdot \lambda_k$) and do not apply the step ($\theta_{k+1} = \theta_k$). Altogether, the algorithm can be summarized in the following steps:

1. Calculate the residual: $r(\theta_k)$.
2. Calculate the jacobian: $J(\theta_k)$.
3. Calculate the LM step: $p_k$.
4. Calculate the new cost: $||r(\theta_k + p_k)||$.
5. If $||r(\theta_k + p_k)|| < ||r(\theta_k)||$, do sub-step 1; otherwise, do sub-step 2:
   1. Decrease lambda: ($\lambda_{k+1} = 0.1 \cdot \lambda_k$). Update theta: ($\theta_{k+1} = \theta_k + p_k$).
   2. Increase lambda: ($\lambda_{k+1} = 10 \cdot \lambda_k$). Do not update theta: ($\theta_{k+1} = \theta_k$).
6. Determine if algorithm should terminate. 
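
The two limits of $\lambda$ discussed before the step list (large $\lambda$ behaves like GD, $\lambda = 0$ is GN) can be checked numerically. The following is only a minimal sketch: the matrix `J` and vector `r` are small made-up values chosen for illustration, and `lmStep`, `lambdaBig`, `errGD` and `errGN` are names introduced here.

```matlab
% Small made-up problem (arbitrary values, for illustration only)
J = [1 2; 3 4; 5 6];      % m = 3 samples, n = 2 parameters
r = [1; -1; 2];           % residual vector
g = J'*r;                 % gradient of 0.5*||r||^2

% LM step as a function of the damping term lambda
lmStep = @(lambda) -((J'*J + lambda*eye(2)) \ g);

% Large lambda: the step approaches the scaled negative gradient -(1/lambda)*g
lambdaBig = 1e8;
pBig  = lmStep(lambdaBig);
errGD = norm(pBig - (-g/lambdaBig)) / norm(g/lambdaBig);

% lambda = 0: the step is the Gauss-Newton step
pGN   = -((J'*J) \ g);
errGN = norm(lmStep(0) - pGN);

fprintf('relative difference to scaled GD step: %g\n', errGD);
fprintf('difference to GN step:                 %g\n', errGN);
```

The relative difference to the GD direction shrinks roughly in proportion to $1/\lambda$, which is why a large damping term makes LM behave like a cautious gradient step.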

Here is MATLAB/Octave code implementing the above algorithm:

```matlab 
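function theta = LMS(Rfnc,params,iterations)
% Rfnc       - handle to the residual function r(theta)
% params     - initial guess for theta (column vector)
% iterations - fixed number of iterations to run

lambda = 1;                       % damping term
theta = params;
oldCost = norm(Rfnc(theta));

for i = 1:iterations
    r = Rfnc(theta);
    J = Jf(Rfnc,theta);           % finite-difference Jacobian of the residual
    p = -pinv(J'*J + lambda*eye(length(params)))*J'*r;   % LM step
    newCost = norm(Rfnc(theta+p));
    if(newCost<oldCost)
        % Step reduced the error: accept it and move toward Gauss-Newton
        theta = theta+p;
        oldCost = newCost;
        lambda = 0.1*lambda;
    else
        % Step increased the error: reject it and move toward Gradient Descent
        lambda = 10*lambda;
    end
end

end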
```

In this algorithm, `Rfnc` is a function handle to the residual function, which should be $r(\theta) = y - f(\theta)$. `Jf` is a function that calculates the Jacobian. For simplicity, the Jacobian is calculated here through finite differences rather than an analytical derivative. Naturally, this can be more computationally expensive. The code for `Jf` is below.
```matlab
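function J=Jf(fnc,params)
% Forward-difference approximation of the Jacobian of fnc at params

h = 1e-8;                 % step size (named h to avoid shadowing the built-in eps)
x1 = fnc(params);         % function value at the unperturbed parameters
m = size(x1,1);
n = length(params);
J = zeros(m,n);

for i = 1:n
    paramCpy = params;
    paramCpy(i) = paramCpy(i) + h;        % perturb one parameter at a time
    J(:,i) = (fnc(paramCpy) - x1)/h;
end

end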
```
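
The `LMS` function above always runs a fixed number of iterations. Step 6 of the algorithm (deciding when to terminate) could instead be handled with a stopping test. The following is one possible sketch, not the only way to do it: the tolerances `tolStep` and `tolGrad`, the cap `maxIter`, and the exponential test problem are all assumptions introduced for this illustration, and an analytic Jacobian is used so the script is self-contained.

```matlab
% Hypothetical test problem: fit y = a*exp(b*t) to noise-free data
t = (0:0.25:2)';
y = 2*exp(-1.5*t);
Rfnc = @(T) y - T(1)*exp(T(2)*t);                     % residual r(theta)
Jfnc = @(T) [-exp(T(2)*t), -T(1)*t.*exp(T(2)*t)];     % analytic Jacobian of r

theta   = [1; 0];      % initial guess
lambda  = 1;           % damping term
tolStep = 1e-10;       % assumed tolerance on the relative step size
tolGrad = 1e-10;       % assumed tolerance on the gradient norm
maxIter = 200;         % safety cap on the iteration count

oldCost = norm(Rfnc(theta));
for k = 1:maxIter
    r = Rfnc(theta);
    J = Jfnc(theta);
    g = J'*r;
    if norm(g) < tolGrad
        break          % gradient is essentially zero: stationary point
    end
    p = -pinv(J'*J + lambda*eye(numel(theta)))*g;
    if norm(p) < tolStep*(norm(theta) + tolStep)
        break          % step no longer changes theta meaningfully
    end
    newCost = norm(Rfnc(theta + p));
    if newCost < oldCost
        theta = theta + p;
        oldCost = newCost;
        lambda = 0.1*lambda;
    else
        lambda = 10*lambda;
    end
end
fprintf('stopped after %d iterations, theta = [%g, %g]\n', k, theta(1), theta(2));
```

Testing the gradient norm catches convergence to a stationary point, while the step-size test catches the case where $\lambda$ has grown so large that further steps are negligible.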

### Example 1: 

As an example, consider a model of a sinusoid with unknown amplitude and frequency.

$$ f(t) = \theta_1 \sin( \frac{t}{ \theta_2}) $$

This model conveniently has only two parameters (for plotting purposes) and is non-linear with respect to those parameters. The non-linearity is apparent when observing the Jacobian.

$$ \frac{\partial f}{\partial \theta} = \bigg[ \sin(\frac{t}{ \theta_2}) \ , \ -\frac{\theta_1 t}{ \theta_2^2}  \cos( \frac{t}{\theta_2})\bigg] $$
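
The partial derivatives of this model can be checked against a finite-difference approximation. The sketch below differentiates $f = \theta_1 \sin(t/\theta_2)$ by the chain rule, which gives $-\theta_1 t / \theta_2^2 \cdot \cos(t/\theta_2)$ for the $\theta_2$ column, and compares it with central differences. The test point `theta`, the step `h`, and the variable names are assumptions made for this illustration. Note that the Jacobian of the residual $r = y - f$ used by LM is the negative of this matrix.

```matlab
t = (-20:0.5:20)';
f = @(T) T(1)*sin(t/T(2));                                   % model
dfAnalytic = @(T) [sin(t/T(2)), -(T(1)*t/T(2)^2).*cos(t/T(2))];

theta = [5; 5];          % arbitrary test point
h = 1e-6;                % finite-difference step
dfNumeric = zeros(numel(t), numel(theta));
for i = 1:numel(theta)
    d = zeros(size(theta));
    d(i) = h;                                                % perturb one parameter
    dfNumeric(:,i) = (f(theta + d) - f(theta - d)) / (2*h);  % central difference
end

dfExact = dfAnalytic(theta);
maxErr = max(abs(dfNumeric(:) - dfExact(:)));
fprintf('largest difference between analytic and numeric: %g\n', maxErr);
```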

This non-linearity / non-convexity is further observed in a plot of the cost function:

<p align='center'><img src='Images/Nonlinear/GN_3.png'></p>

Previously, this problem was evaluated in my section on [Gauss-Newton](GaussNewton.ipynb), where it was stated that the initial guess ($\theta_0$) must be close to the solution in order to converge. In contrast, the Levenberg Marquardt algorithm is much more robust. The following is an example with an initial guess of $\theta_0 = [100;100]$.

<p align='center'><img src='Images/Nonlinear/LM_1.png'></p>

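The following is the code. Note that you will need to create the `LMS` function as well as the `Jf` function from the code provided above. `gaussNewtonNL` is not defined in this tutorial; it is assumed to be the Gauss-Newton implementation from the [Gauss-Newton](GaussNewton.ipynb) tutorial.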

```matlab
clc
t = (-20:0.5:20)';
y = 5*sin(-t/5);
noise = 1*randn(length(t),1);
yM = y + noise; % This is the measurement. 
iterations = 50; % Fixed iteration length 
theta = [100;100];
rFncs = @(T) yM - (T(1)*sin(-t/T(2))); 

thetaGN = gaussNewtonNL(rFncs,theta,iterations);
thetaLM = LMS(rFncs,theta,iterations);

fig1 = figure(1);
clf(fig1);
hold on
scatter(t,yM);   % yM is the noisy measurement defined above
plot(t,y)
plot(t,thetaGN(1)*sin(-t/thetaGN(2)) )
plot(t,thetaLM(1)*sin(-t/thetaLM(2)) )

grid on
set(gca,'FontSize',10,'FontWeight','bold');
set(gcf,'Units','Pixels');
set(gcf, 'Position', [2500, 500, 750, 450]);
legend('Measured y', 'True y','Gauss Newton','Levenberg Marquardt')
title(['NLSQ: \theta_1 * sin(t / \theta_2)',' . \theta_0 = [',num2str(theta(1)),',',num2str(theta(2)),']'])
```

Of course, this single example does not prove much; however, it illustrates a case where LM converged and GN did not. To further illustrate this point, consider sweeping the initial guess through a range of values and taking a statistical sampling of convergence. In the following example, the initial guess for $\theta_2$ is swept from -2000 to 3000. Each experiment was performed 100 times to ensure convergence was not due to noise.

<p align='center'><img src='Images/Nonlinear/LM_2.png'></p>

The results rather painfully demonstrate the inadequacy of GN in practical applications. In contrast, in this specific example, LM's initial guess of $\theta_2$ could be off by 3 orders of magnitude. In practice, how far off the initial guess can safely be is dictated by the convexity of the cost surface. Nonetheless, it demonstrates the potential reliability of LM if the problem is set up well and a somewhat intelligent initial guess is made.
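
A sweep of this kind could be set up roughly as follows. This is a sketch under several assumptions rather than the script that produced the plot: the sweep values in `theta2Sweep`, the fixed initial guess `theta1Init`, the reduced trial count, and especially the convergence test (final cost within 5% of the noise norm) are all choices made here for illustration. The LM loop is written inline with an analytic residual Jacobian so the script runs on its own.

```matlab
t = (-20:0.5:20)';
thetaTrue = [5; 5];
model = @(T) T(1)*sin(-t/T(2));                       % same model as the example code
% Analytic Jacobian of the residual r = yM - model(T)
Jres  = @(T) [-sin(-t/T(2)), -T(1)*cos(-t/T(2)).*t/T(2)^2];

theta2Sweep = [-2000 -1000 -100 -10 10 100 1000 3000];   % assumed sweep values
theta1Init  = 100;     % assumed initial guess for theta_1
nTrials     = 20;      % the experiment in the text used 100; fewer here for speed
iterations  = 50;
successRate = zeros(size(theta2Sweep));

for s = 1:numel(theta2Sweep)
    nConverged = 0;
    for trial = 1:nTrials
        noise = randn(length(t),1);
        yM = model(thetaTrue) + noise;                % fresh noisy measurement
        rFnc = @(T) yM - model(T);

        % Levenberg Marquardt loop, same update rule as LMS
        theta = [theta1Init; theta2Sweep(s)];
        lambda = 1;
        oldCost = norm(rFnc(theta));
        for k = 1:iterations
            r = rFnc(theta);
            J = Jres(theta);
            p = -pinv(J'*J + lambda*eye(2))*J'*r;
            newCost = norm(rFnc(theta + p));
            if newCost < oldCost
                theta = theta + p;
                oldCost = newCost;
                lambda = 0.1*lambda;
            else
                lambda = 10*lambda;
            end
        end

        % Assumed convergence test: the fit reached the noise floor
        if oldCost <= 1.05*norm(noise)
            nConverged = nConverged + 1;
        end
    end
    successRate(s) = nConverged/nTrials;
end

disp([theta2Sweep' successRate']);    % columns: initial theta_2, fraction converged
```

Because the noise is random, the success rates vary from run to run, and a different convergence test would shift them as well.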




