# Gauss-Newton

Gauss-Newton is an algorithm for solving [non-linear least squares](https://en.wikipedia.org/wiki/Non-linear_least_squares) problems. It is an [SQP](https://en.wikipedia.org/wiki/Sequential_quadratic_programming)-style method, which means it is an iterative algorithm and that each iteration operates on a [QP](https://en.wikipedia.org/wiki/Quadratic_programming) subproblem.

## Introduction: 

Suppose one has a residual function $r(\theta)$ that represents the error of each data point with respect to the model parameters $\theta$. This function may be nonlinear and its behavior may be unpredictable. The idea of Gauss-Newton is to approximate the cost built from this residual (the sum of squared residuals) with a quadratic. This quadratic is formulated using the Taylor series expansion of a multivariable function. Since the residual represents error, the overall goal is to minimize that cost. As we know from basic calculus, taking the derivative of a function and setting it equal to zero yields an extremum (a possible local or global maximum or minimum) of that function.

### Theory:

The following is the second-order Taylor series expansion of a multivariable function $f$ about a point $a$:

$$ T(x) = f(a) + (x-a)^T \nabla f(a) + \frac{1}{2!}(x-a)^T (\nabla^2 f(a))(x-a)   $$

Next, substitute $a = \theta$ and $x = \theta + p$, and the expression becomes:

$$ T(\theta + p) = f(\theta) + p^T \nabla f(\theta) + \frac{1}{2!}p^T (\nabla^2 f(\theta))p   $$

Recall that to minimize a function with respect to a parameter $p$, one takes the derivative with respect to $p$, sets it equal to zero, and solves for $p$. In this case, we are solving for the step $p$ that takes the quadratic model to its minimum.

$$ 0 = \nabla f(\theta) + \nabla^2 f(\theta) p$$

$$ p = -(\nabla^2 f(\theta))^{-1} \nabla f(\theta)  $$

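Taking the cost to be $f(\theta) = \frac{1}{2} r(\theta)^T r(\theta)$ and letting $J$ be the Jacobian of $r$ with respect to $\theta$, the gradient is exact and the Hessian is approximated by dropping the second-derivative terms of the residual: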

$$ \nabla f = J^T r $$

$$ \nabla^2 f \approx J^T J $$

Which yields the following step:

$$ p = -(J^T J)^{-1} J^T r $$

$$ \theta_{k+1} = \theta_{k} + p $$
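
For reference, the gradient and Hessian expressions above can be derived in a few lines. This sketch assumes the conventional cost, half the sum of squared residuals, and uses $r_i$ for the $i$-th residual and $J_{ij} = \partial r_i / \partial \theta_j$ (index notation introduced here for illustration):

$$ f(\theta) = \frac{1}{2} r(\theta)^T r(\theta) = \frac{1}{2} \sum_i r_i(\theta)^2 $$

$$ \nabla f = \sum_i r_i \nabla r_i = J^T r $$

$$ \nabla^2 f = \sum_i \nabla r_i \nabla r_i^T + \sum_i r_i \nabla^2 r_i = J^T J + \sum_i r_i \nabla^2 r_i \approx J^T J $$

The dropped term $\sum_i r_i \nabla^2 r_i$ is small when the residuals are small or nearly linear in $\theta$, which is when the approximation is most trustworthy.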

### Implementation:

It follows that if one iteratively updates $\theta$ by $p$, one approaches a minimum of the error (in general, a local one). The following is some basic code to implement this concept in MATLAB/Octave:

```matlab
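function theta = gaussNewton(Rfnc,theta,itLimit)
% Rfnc: handle returning the residual vector, theta: initial guess, itLimit: number of iterations
for i = 1:itLimit
    J = Jf(Rfnc,theta); % Calc J (Jacobian of the residual, by finite differences)
    r = Rfnc(theta); % Calc r (residual at the current estimate)
    g = (J')*r; % Gradient of 0.5*r'*r
    Hinv = pinv((J')*J); % Inverse of the approximate Hessian J'*J
    p = -Hinv*g; % Calc the Gauss-Newton step
    theta = theta + p; % Update theta
end
end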
```
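
As a hedged aside, the step $p = -(J^T J)^{-1} J^T r$ does not require forming or inverting $J^T J$ explicitly. When $J$ has full column rank, the least-squares solve `J \ r` gives the same step and is generally better conditioned. The sketch below compares the two on a small made-up problem. The data, the model, and the names `p_normal` and `p_ls` are assumptions for illustration and are not part of the function above.

```matlab
% Illustrative data only: a linear-in-parameters model sampled at a few points
t = (0:0.5:5)';
A = [sin(t), ones(size(t))];      % model output is A*theta
thetaTrue = [5; 5];
theta = [1; 2];                   % current estimate

r = A*thetaTrue - A*theta;        % residual: noise-free data minus model
J = -A;                           % Jacobian of that residual w.r.t. theta

p_normal = -pinv(J'*J)*(J'*r);    % step via the normal equations
p_ls     = -(J \ r);              % same step via a least-squares solve

fprintf('difference between the two steps: %g\n', norm(p_normal - p_ls));
```

Because this toy problem is linear and noise-free, `theta + p_ls` should land on `thetaTrue` up to rounding.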

Note that `gaussNewton` makes use of `Jf`, which is a function to compute the Jacobian through [finite differences](https://en.wikipedia.org/wiki/Finite_difference). The code for this is as follows:

```matlab 
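function J=Jf(fnc,params)
% Forward-difference approximation of the Jacobian of fnc at params

h = 1e-8; % Perturbation size (named h so the built-in eps is not shadowed)
x1 = fnc(params); % Function value at the unperturbed parameters
m = size(x1,1);
n = length(params);
J = zeros(m,n);

for i = 1:n
    paramCpy = params; 
    paramCpy(i)= paramCpy(i) + h; % Perturb one parameter at a time
    J(:,i) = (fnc(paramCpy) - x1)/h;
end

end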
```
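
Forward differences have a truncation error proportional to the perturbation size. A common alternative, sketched below under the assumption that two function evaluations per parameter are affordable, is the central difference, whose truncation error shrinks with the square of the perturbation. The test function `fnc`, the step `h`, and the point `p0` are hypothetical choices for illustration.

```matlab
% Hypothetical test function with a known Jacobian: [2*p1, 0; p2, p1]
fnc = @(p) [p(1)^2; p(1)*p(2)];
p0 = [2; 3];
h = 1e-5;                          % assumed perturbation size

n = numel(p0);
m = numel(fnc(p0));
Jc = zeros(m, n);
for i = 1:n
    e = zeros(n, 1);
    e(i) = h;                      % perturb only parameter i
    Jc(:, i) = (fnc(p0 + e) - fnc(p0 - e)) / (2*h);   % central difference
end

Jexact = [2*p0(1), 0; p0(2), p0(1)];
fprintf('max abs error: %g\n', max(abs(Jc(:) - Jexact(:))));
```

The trade-off is cost: this variant calls the function twice per parameter, whereas the forward-difference version reuses one baseline evaluation.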


#### Limitations: 

It is important to note that this method is intrinsically imperfect for the following reasons:
1. A Taylor series expansion is an approximation. It becomes less accurate the further $x$ is from $a$ (that is, the larger the step $p$).
2. This method relies on an initial guess $\theta_0$. The performance of the algorithm depends on that initial guess.
3. $J^T J$ is an approximation of the Hessian, and $J$ itself is often approximated as well (for example by finite differences).
4. This method often operates on nonlinear objective functions that are not convex.

Practically speaking, these limitations mean the following:
1. With a poor initial guess, the algorithm may not converge. 
2. With a poor initial guess, the algorithm may not reach a desired global optimum. 
3. With a highly nonlinear function, the algorithm may be incapable of reaching the desired global optimum.

### Example 1: Linear Example

In the linear case, Gauss-Newton converges in one step: when the residual is linear in $\theta$, the cost is exactly quadratic, so the quadratic approximation models it perfectly. Essentially it is no different from solving the normal equations. So consider the following basic linear model:

$$ f(t) = \theta_1 \cdot \sin(t) + \theta_2 $$


Next, take the gradient of this function with respect to $\theta$:

$$ \frac{\partial f}{\partial \theta} = \bigg[ \sin(t) \ , \ 1 \bigg] $$

Clearly, this is a linear, convex problem since the Jacobian / gradient does not depend on the parameters $\theta$. Now consider an example where the true model is $[\theta_1, \theta_2] = [5,5]$ and where the initial guess is $\theta_0 = [-10,10]$. Below are the output and the basic code of this example:

<p align='center'><img src='Images/Nonlinear/GN_1.png'></p>

The plot above shows the result of Gauss-Newton after 10 iterations, though it had already converged on the first iteration because the problem is linear and convex. The following cost plot further demonstrates its convex nature:

<p align='center'><img src='Images/Nonlinear/GN_2.png'></p>

Here is the code that generated these plots: 

```matlab
clc
t = (0:0.1:10)'; % This is the time samples 
y = 5*sin(t)+5; % This is the true model
noise = 2*randn(length(t),1); % Noise model 
yM = y + noise; % This is the measurement. 

rFncs = @(T) yM - (T(1)*sin(t)+T(2)); 
theta = [-10;10];
theta = gaussNewton(rFncs,theta,10);

theta1 = 0.1:0.1:10;
theta2 = 0.1:0.1:10;
Z = zeros(length(theta1),length(theta2));
for i =1:length(theta1)
    for j = 1:length(theta2)
       Z(i,j) =  norm(rFncs([theta1(i);theta2(j)]));
    end
end

fig1 = figure(1);
clf(fig1);
hold on
scatter(t,yM);
plot(t,y)
plot(t, theta(1)*sin(t)+theta(2))

grid on
set(gca,'FontSize',10,'FontWeight','bold');
set(gcf,'Units','Pixels');
set(gcf, 'Position', [2500, 500, 750, 450]);
legend('Measured y', 'True y','Gauss Newton')
title('Gauss Newton: \theta_1 sin(t) + \theta_2')

fig2 = figure(3);
surf(theta1,theta2,Z','EdgeColor','none'); % Z(i,j) is indexed (theta1,theta2); surf expects rows along y
xlabel('\theta_1')
ylabel('\theta_2')
title('Cost Function')
```
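
The one-step claim can be checked directly. For a model that is linear in the parameters, a single Gauss-Newton step from any starting point should coincide with the ordinary least-squares solution. The sketch below uses a deterministic stand-in for the noise (an assumption, so that the result is repeatable) and an analytic Jacobian instead of `Jf`. The names `A`, `thetaLS`, and `thetaGN` are introduced here for illustration.

```matlab
t  = (0:0.1:10)';
yM = 5*sin(t) + 5 + 0.5*cos(7*t);     % assumed deterministic "noise" on the true model
A  = [sin(t), ones(size(t))];         % design matrix: model output is A*theta

thetaLS = A \ yM;                     % ordinary least-squares solution

theta0 = [-10; 10];                   % same initial guess as in the example
r = yM - A*theta0;                    % residual at the initial guess
J = -A;                               % Jacobian of the residual w.r.t. theta
thetaGN = theta0 - (J'*J) \ (J'*r);   % one Gauss-Newton step

fprintf('difference after one step: %g\n', norm(thetaGN - thetaLS));
```

Algebraically the initial guess cancels out of the update, which is why further iterations change nothing in the linear case.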


### Example 2: Nonlinear model 

Consider the following example where one must identify $\theta_1, \theta_2$ of the following function:

$$ f(t) = \theta_1 \sin\left( \frac{t}{\theta_2} \right) $$

Unfortunately, this parameter identification is a nonlinear estimation problem. The gradient of this function with respect to $\theta$ is as follows, where the factor $t$ in the second entry comes from the chain rule:

$$ \frac{\partial f}{\partial \theta} = \bigg[ \sin\left(\frac{t}{\theta_2}\right) \ , \ -\frac{\theta_1 t}{\theta_2^2} \cos\left(\frac{t}{\theta_2}\right) \bigg] $$
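
A quick numerical sanity check of an analytic gradient can catch chain-rule slips. The sketch below compares a forward-difference Jacobian of the model against the closed form, in which the $\theta_2$ column carries a factor of $t$ from the chain rule. The evaluation point, the step `h`, and the variable names are assumptions for illustration.

```matlab
t = (0:0.1:10)';
fnc = @(T) T(1)*sin(t/T(2));           % model f(t) = theta1*sin(t/theta2)
theta = [5; 5];                        % assumed evaluation point
h = 1e-6;                              % assumed perturbation size

% Forward-difference Jacobian, one column per parameter
n = numel(theta);
f0 = fnc(theta);
Jfd = zeros(numel(t), n);
for i = 1:n
    e = zeros(n, 1);
    e(i) = h;
    Jfd(:, i) = (fnc(theta + e) - f0) / h;
end

% Closed-form Jacobian: [sin(t/theta2), -theta1*t/theta2^2 * cos(t/theta2)]
Jan = [sin(t/theta(2)), -theta(1)*t/theta(2)^2 .* cos(t/theta(2))];

maxErr = max(abs(Jfd(:) - Jan(:)));
fprintf('max abs difference: %g\n', maxErr);
```

A difference on the order of the perturbation size suggests the two agree. A difference of order one usually points to a missing factor in the analytic form.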

Because the gradient/Jacobian depends on the parameters themselves, this is a nonlinear and, in this case, non-convex parameter estimation problem. Consequently, it is a suitable candidate for Gauss-Newton. Consider an example where the true model parameters are again $[\theta_1, \theta_2] = [5,5]$ with an initial guess of $\theta_0 = [1,1]$. If one makes the simple modification to the previously presented code, they will note that the code **fails to converge to the correct parameters**. This is because the estimation problem is **non-convex**, which means the algorithm may converge to a suboptimal solution depending upon the initial guess. The non-convexity is obvious when you look at the cost plot. SQP-style optimization algorithms are like dropping a marble onto a surface the shape of the cost function. If the marble is not started in the right place, it may come to rest at a place with more cost than the global optimum.

<p align='center'><img src='Images/Nonlinear/GN_3.png'></p>

This is where engineering intuition comes into play. Coming up with a good initial guess is usually possible and is the art that makes non-linear optimization reliable. In this case, the cost plot suggests that if the initial guess of $\theta_2$ is large, the algorithm converges to the global optimum. For completeness, here is an example where $\theta_0 = [10,10]$.

<p align='center'><img src='Images/Nonlinear/GN_4.png'></p>

As observed, the algorithm converges. However, if you play around with the code, you will find that the algorithm is very sensitive to the initial guess. This is the weakness of Gauss-Newton, which is why at a bare minimum you should use the [Levenberg-Marquardt](https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm) algorithm. That algorithm is simply a damped version of Gauss-Newton that limits the size of the steps based upon the performance of the steps. I will document that algorithm next.



