## Introduction to the Inverse Problem
Notes are from _Parameter Estimation and Inverse Problems_, 3rd ed., by Aster et al.

---
### Ch. 1 - Intro

In engineering problems, the physical parameters characterizing a model, $m$, are related to observational data (or a function of the data), $d$, by:

$$
G(m)=d
$$

where $G$ is an operator when $m,d$ are functions, and is a function when $m,d$ are vectors.

* The **forward problem** is to find $d$ given $m$, like solving an ODE/PDE or evaluating an integral.
* The **inverse problem** is to find $m$ given $d$.
* The **model identification problem** is to find $G$ given $m$ and $d$. This relates to finding the right kernels in Gaussian processes (GPs).

Let the $n$-element vector $M$ hold the model parameters, and let the $m$-element vector $D$ hold the data set (here $m$ counts the data points, not the model). Discrete problems of this form are **parameter estimation problems**:

$$
G(M)=D
$$

<br><br>

Example 1: Parabolic trajectory problem. Given data points $(t_i,y_i)$, find the model parameters $m_1,m_2,m_3$ such that $y(t) = m_1+m_2t-(1/2)m_3t^2$. Here there are $n=3$ parameters and $m$ observations, which we consolidate in an $m\times n$ design matrix whose $i$th row is $[1,\ t_i,\ -t_i^2/2]$. When $m>n$, the $m$ constraint equations may be **inconsistent**, meaning that no model fits the data exactly (as interpolation would), because of noise in $y$ and/or $t$ or other factors we haven't considered. We must then approximate the data with a best fit, where "best" depends on the chosen misfit measure. Linear regression commonly uses least squares, which minimizes the **2-norm** (Euclidean length) of the residuals (observed minus predicted). However, minimizing the **1-norm** (sum of absolute values) is more resistant to outliers, making the estimation more _robust_.
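The inconsistency described in Example 1 can be checked numerically. The sketch below builds the design matrix for the parabola and compares the rank of $G$ with the rank of the augmented matrix $[G \mid y]$; the system has an exact solution only if the two ranks agree. The parameter values, observation times, and noise level are hypothetical choices made for illustration, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters (assumed for illustration): m1, m2, m3.
m_true = np.array([10.0, 100.0, 9.8])

t = np.arange(1.0, 11.0)  # 10 observation times, so m = 10 > n = 3
G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])  # m x n design matrix

sigma = 8.0  # assumed noise standard deviation
y = G @ m_true + sigma * rng.standard_normal(t.size)

# G M = y has an exact solution only if appending y to G does not raise the rank.
rank_G = np.linalg.matrix_rank(G)
rank_aug = np.linalg.matrix_rank(np.column_stack([G, y]))
print("rank(G) =", rank_G, " rank([G | y]) =", rank_aug)
```

With noisy data the augmented rank exceeds the rank of $G$, so no parameter vector reproduces every observation and a best-fit criterion is needed.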

<br>

Example 2: _Fredholm integral equation of the first kind_: $d(t_i) = \int_a^b g(t_i,s)\, m(s)\, ds$, where we're given the kernel $g$ and the data $d(t_i)$ at the points $t_i$, $i=1,\ldots,m$, and want to find $m(s)$. If we let $s_j=a+\Delta s/2 + (j-1)\Delta s$, $j=1,\ldots,n$, where $\Delta s = (b-a)/n$, then the midpoint rule approximates the integral as

$$
d(t_i) \approx \sum_{j=1}^n g(t_i,s_j) m(s_j)\Delta s
$$

Letting $G$ be the $m\times n$ matrix with components $G_{i,j}=g(t_i,s_j)\Delta s$, and letting the vectors $m$ and $d$ have components $m_j=m(s_j)$ and $d_i=d(t_i)$, we have the linear system $Gm=d$.
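A minimal sketch of this midpoint-rule discretization follows. The helper `midpoint_discretize` and the Gaussian kernel are hypothetical, chosen only because the resulting integral has a closed form to check against; the quadrature nodes are stored in the array `s`.

```python
import math
import numpy as np

def midpoint_discretize(kernel, t_obs, a, b, n):
    """Return (G, s) with G[i, j] = kernel(t_obs[i], s[j]) * ds (midpoint rule)."""
    ds = (b - a) / n
    s = a + ds / 2 + np.arange(n) * ds  # midpoints of n equal subintervals of [a, b]
    G = kernel(t_obs[:, None], s[None, :]) * ds
    return G, s

def gaussian_kernel(t, s):
    """Hypothetical smooth kernel, assumed for illustration only."""
    return np.exp(-((t - s) ** 2))

t_obs = np.linspace(0.0, 1.0, 5)  # m = 5 observation points
G, s = midpoint_discretize(gaussian_kernel, t_obs, 0.0, 1.0, 200)

# Forward-problem check with m(s) = 1, where the integral is known in closed form:
# int_0^1 exp(-(t - s)^2) ds = (sqrt(pi) / 2) * (erf(1 - t) + erf(t)).
d_approx = G @ np.ones_like(s)
d_exact = np.array(
    [0.5 * math.sqrt(math.pi) * (math.erf(1.0 - ti) + math.erf(ti)) for ti in t_obs]
)
print(np.max(np.abs(d_approx - d_exact)))
```

Increasing `n` shrinks the quadrature error of the forward problem, but it also adds unknowns to the inverse problem.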



---
### Ch. 2 - Linear Regression
#### L2 Regression
The problem of finding a parameterized curve to fit a dataset is **regression**. When the model is linear in parameters, it's a **linear regression** problem.

Say we have an $m$-element data vector $D$ of observations ($y$) and an $n$-element parameter vector $M$. The forward problem is the linear system

$$
GM = D
$$

We assume $G$ has full column rank (rank $n$) so that the least squares solution is unique. Since $m>n$, the system generally has no exact solution and only approximate solutions can be found (hence regression and not interpolation). The **residual vector** between the observations and the model predictions is $R=D-GM$. We commonly use the 2-norm of the residuals to measure the misfit; the model that minimizes it is the **least squares solution**, which is _unbiased_ when the data errors have zero mean. It is given by the normal equations:

$$
\hat{M}_{L2} = (G^TG)^{-1}G^TD
$$
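As a sketch, the formula above can be evaluated by solving the normal equations $(G^TG)M=G^TD$ rather than forming the inverse explicitly. The function name and the synthetic data are hypothetical. The result is compared with `np.linalg.lstsq`, and the residual is checked to be orthogonal to the columns of $G$, which is the defining property of the least squares solution.

```python
import numpy as np

def least_squares_normal_equations(G, D):
    """Solve (G^T G) M = G^T D; assumes G has full column rank."""
    return np.linalg.solve(G.T @ G, G.T @ D)

# Synthetic overdetermined system (assumed for illustration): m = 20, n = 3.
rng = np.random.default_rng(1)
G = rng.standard_normal((20, 3))
D = G @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

M_hat = least_squares_normal_equations(G, D)
M_ref, *_ = np.linalg.lstsq(G, D, rcond=None)  # SVD-based reference solution

R = D - G @ M_hat  # residual vector
print(np.allclose(M_hat, M_ref), np.max(np.abs(G.T @ R)))
```

Forming $G^TG$ squares the condition number of $G$, so for ill-conditioned problems a solver such as `np.linalg.lstsq` is usually the safer choice.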

<br><br>

The **likelihood function** $L$ is given by the _joint probability density function_ of $D$ given $M$, which factors into a product when the observations are independent:

$$
L(M\vert D)=f(D\vert M)=f_1(D_1\vert M)f_2(D_2\vert M)\ldots f_m(D_m\vert M)
$$

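The **maximum likelihood estimate (MLE)** is the model $M$ that maximizes the likelihood. If the data errors are independent and normally distributed, with standard deviation $\sigma_i$ for the $i$th observation $D_i$, each factor $f_i$ is proportional to $\exp\left(-(D_i-(GM)_i)^2/(2\sigma_i^2)\right)$, so maximizing $L$ is equivalent to minimizing the sum of squared residuals scaled by $\sigma_i^2$. The MLE problem therefore becomes the **weighted least squares problem**, which reduces to ordinary least squares when all $\sigma_i$ are equal:

$$
\min_M\sum_{i=1}^m \frac{(D_i-(GM)_i)^2}{\sigma_i^2}
$$

The **chi-square** statistic, evaluated at the least squares solution $\hat{M}_{L2}$, provides useful information about the quality of the model estimate:

$$
\chi^2_{obs} = \sum_{i=1}^m \frac{(D_i-(G\hat{M}_{L2})_i)^2}{\sigma_i^2}
$$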

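with $\nu=m-n$ degrees of freedom. The probability of obtaining a $\chi^2$ value as large as or larger than the observed value $\chi^2_{obs}$ is the **p-value** of the test:

$$
p=\int_{\chi^2_{obs}}^\infty f_{\chi^2}(x)dx
$$

where $f_{\chi^2}$ is the chi-square probability density with $\nu$ degrees of freedom.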

* If the p-value is very close to $0$, then the fit is poor: the model $GM=D$ may be incorrect, or the data errors may be underestimated or not normally distributed.
* If the p-value is very close to $1$, then the fit is better than the noise level warrants, and we may have overestimated the data errors.
* Otherwise the fit is consistent with our assumptions: when the model and data assumptions are correct, the p-value is uniformly distributed between $0$ and $1$.
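The weighted fit, the chi-square statistic, and its p-value can be sketched together as follows. All function names and the synthetic straight-line data are hypothetical. To keep the example NumPy-only, the p-value is estimated by sampling the chi-square distribution; this is a Monte Carlo approximation, and `scipy.stats.chi2.sf` would give the tail probability directly if SciPy is available.

```python
import numpy as np

def weighted_least_squares(G, D, sigma):
    """Divide each row by sigma_i, then solve the ordinary least squares problem."""
    Gw = G / sigma[:, None]
    Dw = D / sigma
    M_hat, *_ = np.linalg.lstsq(Gw, Dw, rcond=None)
    return M_hat

def chi_square_statistic(G, D, sigma, M_hat):
    """Sum of squared residuals, each scaled by its standard deviation."""
    return float(np.sum(((D - G @ M_hat) / sigma) ** 2))

def p_value_monte_carlo(chi2_obs, dof, n_draws=200_000, seed=0):
    """Estimate P(chi^2 >= chi2_obs) for `dof` degrees of freedom by sampling."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(dof, size=n_draws)
    return float(np.mean(draws >= chi2_obs))

# Synthetic straight-line data with known noise level (assumed for illustration).
rng = np.random.default_rng(2)
t = np.linspace(0.0, 10.0, 25)
G = np.column_stack([np.ones_like(t), t])
sigma = np.full(t.size, 0.5)
D = G @ np.array([2.0, 3.0]) + sigma * rng.standard_normal(t.size)

M_hat = weighted_least_squares(G, D, sigma)
chi2_obs = chi_square_statistic(G, D, sigma, M_hat)
dof = G.shape[0] - G.shape[1]  # nu = m - n
p = p_value_monte_carlo(chi2_obs, dof)
print(M_hat, chi2_obs, p)
```

Passing a `sigma` that is too small inflates `chi2_obs` and pushes `p` toward 0, while a `sigma` that is too large pushes `p` toward 1, matching the interpretation in the list above.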

If the residuals show systematic patterns, that is also an indication of a wrong model.

<br><br>

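We can report 95% confidence intervals for individual parameter estimates. To make a joint statement about a set of parameters we instead use a confidence region, which accounts for correlations between the estimates and extends further along each parameter axis than the individual intervals do.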
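A minimal sketch of the individual intervals, assuming independent, normally distributed data errors with known $\sigma_i$. Under that assumption the covariance of the least squares estimate is $(G_w^TG_w)^{-1}$, where $G_w$ is $G$ with row $i$ divided by $\sigma_i$, and each 95% interval is the estimate plus or minus 1.96 standard deviations. The function names and numbers are hypothetical.

```python
import numpy as np

def l2_covariance(G, sigma):
    """Covariance of the weighted least squares estimate: (Gw^T Gw)^{-1}."""
    Gw = G / sigma[:, None]
    return np.linalg.inv(Gw.T @ Gw)

def confidence_intervals_95(M_hat, cov):
    """Rows are [lower, upper] bounds: M_hat +/- 1.96 * sqrt(diag(cov))."""
    half_width = 1.96 * np.sqrt(np.diag(cov))
    return np.column_stack([M_hat - half_width, M_hat + half_width])

# Straight-line design matrix and noise level (assumed for illustration).
t = np.linspace(0.0, 10.0, 25)
G = np.column_stack([np.ones_like(t), t])
sigma = np.full(t.size, 0.5)

M_hat = np.array([2.0, 3.0])  # stand-in for a least squares estimate
cov = l2_covariance(G, sigma)
intervals = confidence_intervals_95(M_hat, cov)
print(cov)
print(intervals)
```

A nonzero off-diagonal entry in `cov` means the parameter estimates are correlated, which is the situation where a joint confidence region says more than the separate intervals.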


#### L1 Regression
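We use this when the data contain **outliers**. The model estimate is now the one that minimizes the 1-norm of the weighted residuals:

$$
\hat{M}_{L1} = \arg\min_M f(M), \qquad f(M) = \sum_{i=1}^m \frac{\vert D_i-(GM)_i\vert}{\sigma_i} = \Vert D_w-G_wM \Vert_1
$$

where $D_w$ and $G_w$ are $D$ and $G$ with the $i$th row divided by $\sigma_i$.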

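One way to solve this numerically is the **iteratively reweighted least squares (IRLS)** algorithm, which solves a sequence of weighted least squares problems whose weights are updated from the current residuals.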
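A minimal IRLS sketch for the 1-norm problem, under these assumptions: the weights are $1/\vert r_i\vert$, guarded by a small floor `eps` so that a zero residual does not cause a division by zero, the starting point is the ordinary least squares solution, and the system passed in is already scaled by $\sigma_i$. The function name, tolerances, and test data are hypothetical.

```python
import numpy as np

def irls_l1(G, D, tol=1e-8, eps=1e-8, max_iter=200):
    """Approximate the minimizer of ||D - G M||_1 by reweighted least squares."""
    M, *_ = np.linalg.lstsq(G, D, rcond=None)  # start from the L2 solution
    for _ in range(max_iter):
        r = D - G @ M
        w = 1.0 / np.maximum(np.abs(r), eps)   # weights 1/|r_i|, floored at eps
        GtW = G.T * w                          # G^T W with W = diag(w)
        M_new = np.linalg.solve(GtW @ G, GtW @ D)
        if np.linalg.norm(M_new - M) <= tol * (1.0 + np.linalg.norm(M)):
            return M_new
        M = M_new
    return M

# Points on the line y = 1 + 2 t, with one gross outlier (assumed for illustration).
t = np.arange(10.0)
G = np.column_stack([np.ones_like(t), t])
D = 1.0 + 2.0 * t
D[5] += 50.0

M_l2, *_ = np.linalg.lstsq(G, D, rcond=None)
M_l1 = irls_l1(G, D)
print("L2:", M_l2)
print("L1:", M_l1)
```

In this example the outlier drags the 2-norm fit away from the underlying line, while the 1-norm fit stays close to it.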

**Monte Carlo Error Propagation**: for solution techniques that are nonlinear or algorithmic, such as IRLS, we can't analytically propagate uncertainties in the data to uncertainties in the estimated model parameters, so we apply Monte Carlo (MC) error propagation instead. To approximate the covariance matrix, we first solve the IRLS problem for $M_{L1}$ and compute the noise-free baseline ($b$) data vector $D_b=GM_{L1}$. Then we re-solve the IRLS problem many ($q$) times with independent noise realizations $\gamma_i$ added to the baseline: $GM_{L1,i}=D_b+\gamma_i$. Let $A$ be the $q\times n$ matrix whose $i$th row contains the difference between the $i$th model estimate and the average model: $A_{i,\cdot}=M_{L1,i}^T-\bar{M}_{L1}^T$. Then the empirical covariance matrix is

$$
\mathrm{Cov}(M_{L1})=\frac{A^TA}{q}
$$
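The procedure can be sketched as below. It assumes the noise realizations are independent and normally distributed with the given $\sigma_i$, so that after dividing each row by $\sigma_i$ the noise has unit variance. The compact IRLS solver runs a fixed number of iterations for brevity, and all names and data are hypothetical.

```python
import numpy as np

def irls_l1(G, D, eps=1e-8, n_iter=50):
    """Compact IRLS for the 1-norm problem (fixed iteration count for brevity)."""
    M, *_ = np.linalg.lstsq(G, D, rcond=None)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(D - G @ M), eps)  # weights 1/|r_i|, floored
        GtW = G.T * w
        M = np.linalg.solve(GtW @ G, GtW @ D)
    return M

def monte_carlo_covariance(G, D, sigma, q=200, seed=0):
    """Empirical covariance A^T A / q of the L1 estimate under Gaussian noise."""
    rng = np.random.default_rng(seed)
    Gw = G / sigma[:, None]          # scale rows so the noise has unit variance
    Dw = D / sigma
    M_l1 = irls_l1(Gw, Dw)
    D_b = Gw @ M_l1                  # noise-free baseline data (scaled)
    estimates = np.empty((q, G.shape[1]))
    for i in range(q):
        gamma = rng.standard_normal(Dw.size)  # independent noise realization
        estimates[i] = irls_l1(Gw, D_b + gamma)
    A = estimates - estimates.mean(axis=0)    # row i: i-th estimate minus the mean
    return A.T @ A / q

# Synthetic straight-line data (assumed for illustration).
rng = np.random.default_rng(3)
t = np.linspace(0.0, 10.0, 20)
G = np.column_stack([np.ones_like(t), t])
sigma = np.full(t.size, 0.5)
D = G @ np.array([2.0, 3.0]) + sigma * rng.standard_normal(t.size)

cov_l1 = monte_carlo_covariance(G, D, sigma)
print(cov_l1)
```

The square roots of the diagonal of `cov_l1` give approximate standard deviations for the 1-norm parameter estimates; a larger `q` reduces the sampling error of the covariance estimate at the cost of more IRLS solves.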
