# Inverse problem

---
The entire forward problem of spectral line formation and observation, as described above, can also be formulated from a probabilistic point of view, as determining the probability distribution, $p(\boldsymbol{o} \, | \, \boldsymbol{m})$, over all possible observations $\boldsymbol{o}$, given a model $\boldsymbol{m}$.
Classical numerical methods typically only consider a single solution, $\boldsymbol{o} = f(\boldsymbol{m})$, which corresponds to a Dirac delta distribution, $p(\boldsymbol{o} \, | \, \boldsymbol{m}) = \delta\big(\boldsymbol{o} - f(\boldsymbol{m})\big)$.
With probabilistic numerical methods, however, the entire probability distribution can be obtained (see e.g. [De Ceuster et al. 2023](https://ui.adsabs.harvard.edu/abs/2023MNRAS.518.5536D/abstract)).
This probabilistic approach is key, for instance, to quantifying uncertainties in the forward problem, but can also help to solve the (classically non-invertible) inverse problem.
Although we might not be able to invert the forward function, $f$, we can always determine the probability of the inverse situation, $p\left(\boldsymbol{m} \, | \, \boldsymbol{o}\right)$, with Bayes' rule,
\begin{equation*}
p\left(
\boldsymbol{m} \, | \, \boldsymbol{o}
\right)
\ = \
\frac{
p\left(\boldsymbol{o} \, | \, \boldsymbol{m}\right)
p\left( \boldsymbol{m} \right)
}
{
p\left( \boldsymbol{o} \right)
} .
\end{equation*}
This allows us to determine the probability distribution over possible models, $\boldsymbol{m}$, corresponding to an observation, $\boldsymbol{o}$.
Since the denominator does not depend on the model, $\boldsymbol{m}$, we can treat it as a mere normalisation constant and only concentrate on the numerator.
Here, we find the likelihood, $p\left(\boldsymbol{o} \, | \, \boldsymbol{m}\right)$, which is related to the forward problem, and the prior, $p(\boldsymbol{m})$, which encodes our assumptions about the model, prior to the observation. 
Just as the output of the forward model, $f(\boldsymbol{m})$, can be viewed as the observation that maximises the likelihood, $p\left(\boldsymbol{o} \, | \, \boldsymbol{m}\right)$, the inverse, i.e. a reconstruction of a model based on an observation, can be viewed as the model that maximises the posterior, $p\left(\boldsymbol{m} \, | \, \boldsymbol{o}\right)$.
Determining this maximum is equivalent to minimising the negative logarithm of the posterior,
\begin{equation*}
-\log p\left(
\boldsymbol{m} \, | \, \boldsymbol{o}
\right)
\ = \
-\log p\left(\boldsymbol{o} \, | \, \boldsymbol{m}\right)
\ - \
\log p\left( \boldsymbol{m} \right)
\ + \
\log p\left( \boldsymbol{o} \right) .
\end{equation*}
Since we want to minimise this over different models, $\boldsymbol{m}$, but for a fixed observation, $\boldsymbol{o}$, the last term will be constant and thus can be neglected.
In this optimisation problem, we distinguish three types of objectives or loss functions, i.e. the functions we aim to minimise,
\begin{align*}
p\left(\boldsymbol{m} \, | \, \boldsymbol{o}\right)
\ &\equiv \
\exp
\big(
-\mathcal{L}_{\text{tot}}\left(\boldsymbol{m}, \boldsymbol{o} \right)
\big) , \\
p\left(\boldsymbol{o} \, | \, \boldsymbol{m}\right)
\ &\equiv \
\exp
\big(
-\mathcal{L}_{\text{rep}}\left(f(\boldsymbol{m}), \boldsymbol{o} \right)
\big) , \\
p\left(\boldsymbol{m}\right)
\ &\equiv \
\exp
\big(
-\mathcal{L}_{\text{reg}}\left(\boldsymbol{m} \right)
\big) ,
\end{align*}
in which $\mathcal{L}_{\text{rep}}$ is the reproduction loss, $\mathcal{L}_{\text{reg}}$ is the regularisation loss, and $\mathcal{L}_{\text{tot}}$ is the total loss.
Each of these loss functions quantifies the deviation from an objective.
Hence, neglecting the constant term, $\log p\left( \boldsymbol{o} \right)$, the negative logarithm of the posterior can be written as,
\begin{equation}
\mathcal{L}_{\text{tot}}\left(  \boldsymbol{m},  \boldsymbol{o} \right)
\ = \
\mathcal{L}_{\text{rep}}\left(f(\boldsymbol{m}), \boldsymbol{o} \right)
\ + \
\mathcal{L}_{\text{reg}}\left(  \boldsymbol{m} \right) .
\end{equation}
These equations allow one to translate a probabilistic problem about probability distributions into an optimisation problem with loss functions, and vice versa.
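
The correspondence between the posterior and the total loss can be illustrated with a minimal sketch. The forward model `forward`, the unit-variance Gaussian likelihood, and the standard normal prior below are hypothetical toy choices made only for this illustration; they are not the models used in the text.

```python
import numpy as np

def forward(m):
    # Hypothetical toy forward model f(m); it stands in for the line formation model.
    return 2.0 * m

def loss_rep(f_m, o):
    # Reproduction loss: negative log-likelihood, up to a constant (unit variance).
    return 0.5 * (f_m - o) ** 2

def loss_reg(m):
    # Regularisation loss: negative log-prior, up to a constant (standard normal prior).
    return 0.5 * m ** 2

def loss_tot(m, o):
    # Total loss: negative log-posterior, up to the constant log p(o).
    return loss_rep(forward(m), o) + loss_reg(m)

o = 3.0                                   # a single toy observation
m_grid = np.linspace(-5.0, 5.0, 10001)    # candidate models
losses = loss_tot(m_grid, o)
posterior = np.exp(-losses)               # unnormalised posterior
m_map = m_grid[np.argmin(losses)]         # minimum loss = maximum posterior
print(m_map)
```

For this toy problem the minimum can also be found analytically: setting the derivative of the total loss to zero gives $m = 6/5$, which the grid search should reproduce to within the grid spacing.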

## Reproduction loss / likelihood

---
The reproduction loss, $\mathcal{L}_{\text{rep}}$, is a measure on the space of observations that quantifies how badly a model fits the observation by measuring the discrepancy between a synthetic observation of a model, $f(\boldsymbol{m})$, and the true observation, $\boldsymbol{o}$.
In this paper, we consider a typical reproduction loss given by the squared error, weighted by the inverse of a covariance matrix, $\boldsymbol{\Sigma}$, such that,
\begin{equation*}
\mathcal{L}_{\text{rep}}\big(f(\boldsymbol{m}), \boldsymbol{o} \big)
\ = \
\frac{1}{2} \,
\big(f(\boldsymbol{m}) - \boldsymbol{o}\big)^{\text{T}} \
\boldsymbol{\Sigma}^{-1} \,
\big(f(\boldsymbol{m}) - \boldsymbol{o}\big) .
\end{equation*}
With this definition of the reproduction loss, the likelihood, up to a normalisation constant, corresponds to a multivariate Gaussian distribution.
This distribution can be used to represent the uncertainty in the observations, where the square root of the diagonal of the covariance matrix models the uncertainty per pixel or visibility.
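
As a minimal sketch of this reproduction loss, assuming the synthetic and true observations are flattened into one-dimensional NumPy arrays, one could write the following. The function name `reproduction_loss` and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def reproduction_loss(f_m, o, cov):
    # Residual between the synthetic observation f(m) and the observation o.
    r = f_m - o
    # 0.5 * r^T Sigma^{-1} r, using a linear solve instead of an explicit inverse.
    return 0.5 * r @ np.linalg.solve(cov, r)

o = np.array([1.0, 2.0, 3.0])        # toy observation (three pixels)
f_m = np.array([2.0, 2.0, 2.0])      # toy synthetic observation
sigma = np.array([0.5, 1.0, 2.0])    # assumed uncertainty per pixel
cov = np.diag(sigma ** 2)            # diagonal covariance matrix

loss_weighted = reproduction_loss(f_m, o, cov)
loss_unit = reproduction_loss(f_m, o, np.eye(3))
print(loss_weighted, loss_unit)
```

With a diagonal covariance, each squared residual is simply divided by the variance of its pixel, so pixels with smaller uncertainties weigh more heavily in the loss.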


For simplicity, in this paper, we omitted the noise on the observations, i.e. $\boldsymbol{\Sigma} = \mathbb{1}$.
However, we did find that our reconstruction method performs significantly better when the reproduction loss is split into an averaged and a relative part,
\begin{equation*}
\mathcal{L}_{\text{rep}}\big(f(\boldsymbol{m}), \boldsymbol{o} \big)
\ = \
\mathcal{L}_{\text{rep}}\Big( \big\langle f(\boldsymbol{m}) \big\rangle, \, \left\langle\boldsymbol{o}\right\rangle \Big)
\ + \
\mathcal{L}_{\text{rep}}\left( \frac{f(\boldsymbol{m})}{\big\langle f(\boldsymbol{m})\big\rangle}, \, \frac{\boldsymbol{o}}{\left\langle \boldsymbol{o}\right\rangle} \right) ,
\end{equation*}
where the brackets, $\langle\cdot\rangle$, denote an arithmetic mean along an axis of the data, for instance, the frequency. For images, when considering the mean along the frequency axis, this implies that we add the loss for the frequency-averaged intensities in each pixel and the loss for the frequency-normalised intensity in each pixel. 
In this way, all pixels contribute equally, at least in the relative part of the loss (i.e. the second term).
Without this, the algorithm has difficulty reconstructing dimmer regions in the observations, since their contribution to the loss is overpowered by the contributions of brighter regions.
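
The effect of this split can be sketched as follows, assuming $\boldsymbol{\Sigma} = \mathbb{1}$, data of shape (pixels, frequencies), and non-zero frequency-averaged intensities. The function names and the toy data are hypothetical.

```python
import numpy as np

def squared_error(a, b):
    # Reproduction loss with an identity covariance matrix.
    return 0.5 * np.sum((a - b) ** 2)

def split_reproduction_loss(f_m, o, axis=-1):
    # Means along the chosen axis (here: frequency); assumed to be non-zero.
    mean_f = np.mean(f_m, axis=axis, keepdims=True)
    mean_o = np.mean(o, axis=axis, keepdims=True)
    averaged = squared_error(mean_f, mean_o)            # first term
    relative = squared_error(f_m / mean_f, o / mean_o)  # second term
    return averaged, relative

# Two pixels with the same line shape; the first is 100 times brighter.
o = np.array([[100.0, 200.0, 100.0],
              [  1.0,   2.0,   1.0]])
# Toy synthetic observation: correct mean in each pixel, but a flat line profile.
f_m = np.broadcast_to(o.mean(axis=-1, keepdims=True), o.shape)

averaged, relative = split_reproduction_loss(f_m, o)
per_pixel_plain = [squared_error(f, p) for f, p in zip(f_m, o)]
per_pixel_relative = [split_reproduction_loss(f, p)[1] for f, p in zip(f_m, o)]
print(per_pixel_plain, per_pixel_relative)
```

In the plain squared error, the bright pixel dominates the dim one by four orders of magnitude, whereas both pixels contribute equally to the relative part.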

## Regularisation loss / prior

---
The regularisation loss, $\mathcal{L}_{\text{reg}}$, is a measure on the space of models that quantifies how poorly a model fits our prior assumptions.
We consider a regularisation loss, or a corresponding prior, $p(\boldsymbol{m})$, that consists of different parts, each encoding a different assumption about our model.
Below, we will present the types of regularisation that we consider in pomme.
However, not every assumption will always be necessary.
Different combinations can be used for different reconstructions.

Often when solving inverse problems, to avoid over-fitting, one assumes a certain degree of regularity or smoothness of the solution.
In this paper, for each model parameter, $q(\boldsymbol{x})$, with a spatial dependence, $\boldsymbol{x}$, we use the integrated squared Euclidean norm of its gradient,
\begin{equation}
    \mathcal{L}_{\text{reg}}[q] \ = \ \int \text{d} \boldsymbol{x} \ \| \nabla q(\boldsymbol{x})\|^{2} ,
\end{equation}
to quantify its deviation from smoothness.
The main reason for this particular choice of smoothness measure is its straightforward numerical implementation.
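
To illustrate how straightforward the numerical implementation can be, here is a minimal sketch on a regular Cartesian grid. It assumes a uniform grid spacing `dx` and uses the finite differences of `numpy.gradient`; this discretisation is an assumption of the sketch and not necessarily the one used in pomme.

```python
import numpy as np

def smoothness_loss(q, dx=1.0):
    # Squared Euclidean norm of the gradient, summed over all spatial axes.
    grad_sq = sum(np.gradient(q, dx, axis=k) ** 2 for k in range(q.ndim))
    # Approximate the integral by a sum over cells times the cell volume.
    return np.sum(grad_sq) * dx ** q.ndim

n, dx = 8, 0.5
x = np.arange(n) * dx
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")

loss_const = smoothness_loss(np.ones((n, n, n)), dx)   # perfectly smooth field
loss_linear = smoothness_loss(3.0 * X, dx)             # constant gradient of 3
print(loss_const, loss_linear)
```

A constant field gives a vanishing loss, while the linear field gives the squared slope times the volume covered by the grid cells.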

For some observations it can also be useful to make assumptions about symmetries, and, in particular, about spherical symmetry. 
Therefore, we implemented a loss function that can quantify deviations from spherical symmetry.
Given an origin point in the model, $\boldsymbol{x}_{O}$, this loss quantifies the average variance within a predetermined set of spherical shells around the origin point,
\begin{equation}
    \mathcal{L}_{\text{sph}}[q]
    \ = \
    \int_{0}^{\infty} \text{d} r \
    \mathbb{V}\big[q(\boldsymbol{x}) \, \big| \, \|\boldsymbol{x}-\boldsymbol{x}_{O}\|=r \big] .
\end{equation}
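
A discretised version of this loss can be sketched as follows, assuming the model is sampled at points with known coordinates and that the shells are defined by a hypothetical array of radial bin edges, `r_edges`. Points outside the outermost shell are ignored in this sketch.

```python
import numpy as np

def sphericity_loss(q, coords, origin, r_edges):
    # Distance of every point to the origin point x_O.
    r = np.linalg.norm(coords - origin, axis=-1)
    loss = 0.0
    for r_in, r_out in zip(r_edges[:-1], r_edges[1:]):
        in_shell = (r >= r_in) & (r < r_out)
        if np.count_nonzero(in_shell) > 1:
            # Variance within the shell, weighted by the shell width dr.
            loss += np.var(q[in_shell]) * (r_out - r_in)
    return loss

x = np.linspace(-1.0, 1.0, 21)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
coords = np.stack([X, Y, Z], axis=-1)
origin = np.zeros(3)
r_edges = np.linspace(0.0, 1.0, 6)

r = np.linalg.norm(coords - origin, axis=-1)
q_sym = np.digitize(r, r_edges).astype(float)   # constant within each shell
q_asym = coords[..., 0]                         # depends on direction, not only on r

loss_sym = sphericity_loss(q_sym, coords, origin, r_edges)
loss_asym = sphericity_loss(q_asym, coords, origin, r_edges)
print(loss_sym, loss_asym)
```

Note that, with shells of finite width, even a perfectly spherically symmetric field generally yields a small non-zero loss, since it can still vary with radius within a shell; the symmetric test field above is therefore chosen to be constant per shell.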

Next, we consider a loss that can encode the physical laws that govern the variables in our model, i.e. the distributions of $\rho$, $\boldsymbol{v}$, and $T$.
Not every configuration of $\rho$, $\boldsymbol{v}$, and $T$ is equally likely to occur, since we expect any configuration to be the result of hydrodynamic evolution from some initial conditions.
We assume this hydrodynamic evolution to be governed by the conservation of mass, momentum, and energy, which can be formalised, in Eulerian form, as,
\begin{align*}
\frac{\partial \rho}{\partial t} \ + \ \nabla \cdot \left( \rho \, \boldsymbol{v} \right) \ &= \ 0 , \\
\frac{\partial \boldsymbol{v}}{\partial t} \ + \ \left( \boldsymbol{v} \cdot \nabla \right) \boldsymbol{v} \ + \ \frac{1}{\rho} \, \nabla P \ + \ \nabla \Phi \ &= \ 0 , \\
\frac{\partial E}{\partial t} \ + \ \nabla \cdot \big( \left(E + P\right) \boldsymbol{v} \big) \ + \ \Lambda \ &= \ 0 . 
\end{align*}
Here, we defined the total energy density,
\begin{equation*}
E \ = \ \frac{1}{2} \, \rho \, \boldsymbol{v}^{2} \ + \ \rho \, u,
\end{equation*}
consisting of a kinetic term and a thermal term with specific internal energy, $u$, which can be related to the pressure, $P$, assuming an equation of state,
\begin{equation*}
P \ = \ \left( \gamma - 1 \right) \rho \, u ,
\end{equation*}
where $\gamma$ is the adiabatic index, which is related to the internal degrees of freedom in the gas.
The pressure, $P$, in turn, can be related to the temperature, $T$, through the ideal gas law,
\begin{equation*}
P \ = \ \frac{k_{\text{B}}}{\mu} \, \rho \, T ,
\end{equation*}
in which $k_{\text{B}}$ is Boltzmann's constant and $\mu$ is the mean molecular weight of the gas.
The remaining components, $\Phi$ and $\Lambda$, describe the gravitational potential and the cooling function respectively. 
Often, there is, for instance, a central object for which we have a mass estimate, $M$, and an estimated location, $\boldsymbol{x}_{\text{grav}}$ (model parameters such as $M$ and $\boldsymbol{x}_{\text{grav}}$ can also be treated as free parameters that are fitted to the observations by minimising the loss, i.e. maximising the posterior).
In that case, we can assume a gravitational potential of the form,
\begin{equation*}
\Phi(\boldsymbol{x}) \ = \ - \frac{GM}{\| \boldsymbol{x} - \boldsymbol{x}_{\text{grav}} \|} .
\end{equation*}
Note that this ignores the self-gravitation of the density distribution $\rho$.
The cooling function is often more difficult to estimate, since it depends on many other parameters of the astrophysical object.
Without additional prior knowledge, one can assume, for instance, $\Lambda = 0$.
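
The point-mass potential above can be evaluated on a set of positions with a few lines of NumPy. This is a minimal sketch in SI units; the function name and the example values are hypothetical, and the potential diverges at the location of the central object, which a practical implementation would have to handle.

```python
import numpy as np

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2

def point_mass_potential(x, x_grav, M):
    # Distance of each position (last axis holds the coordinates) to the central object.
    r = np.linalg.norm(x - x_grav, axis=-1)
    # Potential of a point mass; self-gravity of the gas is ignored.
    return -G * M / r

x_grav = np.zeros(3)
M = 2.0e30                                   # example mass in kg
x = np.array([[1.0e11, 0.0, 0.0],
              [2.0e11, 0.0, 0.0]])           # two example positions in m

phi = point_mass_potential(x, x_grav, M)
print(phi)
```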


The hydrodynamics equations provide five component equations that describe the time-evolution of the five components of our model variables, $\rho$, $\boldsymbol{v}$, and $T$.
However, we do not aim to describe the entire time evolution, but rather a snapshot in time, based on an observation.
As a result, our models lack time dependence and we cannot simply enforce the hydrodynamic equations as prior assumptions.
Instead, we want to know, at any given time, what is the most likely state to find our model in, given that its evolution is governed by the hydrodynamic equations.
One way to do this is to consider time averages of each time-dependent variable, $q(t)$, defined as,
\begin{equation*}
\left\langle q \right \rangle_{T}
\ \equiv \
\frac{1}{T} \int_{0}^{T} \text{d} t \ q(t) .
\end{equation*}
For any bounded function, i.e. one for which there exist $q_{\min}, q_{\max} \in \mathbb{R}$ such that $q_{\min} \leq q(t) \leq q_{\max}$ for all $t$, the time average of its time derivative vanishes for sufficiently large time intervals, $T \rightarrow \infty$, since the integral of the derivative equals $q(T) - q(0)$, which is bounded by $q_{\max} - q_{\min}$, i.e.
\begin{equation*}
\left|
\lim_{T \rightarrow \infty}
\left\langle \frac{\partial q}{\partial t} \right \rangle_{T}
\right|
\ = \
\left|
\lim_{T \rightarrow \infty}
\left(
\frac{1}{T} \int_{0}^{T} \text{d} t \ \frac{\partial q}{\partial t}
\right)
\right|
\ \leq \
\lim_{T \rightarrow \infty}
\left(
\frac{q_{\max} - q_{\min}}{T}
\right)
\ = \ 0 .
\end{equation*}
As a result, one can argue that the time-averaged state of a model, which we use as an estimator for the expected state of a model at any time, is a steady-state solution, i.e. a solution with vanishing time derivatives, $\partial_{t} \rho = 0$, $\partial_{t} \boldsymbol{v} = 0$, and $\partial_{t} T = 0$.
It is also quite intuitive that it is more likely to observe a system in a state in which its time evolution is slow, since it spends comparatively more time in those states.
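
The vanishing time average of the time derivative can be checked numerically for a simple bounded function. The sketch below assumes the toy choice $q(t) = \sin(t)$, for which $q_{\max} - q_{\min} = 2$, and approximates the time integral with the trapezoidal rule; both are choices made only for this illustration.

```python
import numpy as np

def time_average(values, t):
    # Trapezoidal approximation of (1/T) * integral of values over [t[0], t[-1]].
    dt = np.diff(t)
    integral = np.sum(0.5 * (values[1:] + values[:-1]) * dt)
    return integral / (t[-1] - t[0])

results = []
for t_end in (10.0, 100.0, 1000.0):
    t = np.linspace(0.0, t_end, int(100 * t_end) + 1)
    dq_dt = np.cos(t)                 # time derivative of q(t) = sin(t)
    avg = time_average(dq_dt, t)      # approximately (q(T) - q(0)) / T
    bound = 2.0 / t_end               # (q_max - q_min) / T
    results.append((t_end, abs(avg), bound))

for t_end, avg, bound in results:
    print(t_end, avg, bound)
```

The absolute time average stays below the bound, and the bound itself shrinks as the averaging interval grows.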
Assuming a steady state, the hydrodynamic equations can be rewritten in terms of our model variables, $\rho$, $\boldsymbol{v}$, and $T$,
\begin{align*}
\nabla \cdot \left( \rho \, \boldsymbol{v} \right) \ &= \ 0, \\
\left( \boldsymbol{v} \cdot \nabla \right) \boldsymbol{v} \ + \ \frac{k_{\text{B}}T}{\mu} \, \nabla \big( \log \rho  \, + \, \log T \big) \ + \ \nabla \Phi \ &= \ 0, \\
\rho \, \boldsymbol{v} \cdot \nabla \left( \frac{1}{2} \, \boldsymbol{v}^{2} \ + \ \frac{\gamma}{\gamma-1} \frac{k_{\text{B}}T}{\mu} \right) \ + \ \Lambda \ &= \ 0 .
\end{align*}
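
The intermediate steps behind the pressure and energy terms are not spelled out above. As a brief sketch, assuming that $\mu$ and $\gamma$ do not vary in space, and introducing the shorthand $h$ (a symbol used only here), the ideal gas law and the equation of state give,
\begin{align*}
\frac{1}{\rho} \, \nabla P
\ &= \
\frac{k_{\text{B}}}{\mu} \, \frac{1}{\rho} \, \nabla \left( \rho \, T \right)
\ = \
\frac{k_{\text{B}}T}{\mu} \, \frac{\nabla \left( \rho \, T \right)}{\rho \, T}
\ = \
\frac{k_{\text{B}}T}{\mu} \, \nabla \big( \log \rho \, + \, \log T \big) , \\
E + P
\ &= \
\frac{1}{2} \, \rho \, \boldsymbol{v}^{2} \ + \ \frac{P}{\gamma - 1} \ + \ P
\ = \
\rho \left( \frac{1}{2} \, \boldsymbol{v}^{2} \ + \ \frac{\gamma}{\gamma-1} \frac{k_{\text{B}}T}{\mu} \right)
\ \equiv \
\rho \, h , \\
\nabla \cdot \big( \left(E + P\right) \boldsymbol{v} \big)
\ &= \
h \, \nabla \cdot \left( \rho \, \boldsymbol{v} \right) \ + \ \rho \, \boldsymbol{v} \cdot \nabla h
\ = \
\rho \, \boldsymbol{v} \cdot \nabla h ,
\end{align*}
where the last equality uses the steady-state continuity equation, $\nabla \cdot \left( \rho \, \boldsymbol{v} \right) = 0$. Substituting these expressions into the momentum and energy equations, with vanishing time derivatives, gives the three steady-state equations above.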
Since these equations only contain (spatial derivatives of) our model variables, we can enforce them as an assumption on our model by defining the following loss functions,
\begin{align*}
\mathcal{L}_{\rho}(\boldsymbol{m})
\ &\equiv \
\int \text{d}\boldsymbol{x} \ \frac{1}{\rho^{2}} \big( \nabla \cdot \left( \rho \, \boldsymbol{v} \right) \big)^{2} , \\
\mathcal{L}_{v_{i}}(\boldsymbol{m})
\ &\equiv \
\int \text{d}\boldsymbol{x} \ \frac{1}{v_{i}^{2}}  \left( \left( \boldsymbol{v} \cdot \nabla \right) v_{i} \ + \ \frac{k_{\text{B}}T}{\mu} \, \nabla_{i} \big( \log \rho  \, + \, \log T \big) \ + \ \nabla_{i} \Phi \right)^{2} , \\
\mathcal{L}_{E}(\boldsymbol{m}) \ &\equiv \
\int \text{d}\boldsymbol{x} \ \frac{1}{E^{2}} \left( \rho \, \boldsymbol{v} \cdot \nabla \left( \frac{1}{2} \, \boldsymbol{v}^{2} \ + \ \frac{\gamma}{\gamma-1} \frac{k_{\text{B}}T}{\mu} \right)
\ + \ \Lambda \right)^{2} .
\end{align*}
We included normalisation factors, $1/\rho^{2}$, $1/v_{i}^{2}$, and $1/E^{2}$, to ensure that the normalised residuals in all loss functions have the same unit of inverse time.
These loss functions can be combined to define a regularisation loss,
\begin{equation*}
\mathcal{L}_{\text{reg}}\left(  \boldsymbol{m} \right)
\ = \
w_{\rho} \, \mathcal{L}_{\rho} \left(  \boldsymbol{m} \right)
\ + \
\boldsymbol{w}_{\boldsymbol{v}} \cdot \boldsymbol{\mathcal{L}}_{\boldsymbol{v}}\left(  \boldsymbol{m} \right)
\ + \
w_{E} \, \mathcal{L}_{E}\left(  \boldsymbol{m} \right) ,
\end{equation*}
which also defines a prior distribution.
The weights, $w_{\rho}$, $\boldsymbol{w}_{\boldsymbol{v}}$, and $w_{E}$, are hyper-parameters of the prior that determine the width of the distribution around each of the steady-state constraints and can, furthermore, be used to weigh their relative importance.

We should note that the loss originating from the continuity equation, in practice, is by far the most useful, since it does not depend on external parameters, like the gravitational potential ($\Phi$) or the cooling function ($\Lambda$), which are often difficult to describe accurately.
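
Since the continuity loss only requires the density and velocity fields, it is simple to sketch on a regular Cartesian grid. The example below assumes a uniform grid spacing `dx`, a velocity array of shape `(3, nx, ny, nz)`, and finite differences from `numpy.gradient`; the function names and test fields are hypothetical.

```python
import numpy as np

def divergence(field, dx):
    # Sum of the derivative of each vector component along its own axis.
    return sum(np.gradient(field[i], dx, axis=i) for i in range(field.shape[0]))

def continuity_loss(rho, v, dx=1.0):
    # Residual of the steady-state continuity equation, normalised by the density.
    residual = divergence(rho * v, dx) / rho
    # Approximate the integral by a sum over cells times the cell volume.
    return np.sum(residual ** 2) * dx ** rho.ndim

n, dx = 8, 0.5
x = np.arange(n) * dx
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")

# Steady flow along x: rho * v_x = 1 everywhere, so the loss should vanish.
rho_steady = 1.0 / (1.0 + X)
v_steady = np.zeros((3, n, n, n))
v_steady[0] = 1.0 + X
loss_steady = continuity_loss(rho_steady, v_steady, dx)

# Uniform density with an expanding velocity field: div(rho v) = 1 everywhere.
rho_uniform = np.ones((n, n, n))
v_expand = np.zeros((3, n, n, n))
v_expand[0] = X
loss_expand = continuity_loss(rho_uniform, v_expand, dx)
print(loss_steady, loss_expand)
```

The first field conserves mass in steady state and gives a loss that vanishes up to rounding errors, whereas the second field violates continuity everywhere and is penalised accordingly.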


Many other types of regularisation loss can be considered.
In principle, any equation involving the model parameters, say, $a(\boldsymbol{m})=b(\boldsymbol{m})$, can be enforced on the model as regularisation or prior by including a loss term proportional to a monotonically increasing function of $|a(\boldsymbol{m})-b(\boldsymbol{m})|$.
This can be used, for instance, for non-LTE line radiative transfer, to enforce the statistical equilibrium equations on the level populations.
Since this requires an efficient way to compute the mean intensities in the line, which poses a challenge for our current implementation, we postpone this to future work and limit ourselves currently to LTE line radiative transfer.  
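
The generic recipe described above, a loss term that increases monotonically with $|a(\boldsymbol{m})-b(\boldsymbol{m})|$, can be sketched as follows. The function name `constraint_loss`, the choice of penalties, and the isothermal toy constraint, $T(\boldsymbol{x}) = T_{0}$, are hypothetical and serve only as an illustration.

```python
import numpy as np

def constraint_loss(a, b, penalty=np.square):
    # Sum of a monotonically increasing penalty of |a(m) - b(m)| over all points.
    return np.sum(penalty(np.abs(a - b)))

# Toy constraint: the temperature should equal a reference value T0 everywhere.
T = np.array([10.0, 10.0, 12.0])   # example temperatures in the model
T0 = 10.0                          # hypothetical reference temperature

loss_quadratic = constraint_loss(T, T0)                        # squared deviation
loss_absolute = constraint_loss(T, T0, penalty=lambda d: d)    # absolute deviation
loss_satisfied = constraint_loss(np.full(3, T0), T0)           # constraint holds exactly
print(loss_quadratic, loss_absolute, loss_satisfied)
```

The choice of penalty determines the shape of the corresponding prior: a quadratic penalty corresponds to a Gaussian prior around the constraint, while an absolute penalty corresponds to a Laplace prior.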