## Bayesian Inference of the brain and the derivation of the Free Energy Principle



$\DeclareMathOperator*{\argmax}{arg\,max}$
The idea of the Free Energy Principle has its foundations in Bayesian inference, specifically variational Bayesian methods, as well as the theory of predictive coding. 'Inference' in this case means inferring the probability of a hidden state (or hypothesis) given an observation (evidence). This brings us to Bayes' formula:

**$p(H|E) = \frac{p(E|H)p(H)}{p(E)}$**

To see how this equation works in practice, let's introduce an example, taken from [Bogacz's paper](https://www.sciencedirect.com/science/article/pii/S0022249615000759).

Consider a simple life form with one photo-sensitive receptor. This life form scuttles around trying to determine the size of objects around it based on reflected light. Let the size of an object be $v$. This simple life form has a difficult time perceiving the correct light intensity, as its photo-receptor provides a noisy input. Therefore, when the size of an object is $v$, the organism perceives a normally distributed light intensity $p(u|v) = N(g(v), \Sigma_u)$, where $g$ is a mapping of size to light intensity approximated by the organism. For now let $g(v) = v^2$. Additionally, the organism has a prior assumption on the sizes of objects around it. For simplicity, let's imagine this prior as a normal distribution with mean $v_p$ and variance $\Sigma_p$, so that $p(v) = N(v_p,\Sigma_p)$. We now have all the components necessary to construct Bayes' theorem with $u$ as the evidence and $v$ as the hypothesis:


**$p(v|u) = \frac{p(u|v)p(v)}{p(u)}$**


This, however, presents a problem. Calculating the numerator is easy, as it is the product of two known normal densities (stated above). The denominator, however, is given by the law of total probability as $p(u) = \int p(u|v)p(v)dv$. In many scenarios such integrals are intractable, and they are especially implausible for organisms that must compute with neuronal connections. Therefore, the posterior distribution $p(v|u)$ needs to be calculated some other way. This is where Variational Bayes comes in. We might not be able to calculate the exact posterior, but we can try to approximate it.

To start, let's consider a very coarse approximation: determining only the mode of the posterior. In the case of our organism, the mode is important as it represents the most likely $v$ given the observed $u$ and the prior. It is reasonable to think that it is both realistic and useful for organisms to entertain only the most likely hypothesis rather than the probabilities of all possible hypotheses. We therefore want to find the value of $v$ at which the posterior is maximal. Let this value be $\phi = \argmax_v p(v|u)$. The important point about estimating just one value of $v$ is that the denominator $p(u) = \int p(u|v)p(v)dv$ does not depend on $v$: it is the same constant whichever value of $v$ we evaluate the posterior at. Finding $\phi$ then simply becomes a case of finding the value of $v$ which maximises the numerator $p(u|v)p(v)$. Putting this all together, we have:

$\phi = \argmax_v p(v|u) = \argmax_v p(u|v)p(v)$
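The claim that the denominator does not move the maximum can be checked numerically. The following is a minimal sketch that evaluates the numerator on a grid of candidate sizes and approximates $p(u)$ with a Riemann sum. The specific values of `u`, `v_p`, `sigma_p` and `sigma_u`, as well as the grid range, are illustrative assumptions and do not come from the text above.

```python
import numpy as np


def gaussian(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def g(v):
    """Assumed mapping from object size to light intensity, g(v) = v^2."""
    return v ** 2


# Illustrative values (assumptions, chosen only for this sketch).
u, sigma_u = 2.0, 1.0      # observed light intensity and its noise variance
v_p, sigma_p = 3.0, 1.0    # prior mean and prior variance of the size

# Grid of candidate sizes v.
v_grid = np.linspace(0.01, 5.0, 500)
dv = v_grid[1] - v_grid[0]

# Numerator of Bayes' theorem: p(u|v) p(v).
numerator = gaussian(u, g(v_grid), sigma_u) * gaussian(v_grid, v_p, sigma_p)

# Denominator p(u), approximated by a Riemann sum over the grid.
evidence = np.sum(numerator) * dv
posterior = numerator / evidence

# The mode is the same whether or not we divide by p(u).
phi_numerator = v_grid[np.argmax(numerator)]
phi_posterior = v_grid[np.argmax(posterior)]
print(phi_numerator, phi_posterior)
```

A grid like this is only feasible because $v$ is one-dimensional here. It is meant as a check on the argument, not as something the organism could plausibly do.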

From here, all we need to do is determine $\phi$ iteratively, which can be achieved via gradient ascent. We work with the logarithm of the numerator because the logarithm is monotonically increasing, so it has its maximum at the same point as $p(u|v)p(v)$, and because it is an easier function to work with: the exponentials in the normal densities disappear.

$F = \ln\big(p(u|\phi)p(\phi)\big)$

$= \ln p(\phi) + \ln p(u|\phi)$

$= \ln \bigg[\frac{1}{\sqrt{2\pi\Sigma_p}}\exp\bigg(-\frac{(\phi - v_p)^2}{2\Sigma_p}\bigg)\bigg] + \ln \bigg[\frac{1}{\sqrt{2\pi\Sigma_u}}\exp\bigg(-\frac{(u - g(\phi))^2}{2\Sigma_u}\bigg)\bigg]$

$...$

$...$

$ = \frac{1}{2}\bigg(-\frac{(\phi - v_p)^2}{\Sigma_p} - \frac{(u-g(\phi))^2}{\Sigma_u}\bigg) + C$
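For readers who want the skipped steps, one way to fill them in is sketched here. It uses only $\ln(ab) = \ln a + \ln b$ and $\ln\exp(x) = x$, and writes out the constant $C$ explicitly:

$$
\begin{aligned}
F &= -\tfrac{1}{2}\ln(2\pi\Sigma_p) - \frac{(\phi - v_p)^2}{2\Sigma_p} - \tfrac{1}{2}\ln(2\pi\Sigma_u) - \frac{(u - g(\phi))^2}{2\Sigma_u} \\
&= \frac{1}{2}\bigg(-\frac{(\phi - v_p)^2}{\Sigma_p} - \frac{(u-g(\phi))^2}{\Sigma_u}\bigg) + C, \qquad C = -\tfrac{1}{2}\ln(2\pi\Sigma_p) - \tfrac{1}{2}\ln(2\pi\Sigma_u)
\end{aligned}
$$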

<br/>

In the last line, all the terms not involving $\phi$ have been incorporated into a constant $C$, as they will disappear when the derivative with respect to $\phi$ is taken:
<br/>
<br/>
$ \frac{\partial F}{\partial \phi} = \frac{v_p - \phi}{\Sigma_p} + \frac{u-g(\phi)}{\Sigma_u}g'(\phi)$
<br/>
With this equation we can perform gradient ascent until $\phi$ converges on a value. This will be the value which maximises $F$ and therefore maximises $p(v|u)$.
Looking at the form of the equation, it is evident that the gradient of $F$ is influenced in two different ways: first by how much the approximated hypothesis $\phi$ differs from the prior mean $v_p$, and second by how much the observation $u$ differs from the expected observation $g(\phi)$ given the approximated hypothesis. There is a tradeoff happening here. The prior and the [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) are each pulling the posterior mode toward their mean values, with the result ultimately being a kind of weighted average of the two. The weighting is determined by the variances of the two terms: each difference is divided by its variance, so a higher variance results in a less 'reliable' contribution to the posterior. This makes sense, as the noisier the prior or observation is, the less one would want it to contribute to the inference of a posterior hypothesis.
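The gradient ascent described above can be sketched in a few lines of Python. This is a minimal illustration and not a definitive implementation: the numerical values, the step size, the number of steps and the choice to start from the prior mean are all assumptions made for the example. The second call swaps in a hypothetical linear mapping $g(v) = v$, for which the fixed point can be compared with the variance-weighted average of $v_p$ and $u$.

```python
def grad_F(phi, u, v_p, sigma_p, sigma_u, g, g_prime):
    """dF/dphi = (v_p - phi)/Sigma_p + (u - g(phi))/Sigma_u * g'(phi)."""
    return (v_p - phi) / sigma_p + (u - g(phi)) / sigma_u * g_prime(phi)


def map_estimate(u, v_p, sigma_p, sigma_u, g, g_prime, step=0.01, n_steps=5000):
    """Plain gradient ascent on F, started from the prior mean (an assumption)."""
    phi = v_p
    for _ in range(n_steps):
        phi += step * grad_F(phi, u, v_p, sigma_p, sigma_u, g, g_prime)
    return phi


# Quadratic mapping from the text, g(v) = v^2, with illustrative values.
phi_quadratic = map_estimate(
    u=2.0, v_p=3.0, sigma_p=1.0, sigma_u=1.0,
    g=lambda v: v ** 2, g_prime=lambda v: 2.0 * v,
)

# Hypothetical linear mapping g(v) = v, used only to expose the weighting.
phi_linear = map_estimate(
    u=2.0, v_p=3.0, sigma_p=1.0, sigma_u=0.25,
    g=lambda v: v, g_prime=lambda v: 1.0,
)

# For g(v) = v, setting dF/dphi = 0 gives an inverse-variance weighted average.
weighted_average = (3.0 / 1.0 + 2.0 / 0.25) / (1.0 / 1.0 + 1.0 / 0.25)

print(phi_quadratic, phi_linear, weighted_average)
```

In the linear case the observation has the smaller variance, so the estimate lands closer to $u$ than to $v_p$, which is the tradeoff described above.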

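To frame this in terms of neuronal activity, let us denote the two terms in the derivative of $F$ (up to the sign of the prior term) as:

$\epsilon_p = \frac{\phi - v_p}{\Sigma_p}$

$\epsilon_u = \frac{u - g(\phi)}{\Sigma_u}$

so that $\frac{\partial F}{\partial \phi} = \epsilon_u g'(\phi) - \epsilon_p$.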

As alluded to above, these can be viewed as **weighted prediction errors** and could be realised in a simple neuronal structure as follows:

<center><img src="my_icons/simple_network.png" width="400" height="400"/></center>
    

Here lines with arrows denote excitatory connections while lines with circles denote inhibitory connections. The circular connection between the prediction error nodes and the inference node allows for iterative update of all three as the maximum posterior is calculated. The update equations of the three inner nodes are as follows:

$\phi_{new} = \phi_{old} + a(\epsilon_ug'(\phi) - \epsilon_p)$


$\epsilon_{u_{new}} = \epsilon_{u_{old}} + a(u - g(\phi_{new}) - \epsilon_{u_{old}}\Sigma_u)$

$\epsilon_{p_{new}} = \epsilon_{p_{old}} + a(\phi_{new} - v_p - \epsilon_{p_{old}}\Sigma_p)$

where $0<a<1$ is a step size chosen to allow for better convergence. Note that in the update equation for $\phi$, the quantity from the excitatory 'evidence' neuron, $\epsilon_ug'(\phi)$, is added, while the quantity from the inhibitory 'prior' neuron, $\epsilon_p$, is subtracted. This exemplifies the two components 'pulling' the posterior in different directions, as stated above.
Also note that the error neurons are self-inhibited in proportion to their respective variances, which represent the weighting on the components used in calculating the inferred hypothesis.
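The three update equations can be iterated directly to see the network settle. The sketch below assumes $g(v) = v^2$ as in the text; the numerical values, the step size `a`, the number of iterations and the initial state (inference node at the prior mean, error nodes at zero) are assumptions made for illustration.

```python
def g(v):
    """Assumed mapping from size to light intensity, g(v) = v^2."""
    return v ** 2


def g_prime(v):
    """Derivative of g."""
    return 2.0 * v


def simulate_network(u, v_p, sigma_p, sigma_u, a=0.01, n_steps=5000):
    """Iterate the update equations for phi, eps_u and eps_p."""
    phi, eps_u, eps_p = v_p, 0.0, 0.0  # assumed initial state
    for _ in range(n_steps):
        # Inference node: excited by the evidence error, inhibited by the prior error.
        phi = phi + a * (eps_u * g_prime(phi) - eps_p)
        # Error nodes: driven by their mismatch, self-inhibited by their variance.
        eps_u = eps_u + a * (u - g(phi) - eps_u * sigma_u)
        eps_p = eps_p + a * (phi - v_p - eps_p * sigma_p)
    return phi, eps_u, eps_p


phi, eps_u, eps_p = simulate_network(u=2.0, v_p=3.0, sigma_p=1.0, sigma_u=1.0)
print(phi, eps_u, eps_p)
```

At convergence the error nodes stop changing, which forces $\epsilon_u = (u - g(\phi))/\Sigma_u$ and $\epsilon_p = (\phi - v_p)/\Sigma_p$, so the network recovers the definitions of the weighted prediction errors without ever dividing by a variance explicitly.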


The next step for the organism would be to adjust its prior and likelihood so that they better represent new observations. In this case, that means changing $v_p$, $\Sigma_p$ and $\Sigma_u$, as well as sometimes the transformation function $g$. Essentially, the organism would like to maximise $p(u)$ on average. If it is constantly inferring hypotheses to which it assigns little prior probability, $p(v)$, and little likelihood, $p(u|v)$, perhaps it should adjust the parameters that determine these two factors. In Bayesian inference, $p(u)$ is known as the model evidence, also called the marginal likelihood. However, trying to maximise $p(u)$ once again leaves us with the problem of the intractable marginalisation $p(u) = \int p(u|v)p(v)dv$.

To avoid this we can turn to maximising a related and familiar expression: $p(u, \phi)$. At first glance it might not seem clear why maximising this term, rather than the whole model evidence, is useful. Remember that $p(u)$ is essentially an average of all possible likelihoods, weighted by the prior. However, given that we are not attempting to calculate the entire posterior distribution, but just one point of it, it is reasonable to care only about maximising the model evidence with respect to this point. Note that $p(u,\phi) = p(u|\phi)p(\phi)$, so that $\ln p(u,\phi) = F$. Therefore, maximising $F$ with respect to the parameters mentioned above maximises the model evidence at the posterior mode, $\phi$. The terms $-\frac{1}{2}\ln\Sigma_p$ and $-\frac{1}{2}\ln\Sigma_u$, which are constant with respect to $\phi$ and were absorbed into $C$ earlier, do depend on the variances and so contribute to the derivatives here. The update to each parameter would therefore be proportional to the derivative of $F$ with respect to it:

$\frac{\partial F}{\partial v_p} = \frac{\phi - v_p}{\Sigma_p}$

$\frac{\partial F}{\partial \Sigma_p} = \frac{1}{2}\bigg(\frac{(\phi - v_p)^2}{\Sigma_p^2} - \frac{1}{\Sigma_p}\bigg)$

$\frac{\partial F}{\partial \Sigma_u} = \frac{1}{2}\bigg(\frac{(u - g(\phi))^2}{\Sigma_u^2} - \frac{1}{\Sigma_u}\bigg)$
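As a sanity check on these three derivatives, they can be compared against central finite differences of $F$. This is a minimal sketch: $F$ is written out with the $\ln\Sigma$ terms included (only the $\ln 2\pi$ constants are dropped), $g(v) = v^2$ is assumed as in the text, and the parameter values and the step `h` are arbitrary illustrative choices.

```python
import math


def g(v):
    """Assumed mapping from size to light intensity, g(v) = v^2."""
    return v ** 2


def F(phi, u, v_p, sigma_p, sigma_u):
    """ln p(u, phi) up to an additive constant that contains no parameters."""
    return (
        -0.5 * math.log(sigma_p) - (phi - v_p) ** 2 / (2.0 * sigma_p)
        - 0.5 * math.log(sigma_u) - (u - g(phi)) ** 2 / (2.0 * sigma_u)
    )


def analytic_gradients(phi, u, v_p, sigma_p, sigma_u):
    """The three derivatives as given in the text."""
    d_v_p = (phi - v_p) / sigma_p
    d_sigma_p = 0.5 * ((phi - v_p) ** 2 / sigma_p ** 2 - 1.0 / sigma_p)
    d_sigma_u = 0.5 * ((u - g(phi)) ** 2 / sigma_u ** 2 - 1.0 / sigma_u)
    return d_v_p, d_sigma_p, d_sigma_u


def numeric_gradients(phi, u, v_p, sigma_p, sigma_u, h=1e-6):
    """Central finite differences of F with respect to v_p, Sigma_p, Sigma_u."""
    d_v_p = (F(phi, u, v_p + h, sigma_p, sigma_u)
             - F(phi, u, v_p - h, sigma_p, sigma_u)) / (2.0 * h)
    d_sigma_p = (F(phi, u, v_p, sigma_p + h, sigma_u)
                 - F(phi, u, v_p, sigma_p - h, sigma_u)) / (2.0 * h)
    d_sigma_u = (F(phi, u, v_p, sigma_p, sigma_u + h)
                 - F(phi, u, v_p, sigma_p, sigma_u - h)) / (2.0 * h)
    return d_v_p, d_sigma_p, d_sigma_u


# Arbitrary illustrative values (assumptions).
params = dict(phi=1.6, u=2.0, v_p=3.0, sigma_p=1.5, sigma_u=0.8)
analytic = analytic_gradients(**params)
numeric = numeric_gradients(**params)
print(analytic)
print(numeric)
```

The $-\frac{1}{\Sigma}$ terms in the variance derivatives come from the $\ln\Sigma$ terms in $F$. Without them, the updates would push both variances upward without bound.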

    
Additionally, the mapping $g$ may be incorrect. In its most complex form, updating this function would require the approximating power of a separate neural network. In its simplest form, the functional form is static but weighted by a parameter $\theta$, which can represent the strength of the connection between the neurons that pass the transformed data, in this case between the neurons computing $\epsilon_u$ and $\phi$ (see picture above). For a linear mapping $g(\phi) = \theta\phi$, for example, the update to this weighting would be proportional to:

$\frac{\partial F}{\partial \theta} = \epsilon_u\phi$




### Free Energy


All the formulations so far have been an effort 