<a href="https://colab.research.google.com/github/MiguelAguilera/Neuro-MaxEnt-inference-tutorial/blob/main/1.Introduction_to_MaxEnt_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the MaxEnt principle

## 1. A new kind of prior information

Most patterns in biology arise from the aggregation of many small processes. Variations in the dynamics of complex neural and biochemical networks depend on
numerous fluctuations in connectivity and flow through small-scale subcomponents of the network. Variations in cancer onset arise from variable failures in the many individual checks and balances on DNA repair, cell cycle control, and tissue homeostasis. Variations in the ecological distribution of species follow the myriad local differences in the birth and death rates of species and in the small-scale interactions between particular species.
In all such complex systems, we wish to understand how large-scale pattern
arises from the aggregation of small-scale processes. A single dominant principle sets the major axis from which all explanation of aggregation and scale must be developed. This dominant principle is the limiting distribution.



Imagine a class of problems in which our prior information consists of average
values of certain quantities. What is the least biased model consistent with that information?

The notion of ‘entropy’, as it originated in thermodynamics, is usually associated with that of ‘disorder’, with the former regarded as a measure of the latter. The word ‘disorder’ here essentially means ‘randomness’, ‘absence of patterns’, or something similar. While not incorrect, these words clearly require a more precise specification to be useful at a quantitative level. For a probability distribution $p_{\mathbf x}$ over states $\mathbf x$, that specification is the entropy

$$ S = - \sum_{\mathbf x} p_{\mathbf x} \log p_{\mathbf x}$$

where the total amount of probability is fixed:
$$ \sum_{\mathbf x} p_{\mathbf x}= 1$$
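
The definition above can be made concrete with a small numerical sketch. The helper name `shannon_entropy` and the three example distributions are illustrative assumptions, not part of the original tutorial; the natural logarithm is used, so entropy is measured in nats.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S = -sum_x p_x log p_x (natural log, in nats)."""
    p = np.asarray(p, dtype=float)
    assert np.isclose(p.sum(), 1.0), "probabilities must sum to one"
    nz = p > 0  # terms with p_x = 0 contribute nothing (0 * log 0 -> 0)
    return float(-np.sum(p[nz] * np.log(p[nz])))

# Hypothetical distributions over four states
uniform = np.full(4, 0.25)               # no state is preferred
peaked = np.array([0.7, 0.1, 0.1, 0.1])  # one state is favoured
certain = np.array([1.0, 0.0, 0.0, 0.0]) # outcome known in advance

S_uniform = shannon_entropy(uniform)     # equals log(4)
S_peaked = shannon_entropy(peaked)
S_certain = shannon_entropy(certain)     # equals 0
print(S_uniform, S_peaked, S_certain)
```

With no constraint other than normalization, the uniform distribution has the largest entropy of the three, and the fully determined one has zero entropy.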


> ![Entropy illustration](https://github.com/MiguelAguilera/Neuro-MaxEnt-inference-tutorial/blob/main/img/entropy.png?raw=true)
>
> Figure 1. The natural world is driven to maximum entropy states (i.e. maximum uncertainty)

The principle of maximum entropy can be expressed (see [Wikipedia](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy)) as:

> The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy. Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the proper one. … In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.




<div style="page-break-after: always;"></div>

## 2. Deriving the MaxEnt principle

### 2.1 Lagrange multiplier technique

The maximum entropy principle is a means of deriving probability distributions given certain constraints and the assumption of maximizing entropy. One technique for solving this maximization problem is the Lagrange multiplier technique.

Consider a multivariable function $f(\mathbf x)$ and a constraint of the form $g(\mathbf x)=c$, where $g$ is another multivariable function with the same input space as $f$ and $c$ is a constant.

To minimize (or maximize) $f$ subject to this constraint, follow these steps:

1. Introduce a new variable $\lambda$, called the Lagrange multiplier, and define a new function $\mathcal{L}$ of the form:

$$ \mathcal{L}(\mathbf x,\lambda) = f(\mathbf x)+\lambda(g(\mathbf x)-c)$$

2. Set the partial derivatives of $\mathcal{L}$ equal to zero:

$$ \frac{\partial \mathcal{L} (\mathbf x,\lambda)}{\partial x_i}=0, \,\,\,\forall i, \qquad \frac{\partial \mathcal{L} (\mathbf x,\lambda)}{\partial \lambda}=0$$

in order to find the critical points of $\mathcal{L}$. The second condition simply recovers the constraint $g(\mathbf x)=c$.

3. Evaluate $f$ at each resulting solution that satisfies the constraint; the solution with the smallest (or largest) value of $f$ is the minimum (or maximum) one is searching for.
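
As a minimal sketch of the three steps, consider a toy problem that is not part of the original text: maximize $f(x,y) = -(x^2+y^2)$ subject to $g(x,y) = x + y = 1$. Because $f$ is quadratic and $g$ is linear, the stationarity conditions form a linear system that can be solved directly. The variable names below are illustrative assumptions.

```python
import numpy as np

# Toy problem (illustrative assumption): maximize f(x, y) = -(x**2 + y**2)
# subject to g(x, y) = x + y = c, with c = 1.
c = 1.0

# Step 1: L(x, y, lam) = -(x**2 + y**2) + lam * (x + y - c)
# Step 2: setting the partial derivatives to zero gives a linear system:
#   dL/dx   = -2x + lam = 0
#   dL/dy   = -2y + lam = 0
#   dL/dlam = x + y - c = 0
A = np.array([[-2.0, 0.0, 1.0],
              [0.0, -2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, c])
x_opt, y_opt, lam = np.linalg.solve(A, b)

# Step 3: the single critical point satisfies the constraint; since f is
# concave, it is the constrained maximum.
print(x_opt, y_opt, lam)
```

The solution is $x = y = 1/2$ with $\lambda = 1$: the constraint forces the point onto the line $x+y=1$, and the multiplier balances the gradients of $f$ and $g$ there.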

### Application to the MaxEnt principle

Applied to the maximum entropy principle, the Lagrange multiplier technique is used to solve the constrained optimization problem

$$ \max_{p_{\mathbf x}} \; -\sum_{\mathbf x} p_{\mathbf x} \log p_{\mathbf x}$$
$$ \mathrm{s.t.} \qquad \sum_{\mathbf x}  p_{\mathbf x} f_a(\mathbf x) = c_a , \qquad \sum_{\mathbf x} p_{\mathbf x} =1$$



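### 2.2 Example: mean and variance of a distribution

Suppose the prior information fixes the first two moments of a scalar variable $x$, that is, $\sum_{x} p_{x} x = c_1$ and $\sum_{x} p_{x} x^2 = c_2$. The Lagrangian is

$$ \mathcal{L} =  -\sum_{x} p_{x} \log p_{x}  - \lambda_0 \left(\sum_{x}p_{x} - 1\right) + \lambda_1 \left(\sum_{x} p_{x} x - c_1\right) + \lambda_2 \left(\sum_{x} p_{x} x^2 - c_2\right)$$

Setting its derivative with respect to each $p_x$ to zero gives

$$  \frac{\partial\mathcal{L}}{\partial p_x} =  - 1 - \log p_{x}
- \lambda_0  +  \lambda_1 x + \lambda_2 x^2 =0 $$

Writing $\lambda_1 = \mu/\sigma^2$ and $\lambda_2 = -1/(2\sigma^2)$, with $\mu = c_1$ and $\sigma^2 = c_2 - c_1^2$, and completing the square yields a Gaussian:

$$ p_{x} \propto  \exp\left[ \lambda_1 x +  \lambda_2 x^2 \right]  \propto \exp\left[ -\frac{1}{2\sigma^2}(x-\mu)^2 \right] $$

This is a special case of the general MaxEnt form

$$ p_{\mathbf x} = \frac{1}{Z}\exp\left[\beta \sum_a \theta_a f_{\mathbf x,a} \right] $$

with features $f_{\mathbf x,1} = x$ and $f_{\mathbf x,2} = x^2$, multipliers $\beta\theta_a = \lambda_a$, and $Z$ the normalization constant.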
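
The link between the multipliers and the constrained moments can be checked numerically. The sketch below assumes illustrative target values $c_1 = 5$ and $c_2 = 26$ (so $\mu = 5$, $\sigma^2 = 1$), sets $\lambda_1 = \mu/\sigma^2$ and $\lambda_2 = -1/(2\sigma^2)$, and verifies on a grid that $p_x \propto \exp[\lambda_1 x + \lambda_2 x^2]$ reproduces both moments. The grid bounds and variable names are assumptions made for this example.

```python
import numpy as np

# Assumed target moments (illustrative): <x> = 5 and <x^2> = 26
c1, c2 = 5.0, 26.0
mu = c1
sigma2 = c2 - c1**2              # variance = <x^2> - <x>^2 = 1

# Multipliers obtained by completing the square in lam1 * x + lam2 * x**2
lam1 = mu / sigma2
lam2 = -1.0 / (2.0 * sigma2)

# Discretize x and normalize p_x proportional to exp(lam1 * x + lam2 * x**2)
x = np.linspace(-10.0, 20.0, 10001)
dx = x[1] - x[0]
log_w = lam1 * x + lam2 * x**2
w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
p = w / (w.sum() * dx)           # density normalized on the grid

# Recover the constrained moments by numerical integration
mean = dx * np.sum(p * x)
second_moment = dx * np.sum(p * x**2)
print(mean, second_moment)
```

The recovered values match $c_1$ and $c_2$ up to discretization error, confirming that these two multipliers encode exactly the mean and variance constraints.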

In [1]:
import ipywidgets as widgets
from ipywidgets import HBox, VBox
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline

In [80]:
@widgets.interact(
    m=(0., 10.), d=(0., 2., 0.02), sigma=(0.02, 1., 0.02))
def plot(m=3., d=1., sigma=0.4, grid=False):

    # Simulation parameters
    N = 10000
    xmin, xmax = -10., 20.
    ref_mean = 5   # reference (target) mean
    ref_std = 1    # reference (target) standard deviation

    fig, ax = plt.subplots(1, 1, figsize=(12, 6))

    # Probability density: equal mixture of two Gaussians centred at m+d and m-d
    x = np.linspace(xmin, xmax, N)
    dx = x[1] - x[0]  # grid spacing used for numerical integration
    norm = 1 / np.sqrt(2 * np.pi * sigma**2)
    p = 0.5 * norm * (np.exp(-0.5 * (x - m - d)**2 / sigma**2)
                      + np.exp(-0.5 * (x - m + d)**2 / sigma**2))
    ax.plot(x, p, lw=2)

    # Model observables (Riemann sums over the grid)
    mean = dx * np.sum(p * x)
    std = np.sqrt(dx * np.sum(p * x**2) - mean**2)

    # Entropy of the model, and of a Gaussian with the reference std
    # (the maximum attainable for that variance)
    inds = p > np.finfo(float).eps  # avoid log(0) terms
    entropy = -dx * np.sum(p[inds] * np.log(p[inds]))
    ref_entropy = 0.5 * (1 + np.log(2 * np.pi * ref_std**2))

    # Plot model observables: mean (red dashed), mean +/- 3 std (green dotted)
    top = np.max(p)
    ax.plot([mean, mean], [0, top * 1.1], 'r--', lw=2)
    ax.plot([mean + 3 * std, mean + 3 * std], [0, top * 1.1], 'g:', lw=2)
    ax.plot([mean - 3 * std, mean - 3 * std], [0, top * 1.1], 'g:', lw=2)

    # Plot reference observables as markers above the curve
    ax.plot([ref_mean], [top * 1.15], 'r*', lw=2)
    ax.plot([ref_mean + 3 * ref_std], [top * 1.15], 'g.', lw=2)
    ax.plot([ref_mean - 3 * ref_std], [top * 1.15], 'g.', lw=2)
    ax.grid(grid)
    ax.axis([0, 10, 0, top * 1.2])

    print('Mean:', round(mean, 2), 'Variance:', np.round(std**2, 2))
    print('Entropy:', np.round(entropy, 4),
          'Reference Gaussian entropy:', np.round(ref_entropy, 4))



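### Example: general constraints

For a general set of constraints $\sum_{\mathbf x} p_{\mathbf x} f_{\mathbf x,a} = c_a$, where $f_{\mathbf x,a} \equiv f_a(\mathbf x)$, the Lagrangian is

$$ \mathcal{L} =  -\sum_{\mathbf x} p_{\mathbf x} \log p_{\mathbf x}  - \varphi \left(\sum_{\mathbf x}p_{\mathbf x} - 1\right) + \beta \sum_a \theta_a\left(\sum_{\mathbf x} p_{\mathbf x} f_{\mathbf x,a}-c_a\right)$$

Setting its derivative with respect to each $p_{\mathbf x}$ to zero,

$$  \frac{\partial\mathcal{L}}{\partial p_{\mathbf x}} =  - 1 - \log p_{\mathbf x}
- \varphi  +  \beta \sum_a \theta_a f_{\mathbf x,a}=0 $$

gives

$$ p_{\mathbf x} \propto \exp\left[\beta \sum_a \theta_a f_{\mathbf x,a} - \varphi \right] $$

Absorbing the constant factors into the normalization $Z = \sum_{\mathbf x} \exp\left[\beta \sum_a \theta_a f_{\mathbf x,a} \right]$ leads to

$$ p_{\mathbf x} = \frac{1}{Z}\exp\left[\beta \sum_a \theta_a f_{\mathbf x,a} \right] $$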
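
In practice the multipliers $\theta_a$ must be chosen so that the constraints hold. A minimal sketch, assuming a single feature and $\beta = 1$: a six-sided die whose average roll is constrained to 4.5 (a classic illustration, not taken from this tutorial). Because the model mean increases monotonically with $\theta$, simple bisection finds the multiplier. The function names `maxent_distribution` and `fit_theta` are hypothetical.

```python
import numpy as np

def maxent_distribution(f, theta):
    """p_x = exp(theta * f_x) / Z for a single feature f (beta = 1 assumed)."""
    logits = theta * f
    w = np.exp(logits - logits.max())  # stabilized exponentials
    return w / w.sum()                 # dividing by Z enforces normalization

def fit_theta(f, c, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on theta so that sum_x p_x f_x = c (the mean grows with theta)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if maxent_distribution(f, mid) @ f < c:
            lo = mid   # model mean too small: increase theta
        else:
            hi = mid   # model mean too large: decrease theta
    return 0.5 * (lo + hi)

faces = np.arange(1, 7, dtype=float)   # outcomes of a six-sided die
theta = fit_theta(faces, c=4.5)        # constraint: average roll of 4.5
p = maxent_distribution(faces, theta)
print(theta, p, p @ faces)
```

Since the target mean exceeds the fair-die value of 3.5, the fitted $\theta$ is positive and the probabilities increase exponentially with the face value, while remaining as spread out as the constraint allows.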


### References

J. Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics. Oxford University Press, 2011

E. Montroll, On the entropy function in sociotechnical systems, PNAS, vol. 78 no. 12, 1981

S. Frank, The common patterns of nature

Khan Academy, Lagrange multipliers, introduction, 2019. https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-single-constraint