## Bayes module

### Variational inference

Given a score $S$, a parametric family of distributions $(\nu_{\theta})_{\theta \in \Theta}$ and a prior distribution $\pi$, we consider the variational problem

$$\hat{\theta} = \arg\inf_{\theta \in \Theta} \; \nu_{\theta}[S] + \lambda \, \mathrm{KL}(\nu_{\theta}, \pi).$$

The function variational_inference is designed to tackle such problems in the setting where the prior belongs to the parametric family, i.e. $\pi = \nu_{\theta_0}$ for some $\theta_0 \in \Theta$. This restriction makes it possible to benefit from closed form expressions, when they exist, for the Kullback--Leibler divergence and its derivative.
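
For Gaussian families, the Kullback--Leibler term has a well-known closed form. The sketch below is independent of aduq and only illustrates the objective being minimized. The helper names `gauss_kl` and `vi_objective` are hypothetical, and the expectation $\nu_{\theta}[S]$ is estimated by plain Monte Carlo, which is not necessarily how the library proceeds.

```python
import numpy as np


def gauss_kl(mean0, cov0, mean1, cov1):
    """Closed-form KL(N(mean0, cov0), N(mean1, cov1))."""
    dim = len(mean0)
    inv_cov1 = np.linalg.inv(cov1)
    diff = mean1 - mean0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (
        np.trace(inv_cov1 @ cov0) + diff @ inv_cov1 @ diff - dim + logdet1 - logdet0
    )


def vi_objective(score, mean, cov, prior_mean, prior_cov, temperature,
                 n_sample=10_000, seed=0):
    """Monte Carlo estimate of nu[S] + temperature * KL(nu, prior)."""
    rng = np.random.default_rng(seed)
    sample = rng.multivariate_normal(mean, cov, size=n_sample)
    kl = gauss_kl(mean, cov, prior_mean, prior_cov)
    return np.mean(score(sample)) + temperature * kl


def toy_score(x):
    # Same toy score as in the notebook, evaluated on an array of shape (n, 2)
    return (x @ np.array([0, 1]) - 1) ** 2 + 10 * (x @ np.array([1, -1])) ** 2


# Objective at the prior itself: the KL term vanishes, only the mean score remains
value = vi_objective(
    toy_score, np.zeros(2), np.eye(2), np.zeros(2), np.eye(2), temperature=0.1
)
print(value)
```

The `temperature` argument plays the role of $\lambda$: the larger it is, the more the minimizer is pulled back towards the prior.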

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from aduq.bayes import iter_prior, variational_inference, iter_prior_vi
from aduq.proba import GaussianMap, TensorizedGaussianMap

# For plotting purposes
from math import pi
angles = np.linspace(0, 2 * pi, 1000)
circle = np.array([np.cos(angles), np.sin(angles)])

def half_cov(cov):
    vals, vects = np.linalg.eigh(cov)
    return (np.sqrt(vals) * vects) @ vects.T

def repr_gauss(mean, cov, rad=1.0):
    loc_circle = circle.copy()
    return mean + rad * (half_cov(cov) @ loc_circle).T

# Toy score function
def score(x):
    return (x @ np.array([0, 1]) - 1) ** 2 + 10 * (x @ np.array([1, -1])) **2

We need to define the space of probability distributions on which we wish to optimize. Here we consider a score defined on a two-dimensional space, and therefore use Gaussian distributions on $\mathbb{R}^2$. The prior will be the standard Gaussian distribution $\mathcal{N}(0, I_2)$.

It is normal behavior for the optimisation procedure to raise some ProbaBadGrad warnings. These indicate that a problematic gradient estimate was rejected because it significantly degraded the score. There is no need to worry about them.

In [None]:
gauss_map = GaussianMap(2)

# We define the prior as the reference gaussian distribution, i.e. N(0,Id)
prior_param = gauss_map.ref_param

# To solve the variational inference problem, we use the variational_inference function.
opt_res = variational_inference(
    score, gauss_map,
    prior_param=prior_param,
    temperature=.1, # the lambda term in the variational inference problem
    per_step=160,
    VI_method='corr_weights',
    gen_decay=np.log(1.1),
    k = 160 * 20,
    parallel=False,
    vectorized=True,
    print_rec=100, chain_length=501,
    refuse_conf=.95,
    momentum=.95, eta=0.6, silent=False)

# It is normal behavior that the optimisation procedure raises some ProbaBadGrad warnings.
# These indicate that a problematic gradient estimation was rejected as it damaged significantly
# the score. No need to worry about those.

# We can access the parameter describing the posterior through the opti_param attribute
post_param = opt_res.opti_param

In [None]:
# The optimisation starts by modifying the covariance

for i, param in enumerate(opt_res.hist_param[:13:2]):
    if i % 1 == 0:
        distr = gauss_map(param)
        distr_repr = repr_gauss(distr.means, distr.cov)
        plt.plot(distr_repr[:,0], distr_repr[:,1], color='black', linewidth=0.2)

In [None]:
# The distribution then shifts towards the correct mean value
for i, param in enumerate(opt_res.hist_param[20:160:20]):
    if i % 1 == 0:
        distr = gauss_map(param)
        distr_repr = repr_gauss(distr.means, distr.cov)
        plt.plot(distr_repr[:,0], distr_repr[:,1], color='black', linewidth=0.2)

In [None]:
# The last steps then adjust the distribution
for i, param in enumerate(opt_res.hist_param[150:200:2]):
    if i % 1 == 0:
        distr = gauss_map(param)
        distr_repr = repr_gauss(distr.means, distr.cov)
        plt.plot(distr_repr[:,0], distr_repr[:,1], color='black', linewidth=0.2)

In [None]:
# The evolution of the VI score can also be tracked:
plt.plot(opt_res.hist_score)
plt.yscale("log")

Under the hood, variational_inference can redirect to two routines, selected through the VI_method argument: either "corr_weights" or "KNN". The name refers to the method used to make the most of the evaluations of the score function.

The 'variational_inference' function was designed for situations where evaluating the 'score' is rather expensive. At its core, it remains an accelerated gradient descent algorithm. The difference is that the expression of the gradient involves an expectation with respect to the current distribution. The naive approach draws i.i.d. samples from the current distribution to obtain an unbiased estimate of this expectation. It is improved upon by recycling samples from previous steps: provided the optimization steps are small (small 'eta' parameter), these samples were generated from distributions similar to the current one.
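
To make the naive approach concrete, here is a minimal sketch of a Monte Carlo gradient estimate based on the identity $\nabla_{\theta} \nu_{\theta}[S] = \nu_{\theta}[S \, \nabla_{\theta} \log p_{\theta}]$, where $p_{\theta}$ denotes the density of $\nu_{\theta}$. It only differentiates with respect to the mean of an isotropic Gaussian. The function `grad_mean_estimate` is a hypothetical helper, not part of aduq, and the estimator actually used by the library may differ.

```python
import numpy as np


def grad_mean_estimate(score, mean, std, n_sample, rng):
    """Estimate the gradient in `mean` of E[score(X)] for X ~ N(mean, std^2 I)."""
    sample = rng.normal(mean, std, size=(n_sample, len(mean)))
    # Gradient of the log density of N(mean, std^2 I) with respect to the mean
    grad_log_dens = (sample - mean) / std**2
    return np.mean(score(sample)[:, None] * grad_log_dens, axis=0)


def sq_norm(x):
    # Toy score: E[|X|^2] = |mean|^2 + dim * std^2, so the exact gradient is 2 * mean
    return np.sum(x**2, axis=1)


rng = np.random.default_rng(0)
mean = np.array([1.0, -0.5])
estimate = grad_mean_estimate(sq_norm, mean, std=1.0, n_sample=200_000, rng=rng)
print(estimate, 2 * mean)  # the estimate should be close to the exact gradient
```

A large number of fresh samples is needed for a stable estimate here, which is precisely the cost that recycling previous samples aims to reduce.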

As these samples cannot be used directly, two procedures are proposed. "corr_weights" gives each sample a weight correcting for the difference between its probability of being drawn under the current distribution and under the previous one. "KNN" constructs a surrogate score using a K-nearest neighbors algorithm, then evaluates this surrogate on a large number of samples to compute the derivative.
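
The idea behind "corr_weights" can be illustrated with standard importance weighting. The sketch below is an assumption-laden illustration rather than the library's implementation: `gauss_log_dens` and `reweighted_mean` are hypothetical helpers, the distributions are isotropic Gaussians, and the weights are self-normalised.

```python
import numpy as np


def gauss_log_dens(x, mean, std):
    """Log density of N(mean, std^2 I) evaluated at each row of x."""
    dim = x.shape[1]
    sq_dist = np.sum((x - mean) ** 2, axis=1)
    return -0.5 * sq_dist / std**2 - dim * np.log(std) - 0.5 * dim * np.log(2 * np.pi)


def reweighted_mean(values, sample, old_mean, new_mean, std):
    """Estimate the mean of `values` under the new distribution from old samples."""
    log_w = gauss_log_dens(sample, new_mean, std) - gauss_log_dens(sample, old_mean, std)
    weights = np.exp(log_w)
    return np.sum(weights * values) / np.sum(weights)  # self-normalised weights


rng = np.random.default_rng(0)
old_mean, new_mean, std = np.zeros(2), np.array([0.2, -0.1]), 1.0
sample = rng.normal(old_mean, std, size=(100_000, 2))  # drawn at a previous step
values = sample[:, 0]  # stand-in for stored score evaluations
recycled = reweighted_mean(values, sample, old_mean, new_mean, std)
print(recycled)  # should be close to new_mean[0], without any new sample
```

The weights stay well behaved only while the two distributions are close, which is why small optimization steps matter.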

The total number of samples used when evaluating the derivative is controlled by the argument 'k'. By default it is None, in which case all samples are used.

For "corr_weights", it is possible and advisable to set the 'gen_decay' parameter higher than 0 (the default value). This parameter gives decreasing weights to older generations when computing the derivative. While the most recent generations tend to be close to the current distribution, older ones may no longer be representative, and could have a negative impact on the derivative estimate. The higher 'gen_decay', the lower the influence of older generations: exponentially decreasing weights $\exp(-t \times gen\_decay)$ are used, where $t$ is the age of the generation.
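
As a quick numerical illustration of the weights $\exp(-t \times gen\_decay)$, the sketch below computes them for 20 generations with gen_decay = log(1.1), the value used in the first call above. The helper `generation_weights` is hypothetical, and normalising the weights to sum to one is an assumption made here for readability.

```python
import numpy as np


def generation_weights(n_gen, gen_decay):
    """Normalised weights exp(-t * gen_decay) for generations of age t = 0, ..., n_gen - 1."""
    ages = np.arange(n_gen)
    weights = np.exp(-ages * gen_decay)
    return weights / weights.sum()


weights = generation_weights(n_gen=20, gen_decay=np.log(1.1))
# Ratio between the weight of a generation and that of the next more recent one
ratios = weights[1:] / weights[:-1]
print(weights[0], weights[-1], ratios[0])
```

With gen_decay = log(1.1), each generation weighs 1.1 times less than the one following it, and gen_decay = 0 gives equal weight to all generations.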

For "KNN", the number of neighbors used by the K-nearest neighbors algorithm is NOT controlled by the argument 'k', but by 'n_neighbors'. As stated above, 'k' controls the number of samples used. By default, 'n_neighbors' is 5.
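
The surrogate idea can be sketched with a plain K-nearest neighbors regressor written in NumPy. This is only an illustration under simplifying assumptions (Euclidean distance, unweighted average of the neighbors), and `knn_surrogate` is a hypothetical helper, not the routine used by aduq.

```python
import numpy as np


def knn_surrogate(train_x, train_score, n_neighbors=5):
    """Return a function predicting the score as the mean over the nearest evaluated points."""

    def predict(x):
        # Pairwise squared Euclidean distances, shape (len(x), len(train_x))
        sq_dist = np.sum((x[:, None, :] - train_x[None, :, :]) ** 2, axis=-1)
        nearest = np.argsort(sq_dist, axis=1)[:, :n_neighbors]
        return train_score[nearest].mean(axis=1)

    return predict


rng = np.random.default_rng(0)
train_x = rng.normal(size=(500, 2))        # points where the true score was evaluated
train_score = np.sum(train_x**2, axis=1)   # stored evaluations of a toy score
surrogate = knn_surrogate(train_x, train_score, n_neighbors=5)

new_x = rng.normal(scale=0.5, size=(2000, 2))  # many cheap surrogate evaluations
approx = surrogate(new_x)
print(approx[:3], np.sum(new_x[:3] ** 2, axis=1))
```

Here 'n_neighbors' sets how many stored evaluations are averaged for each prediction, while the number of stored evaluations themselves (500 in this sketch) corresponds to what 'k' limits.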


The 'corr_weights' method has the edge in most cases. For instance, 'KNN' by design struggles in situations where the Hessian near the minimum has eigenvalues of very different magnitudes, which is the case for the Rosenbrock-like toy score tested here. This could be improved upon by learning the distance used in 'KNN', or by training different surrogates.

In [None]:
# For comparison, variational_inference with KNN method

opt_res = variational_inference(
    score, gauss_map,
    prior_param=prior_param,
    temperature=.1,
    per_step=160,
    VI_method='KNN',
    k = None,
    parallel=False, print_rec=20, chain_length=600,
    vectorized=True,
    momentum=.99, eta=0.1, silent=True)

end_distr = gauss_map(opt_res.opti_param)

print(f"Mean score of estimated posterior: {end_distr.integrate(score, n_sample = 1000)}")

# The evolution of the VI score can also be tracked:
plt.plot(opt_res.hist_score)

### Iter prior procedure

The iterated prior procedure is not a Bayesian technique at all. It is actually an optimisation routine with a Bayesian flavor.

The goal is to minimize a score function $S(x)$. To do so, parameters are drawn from a distribution. The distribution for the next generation is then obtained by centering around the best parameter found so far, and by using the top parameters found so far to construct the covariance. Each dimension of the parameter is drawn independently from a Gaussian distribution, so that the covariance is diagonal and can be defined using the empirical standard deviations of the top parameters.
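
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the aduq implementation: `iter_prior_sketch` is a hypothetical function whose argument names merely mirror those of iter_prior, and options such as 'frac_sparse' are ignored.

```python
import numpy as np


def iter_prior_sketch(score, mean, std, gen_per_step, chain_length, keep, seed=0):
    """Toy version of the iterated prior loop with independent Gaussian coordinates."""
    rng = np.random.default_rng(seed)
    kept_x = np.empty((0, len(mean)))
    kept_score = np.empty(0)
    best_hist = []
    for _ in range(chain_length):
        sample = rng.normal(mean, std, size=(gen_per_step, len(mean)))
        all_x = np.concatenate([kept_x, sample])
        all_score = np.concatenate([kept_score, score(sample)])
        order = np.argsort(all_score)[:keep]  # indices of the top parameters so far
        kept_x, kept_score = all_x[order], all_score[order]
        mean = kept_x[0]          # centre on the best parameter found so far
        std = kept_x.std(axis=0)  # diagonal covariance from the top parameters
        best_hist.append(kept_score[0])
    return kept_x[0], std, best_hist


def toy_score(x):
    return (x @ np.array([0, 1]) - 1) ** 2 + 10 * (x @ np.array([1, -1])) ** 2


best, std, best_hist = iter_prior_sketch(
    toy_score, mean=np.zeros(2), std=np.ones(2),
    gen_per_step=200, chain_length=30, keep=50,
)
print(best, best_hist[-1])
```

Since the best parameter is always kept, the best score can only decrease from one generation to the next.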

In [None]:
# The initial prior_param is a parameter for the TensorizedGaussianMap.
ini_prior = np.zeros((2,2))
ini_prior[1] = 1.0

opt_res = iter_prior(score, ini_prior_param = ini_prior, gen_per_step=800, chain_length=50, keep=100, frac_sparse=0.0, parallel=False)

# The opti_param attribute of opt_res gives the parameter of a distribution, NOT a point in the score's input space
opti_distr_param = opt_res.opti_param

# The optimal parameter can still be found:
opti_param = opt_res.full_sample[0]
print(opti_param)

The technique used in iter prior can still be useful in the context of variational inference, to quickly construct a good initial distribution. The function iter_prior_vi is designed precisely for that purpose.

In [None]:
opt_res = iter_prior_vi(
    score,
    prior_param = ini_prior, temperature=0.1, gen_per_step=800, chain_length=50, keep=100, frac_sparse=0.0,
    parallel=False, vectorized=True)

# The opti_param attribute of opt_res gives the parameter of a distribution, NOT a point in the score's input space
opti_distr_param = opt_res.opti_param

start_post = np.zeros((3,2))

start_post[0] = opti_distr_param[0]
start_post[1:] = np.diag(opti_distr_param[1])

opt_res = variational_inference(
    score, gauss_map,
    prior_param=prior_param,
    post_param=start_post,
    temperature=.1,
    per_step=160,
    VI_method='corr_weights',
    gen_decay=np.log(1.2),
    k = 160 * 20,
    parallel=False,
    vectorized=True,
    print_rec=2, chain_length=50,
    refuse_conf=.95,
    momentum=.95, eta=0.1, silent=False)


In [None]:
plt.plot(opt_res.hist_score)

## Uniform priors - Gaussian computations

The proba module offers a class of distributions on the hypercube. These distributions admit a Gaussian-like interpretation when they are sufficiently concentrated, and their Kullback--Leibler divergences can be computed exactly.
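
One construction with these properties is to push a Gaussian distribution through the standard normal cumulative distribution function, applied coordinate by coordinate. Whether GaussHypercubeMap is built this way is an assumption made here for illustration; the text above does not spell it out. Under this construction, the standard Gaussian is mapped to the uniform distribution on the hypercube, and since the Kullback--Leibler divergence is invariant under bijective transforms, it coincides with the closed-form Gaussian one. The helper `sample_hypercube` below is hypothetical.

```python
import math

import numpy as np

# Standard normal cumulative distribution function, applied elementwise
norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))


def sample_hypercube(mean, std, n_sample, rng):
    """Draw Gaussian samples and map them to the unit hypercube."""
    gauss_sample = rng.normal(mean, std, size=(n_sample, len(mean)))
    return norm_cdf(gauss_sample)


rng = np.random.default_rng(0)
# N(0, I) is mapped to the uniform distribution on [0, 1]^2
uniform_like = sample_hypercube(np.zeros(2), np.ones(2), 50_000, rng)
# A concentrated Gaussian stays concentrated, around norm_cdf(mean)
concentrated = sample_hypercube(np.array([0.3, -0.2]), 0.05 * np.ones(2), 50_000, rng)
print(uniform_like.mean(axis=0), uniform_like.var(axis=0))
print(concentrated.mean(axis=0), concentrated.std(axis=0))
```

For a concentrated distribution the transform is close to linear around the mean, which is what gives the Gaussian-like interpretation.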

In [None]:
from aduq.proba.gauss import GaussHypercubeMap

dim = 2

# Toy score function
def score(x):
    return (x @ np.array([1.0, 0.0], dtype=np.float64) - .6) ** 2 + 20 * (x @ np.array([1.0, -1.0], dtype=np.float64)) **2

pmap = GaussHypercubeMap(2)

In [None]:
opt_res = variational_inference(
    score, pmap,
    temperature=.1, # the lambda term in the variational inference problem
    per_step=160,
    VI_method='corr_weights',
    gen_decay=np.log(1.3),
    k = 160 * 30,
    parallel=False,
    vectorized=True,
    print_rec=10,
    chain_length=201,
    refuse_conf=.95,
    momentum=.95, eta=0.4, silent=False)

The posterior can adapt to scores with strong identifiability issues, such as the Rosenbrock function, since the distributions in this family can exhibit a strong correlation structure.

In [None]:
import seaborn as sns
proba = pmap(opt_res.opti_param)
# The density of the distribution is evaluated through dens (its log density can be accessed through log_dens)
x_axis_labels = np.linspace(10**-4,1 - 10** -4, 121)
y_axis_labels = np.linspace(10**-4,1- 10 ** -4, 121)

values = np.array(np.meshgrid(y_axis_labels, x_axis_labels)).T
z = proba.dens(values)

sns.heatmap(z, xticklabels=x_axis_labels, yticklabels=y_axis_labels)
plt.title("Posterior distribution")
plt.xticks([])
plt.yticks([])

In [None]:
plt.plot(opt_res.hist_score, label="Evolution of the VI score")
plt.yscale("log")
plt.legend()
plt.show()