# Meta Bayes module

This is a short preview of the Meta Bayes module.
Meta Bayes strives to compute the optimal prior for a set of tasks.

It relies on the penalized regression formulation of the inner PAC-Bayesian algorithm:

$$\hat\theta =\arg\inf_\theta \tilde{S}_i(\theta, \theta_0) := \pi(\theta)[S_i] + \lambda \text{KL}(\pi(\theta), \pi(\theta_0))$$ 

Denoting by $A_i(\theta_0)$ the solution of task $i$ using prior $\theta_0$, the meta score can be written as

$$S^{meta}(\theta_0) = \sum_i S_{i}^{meta}(\theta_0), \qquad S_{i}^{meta}(\theta_0) = \tilde{S}_i(A_i(\theta_0), \theta_0).$$

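The meta learning algorithm uses gradient descent to minimize the meta score, relying on

$$\nabla S_i^{meta}(\theta_0) = \lambda \nabla F_i(\theta_0),$$
where $F_i(\theta) = \text{KL}(\pi(A_i(\theta_0)), \pi(\theta))$ is differentiated with respect to $\theta$ while the posterior $\pi(A_i(\theta_0))$ is held fixed. Because $A_i(\theta_0)$ minimizes $\tilde{S}_i(\cdot, \theta_0)$, the dependence of $A_i$ on $\theta_0$ does not contribute to the gradient at first order.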
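
As an illustration of the gradient formula, here is a minimal NumPy sketch for the special case where both distributions are Gaussian and only the prior mean is differentiated. The helper names `kl_gauss` and `grad_kl_prior_mean` are hypothetical and are not part of surpbayes; the analytic gradient $\Sigma_0^{-1}(\mu_0 - \mu_1)$ is checked against central finite differences.

```python
import numpy as np


def kl_gauss(mu1, cov1, mu0, cov0):
    """KL(N(mu1, cov1) || N(mu0, cov0)) for full-rank covariances."""
    d = len(mu1)
    prec0 = np.linalg.inv(cov0)
    diff = mu0 - mu1
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(prec0 @ cov1) + diff @ prec0 @ diff - d + logdet0 - logdet1)


def grad_kl_prior_mean(mu1, mu0, cov0):
    """Gradient of the KL above with respect to the prior mean mu0."""
    return np.linalg.solve(cov0, mu0 - mu1)


rng = np.random.default_rng(0)
d = 3
mu_post = rng.normal(size=d)
mu_prior = rng.normal(size=d)
a = rng.normal(size=(d, d))
cov_prior = a @ a.T + np.eye(d)  # symmetric positive definite
cov_post = 0.5 * np.eye(d)

# Central finite differences along each coordinate of the prior mean
eps = 1e-6
num_grad = np.array([
    (kl_gauss(mu_post, cov_post, mu_prior + eps * e, cov_prior)
     - kl_gauss(mu_post, cov_post, mu_prior - eps * e, cov_prior)) / (2 * eps)
    for e in np.eye(d)
])
ana_grad = grad_kl_prior_mean(mu_post, mu_prior, cov_prior)
print(np.max(np.abs(num_grad - ana_grad)))
```

Multiplying `ana_grad` by $\lambda$ gives the contribution of one task to the meta gradient in this restricted setting.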

The meta learning tasks considered here are simple tasks where Gaussian conjugation occurs (quadratic risks, Gaussian priors). This lets the algorithm run quickly in this demo.
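
To make the conjugation concrete, the sketch below computes the closed-form minimiser of $\pi[S] + \lambda \text{KL}(\pi, \pi_0)$ for $S(x) = \|x - c\|^2$ and a Gaussian prior $\pi_0 = N(\mu_0, \Sigma_0)$. The minimiser is the Gibbs posterior, with density proportional to $\exp(-S/\lambda)$ times the prior density, which is again Gaussian. The function `gibbs_posterior` is a hypothetical helper, not a surpbayes function, and it assumes that the `temperature` argument plays the role of $\lambda$.

```python
import numpy as np


def gibbs_posterior(mu0, cov0, center, temperature):
    """Gaussian minimiser of pi[S] + temperature * KL(pi, pi0),
    for S(x) = ||x - center||^2 and pi0 = N(mu0, cov0)."""
    d = len(mu0)
    prec0 = np.linalg.inv(cov0)
    # exp(-S / temperature) contributes a precision of (2 / temperature) * I
    prec = prec0 + (2.0 / temperature) * np.eye(d)
    cov = np.linalg.inv(prec)
    mean = cov @ (prec0 @ mu0 + (2.0 / temperature) * center)
    return mean, cov


mu0 = np.zeros(2)
cov0 = np.eye(2)
center = np.array([1.0, -1.0])
mean, cov = gibbs_posterior(mu0, cov0, center, temperature=0.1)
print(mean)  # pulled from the prior mean towards the task centre
print(cov)   # much tighter than the prior covariance
```

With a low temperature the risk term dominates, so the posterior mean sits close to the task centre and the posterior covariance shrinks.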

In [None]:
from surpbayes.meta_bayes import Task, MetaLearningEnv
from surpbayes.proba import GaussianMap, TensorizedGaussianMap, BlockDiagGaussMap
import numpy as np
# Choose dimension/Number of tasks
d = 4
n_tasks = 100
temperature = 0.1

# Generate tasks
def make_score(x):
    def score(xs):
        return ((x - xs) ** 2).sum(-1)

    return score


x0 = 0.5 + np.random.normal(0, 0.2, d)
x_middles = x0 + np.random.normal(0, 0.1, (n_tasks, d))

task_train = [
    Task(make_score(x_mid), temperature=temperature, vectorized=True) for x_mid in x_middles
]

x_middles_test = x0 + np.random.normal(0, 0.1, (10, d))

task_test = [
    Task(make_score(x_mid), temperature=temperature, vectorized=True) for x_mid in x_middles_test
]

# Define distribution family
proba_map = GaussianMap(d)

# Define Meta Learning Environment
mlearn = MetaLearningEnv(
    proba_map,
    list_task=task_train,
    # hyperparameters passed to training
    per_step=50,
    chain_length=2,
    kl_max=100.0,  # Maximum KL step between posterior estimations. Extremely large here to speed up computations.
    silent=True,  # Suppress printed output during each inner learning task
    n_max_eval=200,  # Maximum number of risk evaluations per task. Could be set even lower in this dummy setting.
    n_estim_weights=10**3,  # Number of samples generated to compute weights. Could be even lower here.
)

The meta learning algorithm is run through the "meta_learn" method, which uses SGD by default (use meta_learn_batch for non-stochastic gradient descent).
Once each task has been calibrated, the inner learning hyperparameters can be updated: there is usually no need to keep drawing as many parameters per task.
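
The outer loop can be pictured with a plain NumPy toy, independent of surpbayes. The sketch below learns only the prior mean, with a fixed isotropic prior covariance `sigma0_sq * I` (an assumption made for brevity), and uses the closed-form Gaussian posterior mean for each quadratic task. The names `posterior_mean`, `eta` and `mini_batch_size` are illustrative, and averaging the gradient over the mini batch is a choice made here, not necessarily what `MetaLearningEnv` does.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, temperature, sigma0_sq = 4, 100, 0.1, 1.0
centers = 0.5 + rng.normal(0, 0.1, (n_tasks, d))  # one quadratic task per row


def posterior_mean(mu0, center):
    """Posterior mean for S(x) = ||x - center||^2 and prior N(mu0, sigma0_sq * I)."""
    w_prior, w_data = 1.0 / sigma0_sq, 2.0 / temperature
    return (w_prior * mu0 + w_data * center) / (w_prior + w_data)


mu0 = np.zeros(d)
eta, mini_batch_size = 1.0, 25
for epoch in range(50):
    order = rng.permutation(n_tasks)
    for start in range(0, n_tasks, mini_batch_size):
        batch = centers[order[start:start + mini_batch_size]]
        post_means = posterior_mean(mu0, batch)  # broadcasts over the batch
        # lambda * grad of KL(posterior, prior) w.r.t. the prior mean, batch average
        grad = temperature * (mu0 - post_means).mean(0) / sigma0_sq
        mu0 = mu0 - eta * grad

print(mu0)
print(centers.mean(0))
```

In this toy setting the meta gradient is proportional to the gap between the prior mean and the task centres, so the learned prior mean settles near the average centre.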

In [None]:
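# First pass: calibrate every task once with the initial inner hyperparameters
mlearn.meta_learn(epochs=1, eta=2/temperature, kl_max=1.0, mini_batch_size=25)
# Tasks are now calibrated: draw fewer parameters per inner step
mlearn.hyperparams.update({"per_step": 20, "chain_length": 1})
mlearn.meta_learn(epochs=20, eta=1/temperature, kl_max=1.0, mini_batch_size=25)
# Use fewer samples for weight estimation to save more time
mlearn.hyperparams.update({"n_estim_weights": 10**2})
mlearn.meta_learn(epochs=180, eta=0.5/temperature, kl_max=0.2, mini_batch_size=50)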

In [None]:
from surpbayes.meta_bayes.test_assess import eval_meta_hist
# Evaluate every second stored meta parameter on the test tasks
hyperparams = {"per_step": 50, "chain_length": 1, "silent": True}
res = eval_meta_hist(mlearn.hist_meta.meta_params()[::2], task_test, proba_map=proba_map, hyperparams=hyperparams)

In [None]:
# 20% and 80% quantiles across test tasks, for each evaluated meta parameter
low_quant, high_quant = np.quantile(res, [0.2, 0.8], axis=1)
test_perf = res.mean(1)

In [None]:
import matplotlib.pyplot as plt
plt.fill_between(np.arange(len(low_quant)), low_quant, high_quant)
plt.plot(test_perf, color="black", linewidth=1)
plt.xlabel("Meta training steps")
plt.ylabel("Generalisation bound")

The evolution of each individual test task can also be inspected.

In [None]:
import matplotlib.pyplot as plt
for i in range(res.shape[1]):
    plt.plot(res[:, i], linewidth=1.0)
plt.xlabel("Meta training steps")
plt.ylabel("Generalisation bound")

## Covariance case

In [None]:
# Choose dimension/Number of tasks
d = 4
true_dim = 1
n_tasks_train = 50
n_tasks_test = 20
temperature = 0.1

# Generate tasks
def make_score(x):
    def score(xs):
        return ((x - xs) ** 2).sum(-1)

    return score


matrix = np.random.normal(0, 1, (true_dim, d))
matrix = matrix / np.sum(matrix **2)
x_middles = np.random.normal(0, 1.0, (n_tasks_train + n_tasks_test, true_dim)) @ matrix + np.random.normal(
    0, 0.01, (n_tasks_train + n_tasks_test, d)
)

list_task = [
    Task(make_score(x_mid), temperature=temperature, vectorized=True) for x_mid in x_middles
]
task_train = list_task[:n_tasks_train]
task_test = list_task[n_tasks_train:]

# Define distribution family
proba_map = GaussianMap(d)

# Define Meta Learning Environment
mlearn = MetaLearningEnv(
    proba_map,
    list_task=task_train,
    per_step=25,
    chain_length=1,
    n_estim_weights=3 * proba_map.t_shape[0],
    kl_max=1000.0,
    silent=True,
    n_max_eval=200,
)

In [None]:
# Launch training (either through meta_learn or meta_learn_batch)
mlearn.meta_learn_batch(epochs=5000, eta=0.2/temperature, kl_max=1.0, silent=True, kl_tol=10**-8)

In [None]:
plt.plot(mlearn.hist_meta.meta_scores()[800:])
plt.yscale("log")

Inspecting the eigenvalues of the prior covariance shows that the probability mass concentrates close to a 1D subspace.
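
For comparison, the same diagnostic can be applied directly to task centres generated near a one-dimensional subspace. This is a standalone NumPy sketch: it uses a unit-norm direction, which differs from the normalization used in the cell above, and it looks at the empirical covariance of the centres rather than at the learned prior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 500
direction = rng.normal(size=(1, d))
direction /= np.linalg.norm(direction)  # unit vector spanning the 1D subspace

# Points along the direction, plus small isotropic noise
points = rng.normal(size=(n, 1)) @ direction + rng.normal(0, 0.01, (n, d))

# Eigenvalues of the empirical covariance, in ascending order
eigvals = np.linalg.eigvalsh(np.cov(points, rowvar=False))
print(eigvals)
```

One eigenvalue dominates the others by several orders of magnitude, which is the signature to look for in the eigenvalues of the learned prior covariance.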

In [None]:
mlearn.proba_map(mlearn.prior_param).vals

## Future improvements

### Sample size for the inner task

In the current implementation, the inner algorithm evaluates a fixed number of parameters generated from the current posterior. This can slow the algorithm down significantly: once the space has been thoroughly explored, far fewer new points need to be evaluated than during the early stages. The number of new points should instead be chosen according to how well the current sample already explores the posterior.

Along the same lines, the positions of the evaluated samples could be optimized.

### Step size adaptation