# 1. General problem statement

### What do we look for?

We look for the optimal design $d^*$ such that:
\begin{equation}
 d^* = \arg \max_{d}{U(d)}
\end{equation}
where $U(d)$ is a function that returns the 'utility' of the design, i.e., how much information it is expected to provide for parameter estimation or for model comparison.


### Notation / Core entities

$\mathcal{D}$: the design space. It can be discrete or continuous depending on the application, and it can be multi-dimensional. We can also think of it as the 'query space'.

Example: in the case of the 'exponential decay' model, it is the delay between the initial presentation and the recall testing (for instance, a number of seconds).

$\mathcal{M}$: the model space. It is discrete.

Example: considering memory models, a possible model space is $\{POW, EXP\}$, with $POW$ a power law model of forgetting, such that the probability of recall is:
\begin{equation}
p(\delta) = \alpha \delta^{- \beta}
\end{equation}

and $EXP$, an exponential decay model of forgetting, such that the probability of recall after a delay $\delta$ is:
\begin{equation}
p(\delta) = \alpha e^{-\beta \delta}
\end{equation}
with $\alpha$ and $\beta$, the free parameters.
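
As a minimal sketch, the two recall functions can be written in Python with NumPy. The function names `p_recall_pow` and `p_recall_exp` and the parameter values below are illustrative assumptions, not taken from an existing implementation. Note that the power law only yields a valid probability when $\alpha \delta^{-\beta} \leq 1$, which holds here because $\delta \geq 1$ and $\alpha \leq 1$.

```python
import numpy as np


def p_recall_pow(delta, alpha, beta):
    """Power-law forgetting: p(delta) = alpha * delta^(-beta)."""
    return alpha * np.power(delta, -beta)


def p_recall_exp(delta, alpha, beta):
    """Exponential forgetting: p(delta) = alpha * exp(-beta * delta)."""
    return alpha * np.exp(-beta * delta)


# Hypothetical delays (in seconds) and parameter values, for illustration only.
delays = np.array([1.0, 10.0, 100.0])
p_pow = p_recall_pow(delays, alpha=0.9, beta=0.5)
p_exp = p_recall_exp(delays, alpha=0.9, beta=0.05)
```

Both functions accept scalars or arrays, so the same code can evaluate a whole grid of designs at once.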

$\Theta_m$: the parameter space for the model $m \in \mathcal{M}$. It is $x$-dimensional, $x$ being the number of free parameters of the model $m$.

$Y$: the observation space. It can be discrete or continuous, and it can be multi-dimensional.

Example: considering memory models, it can be $\{0, 1\}$, with $1$, a successful recall, and $0$ a missed recall.


# 2. Parameter inference

### Define the utility function

Based on Cavagnaro (2009) https://pubmed.ncbi.nlm.nih.gov/20028226/ and Myung (2013) https://pubmed.ncbi.nlm.nih.gov/23997275/, we define a utility function $U$ as the mutual information of $\Theta$ and $Y\mid d$:
\begin{equation}
    U(d) = I(\Theta; Y\mid d)
\end{equation}
with $Y$ being the possible observations (e.g., behavioral outputs).

### Alternative definitions

1. It can be based on the Kullback–Leibler divergence between the posterior and the prior. For instance, following Ouyang et al. (2016) https://arxiv.org/pdf/1608.05046.pdf:

\begin{equation}
U(d) = \mathbb{E}_{p(y \mid d)} D_{KL}(P(\Theta | d, y), P(\Theta)) 
\end{equation}

Equivalently, writing the expectation as an integral over the outcomes:
\begin{equation}
U(d) = \int p(y \mid d) D_{KL}(P(\Theta | d, y), P(\Theta)) d y
\end{equation}
Note: Formulation here probably needs correction/adaptation.


2. It can be based only on the entropy of the posterior:

\begin{equation}
U(d) = \int p(y \mid d) H(\Theta \mid d, y) d y 
\end{equation}
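Note: Formulation here probably needs correction/adaptation. In particular, since a lower posterior entropy means a more informative experiment, this quantity would have to be minimized (or its negative maximized) to be consistent with $d^* = \arg \max_{d}{U(d)}$.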

For example: Candela et al. (2018) https://www.ri.cmu.edu/publications/automatic-experimental-design-using-deep-generative-models-of-orbital-data/

### Bayesian revision of belief regarding the parameters
Supposing that an experiment was carried out with design $d^*$, and an outcome $y_t$ was observed, the prior over $\theta \in \Theta$ at $t$, $p_t(\theta)$, is revised the following way:
\begin{equation}
     p_{t+1}(\theta) = \dfrac{
      p(y_t \mid \theta, d^*) p_t(\theta)
     }{
        \int p(y_t \mid \theta, d^*) p_t(\theta) d\theta
     }
\end{equation}
with $y_t \in Y$, the observation at $t$, $d^*\in \mathcal{D}$, the design chosen.
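
On a discretized parameter space, the integral in the denominator becomes a sum and the update reduces to a few lines. The following is a minimal sketch under assumptions that are not in the text above: the $EXP$ model with $\alpha$ fixed to 0.9, a uniform prior on a grid over $\beta$, and a hypothetical helper named `update_parameter_belief`.

```python
import numpy as np


def update_parameter_belief(prior, likelihood):
    """Return p_{t+1}(theta) given p_t(theta) and p(y_t | theta, d*) on the same grid."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()


# Assumed setting: EXP model, alpha fixed, grid over beta, uniform prior.
alpha = 0.9
beta_grid = np.linspace(0.01, 0.5, 50)
prior = np.full(beta_grid.size, 1.0 / beta_grid.size)

d_star = 10.0  # chosen delay (seconds)
y_t = 1        # observed outcome: 1 = successful recall, 0 = missed recall

# Bernoulli likelihood of the observed outcome for every beta on the grid.
p_success = alpha * np.exp(-beta_grid * d_star)
likelihood = p_success if y_t == 1 else 1.0 - p_success

posterior = update_parameter_belief(prior, likelihood)
```

After a successful recall, the posterior puts more mass on small values of $\beta$ (slow forgetting) than the prior did.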

### Exchanging definitions

Still based on Myung et al. (2013), a way to define $U$ in the case of parameter estimation is:

\begin{equation}
U(d) = \int \int u(d, \theta, y) p(y \mid \theta, d) p(\theta) d\theta dy
\end{equation}
If $U(d) = I(\Theta; Y\mid d)$, then $u(d, \theta, y) = \log \left( \frac{p (\theta \mid y, d)}{p(\theta)} \right)$,
which means:

\begin{equation}
U(d) = I(\Theta; Y\mid d) = \int \int \log \left( \frac{p (\theta \mid y, d)}{p(\theta)} \right) p(y \mid \theta, d) p(\theta) d\theta dy
\end{equation}

Indeed:
\begin{equation}
\begin{split}
U(d) &= I(\Theta; Y\mid d) \\
&= H(\Theta) - H(\Theta \mid Y, d)\\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
H(\Theta) &= - \int p(\theta) \log p(\theta) d\theta \\
&= - \int p(\theta) \log p(\theta) \int p(y \mid \theta, d) dy d\theta \\
&= - \int \int p(\theta) \log p(\theta) p(y \mid \theta, d) dy d\theta \\
&= - \int \int p(\theta) p(y \mid \theta, d) \log p(\theta) dy d\theta \\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
H(\Theta \mid Y, d) &= - \int p(y \mid d) \int p( \theta \mid y, d) \log p(\theta \mid y, d) d\theta dy \\
&=  - \int p(y \mid d) \int p(\theta) p( y \mid \theta, d) \frac{1}{p(y \mid d)}   \log p(\theta \mid y, d) d\theta dy \\
&=  - \int \int p(y \mid d) p(\theta) p( y \mid \theta, d) \frac{1}{p(y \mid d)}   \log p(\theta \mid y, d) d\theta dy \\
&= - \int \int p(\theta) p( y \mid \theta, d)   \log p(\theta \mid y, d) dy d\theta \\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
U(d) &= \int \int \left[ - p(\theta) \log p(\theta) p(y \mid \theta, d) + p(\theta) p( y \mid \theta, d)   \log p(\theta \mid y, d) \right] dy d\theta \\
&= \int \int \left[ - p(\theta) p(y \mid \theta, d) \log p(\theta) + p(\theta) p( y \mid \theta, d)   \log p(\theta \mid y, d) \right] dy d\theta \\
&= \int \int p(\theta) p(y \mid \theta, d) [ \log p (\theta \mid y, d) - \log p(\theta) ] dy d\theta \\
&= \int \int p(\theta) p(y \mid \theta, d) \log \frac{p (\theta \mid y, d)}{p(\theta)} dy d\theta \\
\end{split}
\end{equation}
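
The equivalence derived above can be checked numerically on a small discrete example. This is only a sanity-check sketch: the three parameter values, their prior, and the outcome probabilities are made-up numbers, and a single fixed design is assumed.

```python
import numpy as np

# Hypothetical toy setting: 3 parameter values, binary outcome, one fixed design d.
p_theta = np.array([0.2, 0.5, 0.3])        # p(theta)
p_y1 = np.array([0.9, 0.5, 0.1])           # p(y = 1 | theta, d)
lik = np.stack([1.0 - p_y1, p_y1], axis=1)  # lik[i, y] = p(y | theta_i, d)

joint = p_theta[:, None] * lik              # p(theta) p(y | theta, d)
p_y = joint.sum(axis=0)                     # p(y | d)
post = joint / p_y                          # post[i, y] = p(theta_i | y, d)

# Form 1: U(d) = H(Theta) - H(Theta | Y, d)
h_theta = -np.sum(p_theta * np.log(p_theta))
h_theta_given_y = -np.sum(joint * np.log(post))
u_entropy = h_theta - h_theta_given_y

# Form 2: U(d) = sum over theta and y of
#         p(theta) p(y | theta, d) log[p(theta | y, d) / p(theta)]
u_log_ratio = np.sum(joint * np.log(post / p_theta[:, None]))
```

Both `u_entropy` and `u_log_ratio` should agree up to floating-point error.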

### Implementation

Implementation follows what has been done in: https://github.com/adopy/adopy

In the implementation, we can use the fact that the mutual information is symmetric. Indeed, for any pair of random variables $X$ and $Y$, we have:

\begin{equation}
I(X; Y) = H(X) - H(X \mid Y) = I(Y; X) = H(Y) - H(Y \mid X)
\end{equation}

which means that in our case, we have:
\begin{equation}
I(\Theta; Y \mid d) = I(Y; \Theta \mid d) = H(Y \mid d) - H(Y \mid \Theta, d)
\end{equation}

The marginal entropy is:

\begin{equation}
   H(Y \mid d) = - \int p(y \mid d) \log p(y \mid d) dy
\end{equation}

In our implementation, we use log probabilities instead of probabilities, and $Y$ is a discrete space:
\begin{equation}
   H(Y \mid d) =  - \sum_{y \in Y} \exp[\log p(y \mid d)] \log p(y \mid d)
\end{equation}

The conditional entropy of $Y$ given the parameter random variable $\Theta$ and design $d$ is:
\begin{equation}
        H(Y \mid \Theta, d) = - \int p(\theta) \int p(y\mid \theta, d) \log p(y\mid \theta, d) dyd\theta
\end{equation}

In our implementation, we use log probabilities instead of probabilities and $Y$ is a discrete space. Since we use a grid exploration technique, $\Theta$ is also treated as a discrete space:
\begin{equation}
H(Y \mid \Theta, d) = - \sum_{\theta \in \Theta} \exp[\log p(\theta)] \sum_{y\in Y} \exp[ \log p(y\mid \theta, d)] \log p(y\mid \theta, d)
\end{equation}
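
The two sums above translate directly into array operations. The sketch below is not the ADOpy code: the function names, the array layout `(n_design, n_theta, n_y)`, and the example setting ($EXP$ model with $\alpha$ fixed to 0.9, uniform prior on a grid over $\beta$) are assumptions made for illustration. The marginal $\log p(y \mid d)$ is obtained with a log-sum-exp over the parameter grid.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    return np.logaddexp.reduce(a, axis=axis)


def mutual_information(log_lik, log_prior):
    """U(d) = H(Y | d) - H(Y | Theta, d) for every design.

    log_lik:   log p(y | theta, d), shape (n_design, n_theta, n_y)
    log_prior: log p(theta), shape (n_theta,)
    """
    # log p(y | d) = log sum_theta exp[log p(y | theta, d) + log p(theta)]
    log_p_y = logsumexp(log_lik + log_prior[None, :, None], axis=1)
    h_y = -np.sum(np.exp(log_p_y) * log_p_y, axis=-1)

    # H(Y | Theta, d): entropy of the outcome for each theta, averaged over the prior.
    ent_per_theta = -np.sum(np.exp(log_lik) * log_lik, axis=-1)
    h_y_given_theta = np.sum(np.exp(log_prior)[None, :] * ent_per_theta, axis=1)
    return h_y - h_y_given_theta


# Assumed setting: EXP model, alpha fixed, grid over beta, uniform prior.
designs = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 100.0])  # candidate delays
alpha = 0.9
beta_grid = np.linspace(0.01, 0.5, 50)
log_prior = np.full(beta_grid.size, -np.log(beta_grid.size))

p_success = alpha * np.exp(-designs[:, None] * beta_grid[None, :])
log_lik = np.stack([np.log1p(-p_success), np.log(p_success)], axis=-1)

utility = mutual_information(log_lik, log_prior)
d_star = designs[np.argmax(utility)]  # design maximizing U(d)
```

This version assumes that no outcome has probability exactly 0; otherwise the product `exp(log p) * log p` would need to be guarded against `nan`.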

# 3. Model comparison

In case of model comparison, still following Cavagnaro (2009) and Myung et al. (2013), we need to modify our definition of $U$:

\begin{equation}
     U(d) = \sum_m p(m) \int \int u(d, \theta_m, y_m) p(y_m \mid \theta_m, d) p(\theta_m) dy_m d\theta_m,
\end{equation}
where $m \in \{1, 2, \dots, K\}$ is one of a set of $K$ models.

\begin{equation}
\begin{split}
U(d) &= I(M; Y\mid d) \\
u(d, \theta_m, y_m) &= \log \frac{p(m \mid y, d)}{p(m)}
\end{split}
\end{equation}

where $I(M; Y \mid d)$ is the mutual information between the model random variable $M$ and the outcome random variable $Y$, conditional upon design $d$.

$p(m \mid y, d)$ is the posterior model probability of model $m$ obtained by Bayes rule as:
\begin{equation}
p(m|y, d) = \frac{p(y|m, d)p(m)}{p(y|d)} 
\end{equation}

where
\begin{equation}
p(y|m, d) = \int p(y| \theta_m, d)p(\theta_m) d\theta_m 
\end{equation}

\begin{equation}
p(y|d) = \sum_m p(y|m, d)p(m)
\end{equation}
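
With a grid over the parameters, the integral defining $p(y \mid m, d)$ becomes a weighted sum. The sketch below computes the three quantities above for the $\{POW, EXP\}$ model space. The fixed $\alpha$, the shared grid over $\beta$, the uniform priors, and the helper names are assumptions made for illustration.

```python
import numpy as np


def bernoulli_lik(p_success):
    """Return lik[i, y] = p(y | theta_i, d) for y in {0, 1}."""
    return np.stack([1.0 - p_success, p_success], axis=1)


def marginal_likelihood(lik, prior):
    """p(y | m, d) = sum_theta p(y | theta, d) p(theta), for every y."""
    return prior @ lik


# Assumed setting: one design (delay of 10 s), alpha fixed, grid over beta.
delta = 10.0
alpha = 0.9
beta_grid = np.linspace(0.05, 1.0, 20)
prior_theta = np.full(beta_grid.size, 1.0 / beta_grid.size)

p_pow = alpha * delta ** (-beta_grid)
p_exp = alpha * np.exp(-beta_grid * delta)

# Rows: models (POW, EXP); columns: outcomes (y = 0, y = 1).
p_y_given_m = np.stack([
    marginal_likelihood(bernoulli_lik(p_pow), prior_theta),
    marginal_likelihood(bernoulli_lik(p_exp), prior_theta),
])
p_m = np.array([0.5, 0.5])                               # p(m)
p_y = p_m @ p_y_given_m                                  # p(y | d)
p_m_given_y = p_y_given_m * p_m[:, None] / p_y[None, :]  # p(m | y, d)
```

In this setting, $POW$ predicts a higher recall probability than $EXP$ at a delay of 10 s, so a successful recall shifts the belief toward $POW$.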

### Bayesian update of belief regarding the model

Cavagnaro (2009) and Myung et al. (2013) propose to use:
\begin{equation}
p_{t+1}(m) = \frac{p_1(m)}{\sum_{k=1}^K p_1(k) BF_{(k, m)}(y_t \mid d^*)}
\end{equation}

\begin{equation}
BF_{(k, m)}(y \mid d) = \frac{\int p(y \mid \theta_k, d) p(\theta_k) d\theta_k}{\int p(y \mid \theta_m, d) p(\theta_m) d\theta_m}
\end{equation}

Note that $p_1(m)$ is used instead of $p_t(m)$, which gives a lot of influence to the initial prior (i.e., the prior before the experiment begins).

An alternative possibility would be to use:
\begin{equation}
\begin{split}
p_{t+1}(m) &= \frac{p_t(m)p(y_t \mid m, d^*)}{\sum_{k=1}^{K} p_t(k) p(y_t \mid k, d^*)}\\
&= \frac{p_t(m) \int p(y_t \mid \theta_m, d^*) p(\theta_m) d\theta_m}{\sum_{k=1}^{K} p_t(k) \int p(y_t \mid \theta_k, d^*) p(\theta_k) d\theta_k}\\
\end{split}
\end{equation}

### Implementation

For the implementation, we work with log probabilities instead of probabilities, to limit errors due to floating-point precision (e.g., underflow):
\begin{equation}
\log p_{t+1}(m) = \log p_1(m) - \log \sum_{k=1}^K \exp \left( \log p_1(k) + \log \sum_{\theta \in \Theta_k} \exp[\log p(y\mid \theta, d) + \log p (\theta)]  - \log \sum_{\theta \in \Theta_m} \exp[\log p(y\mid \theta, d) + \log p (\theta)]  \right)
\end{equation}
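
The log-space update above can be sketched as follows. The function names and the numbers in the example are hypothetical. The log Bayes factor $\log BF_{(k, m)}$ is computed as the difference between the log marginal likelihoods of models $k$ and $m$.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    return np.logaddexp.reduce(a, axis=axis)


def log_marginal_likelihood(log_lik, log_prior):
    """log sum_theta exp[log p(y | theta, d) + log p(theta)] for the observed y."""
    return logsumexp(log_lik + log_prior, axis=0)


def update_log_model_belief(log_p1, log_marginals):
    """log p_{t+1}(m) for every model m, following the Bayes-factor form above.

    log_p1:        log p_1(m), shape (K,)
    log_marginals: log p(y_t | m, d*), shape (K,)
    """
    # log_bf[m, k] = log BF_(k, m) = log p(y_t | k, d*) - log p(y_t | m, d*)
    log_bf = log_marginals[None, :] - log_marginals[:, None]
    return log_p1 - logsumexp(log_p1[None, :] + log_bf, axis=1)


# Hypothetical example: 2 models, each with a 2-point parameter grid.
log_p1 = np.log(np.array([0.5, 0.5]))
log_prior_theta = np.log(np.array([0.5, 0.5]))
log_marginals = np.array([
    log_marginal_likelihood(np.log(np.array([0.2, 0.4])), log_prior_theta),
    log_marginal_likelihood(np.log(np.array([0.05, 0.15])), log_prior_theta),
])

log_posterior = update_log_model_belief(log_p1, log_marginals)
```

Here the marginal likelihoods are 0.3 and 0.1, so the updated beliefs are $0.15 / 0.2 = 0.75$ and $0.05 / 0.2 = 0.25$, which is what a direct application of Bayes' rule gives.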

\begin{equation}
U(d) = \sum_{m \in M} p(m) \sum_{\theta \in \Theta_m} \sum_{y \in Y}
\exp \left[ \log p(y \mid \theta, d) + \log p(\theta) \right]
\left(
\log \sum_{\theta' \in \Theta_m} \exp[
\log p(y \mid \theta', d) + \log p(\theta')]
- \log \sum_{m' \in M} \exp \left( \log p(m') +
\log \sum_{\theta' \in \Theta_{m'}} \exp[
\log p (y \mid \theta', d) + \log p(\theta')] \right)
\right)
\end{equation}
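
A possible way to compute the model-comparison utility $I(M; Y \mid d)$ on a grid is sketched below. Because the log ratio $\log \frac{p(m \mid y, d)}{p(m)} = \log p(y \mid m, d) - \log p(y \mid d)$ does not depend on $\theta$, the sum over $\theta$ collapses into the marginal likelihood $p(y \mid m, d)$. The function names, the array layout, and the example setting (fixed $\alpha$, shared grid over $\beta$, uniform priors) are assumptions made for illustration.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    return np.logaddexp.reduce(a, axis=axis)


def model_comparison_utility(log_liks, log_priors, log_p_m):
    """U(d) = I(M; Y | d) for every design.

    log_liks:   list of K arrays log p(y | theta_m, d), each (n_design, n_theta_m, n_y)
    log_priors: list of K arrays log p(theta_m), each (n_theta_m,)
    log_p_m:    log p(m), shape (K,)
    """
    # log p(y | m, d), shape (K, n_design, n_y)
    log_marg = np.stack([
        logsumexp(ll + lp[None, :, None], axis=1)
        for ll, lp in zip(log_liks, log_priors)
    ])
    # log p(y | d) = log sum_m exp[log p(m) + log p(y | m, d)]
    log_p_y = logsumexp(log_marg + log_p_m[:, None, None], axis=0)
    log_ratio = log_marg - log_p_y[None, :, :]  # log[p(m | y, d) / p(m)]
    weights = np.exp(log_p_m[:, None, None] + log_marg)  # p(m) p(y | m, d)
    return np.sum(weights * log_ratio, axis=(0, 2))


def bernoulli_log_lik(p_success):
    """Stack log p(y = 0 | .) and log p(y = 1 | .) along the last axis."""
    return np.stack([np.log1p(-p_success), np.log(p_success)], axis=-1)


# Assumed setting: POW vs EXP, alpha fixed, same grid over beta for both models.
designs = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 100.0])
alpha = 0.9
beta_grid = np.linspace(0.05, 1.0, 20)
log_prior_theta = np.full(beta_grid.size, -np.log(beta_grid.size))

p_pow = alpha * designs[:, None] ** (-beta_grid[None, :])
p_exp = alpha * np.exp(-designs[:, None] * beta_grid[None, :])
log_liks = [bernoulli_log_lik(p_pow), bernoulli_log_lik(p_exp)]
log_priors = [log_prior_theta, log_prior_theta]
log_p_m = np.log(np.array([0.5, 0.5]))

utility = model_comparison_utility(log_liks, log_priors, log_p_m)
d_star = designs[np.argmax(utility)]  # most discriminating delay on this grid
```

As a sanity check, the utility is bounded between 0 and $\log K$, and it is 0 for a design at which all models make the same predictions.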