# Differentiating Policies for Non-Myopic Rollout Bayesian Optimization

$$
\newcommand{\mux}{\mu(x)}
\newcommand{\sigx}{\sigma(x)}
\newcommand{\calpha}{\check{\alpha}}
\newcommand{\bfx}{\mathbf{x}}\newcommand{\bfr}{\mathbf{r}}
$$
Throughout these notes, we consider the derivatives of a user-provided function that are needed for Newton's method and for differentiating the argmax with respect to data and hyperparameters. We have a collection of regressors $X \in \mathbb{R}^{d \times n}$ and observations $\mathbf{y} \in \mathbb{R}^{n}$, and we denote the dataset for our supervised machine learning model as $\mathcal{D}_n = \{(x^i, y_i) : 1 \leq i \leq n\}$. Given $\mathcal{D}_{n}$, we denote the predictive mean and predictive variance at a location $x$ as $\mu(x|\mathcal{D}_n)$ and $\sigma(x|\mathcal{D}_n)$ respectively. In general, we'll suppress the dependence on $\mathcal{D}_n$ and write $\mu_n(x)$ and $\sigma_n(x)$.

Now, we consider a user-provided function $g$ of the predictive mean, predictive variance, and hyperparameters $\theta$. We also allow the cost to evaluate to be non-uniform, i.e. $c(x): \mathbb{R}^d \to \mathbb{R}$. This gives us the general base policy

$$\begin{aligned}
\alpha(x, \theta) &= g(\mu_n(x), \sigma_n(x), \theta).
\end{aligned}$$
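As a concrete instance of this base policy, here is a minimal sketch assuming a zero-mean GP with a squared-exponential kernel and an upper-confidence-bound form for $g$. The names `kern`, `posterior`, `g_ucb`, and the `jitter` parameter are hypothetical, and the sketch takes $\sigma_n$ to be the posterior standard deviation, which is an assumption on our part.

```julia
using LinearAlgebra

# Squared-exponential kernel with unit lengthscale (an assumed choice).
kern(x, z) = exp(-sum(abs2, x .- z) / 2)

# Posterior mean and standard deviation of a zero-mean GP at x.
# X is d × n (one regressor per column), y has length n.
function posterior(x, X, y; jitter=1e-8)
    cols = collect(eachcol(X))
    K = [kern(a, b) for a in cols, b in cols] + jitter * I
    kx = [kern(x, a) for a in cols]
    Kinv = inv(K)  # fine for a toy problem; prefer a factorization at scale
    μ = dot(kx, Kinv * y)
    σ = sqrt(max(kern(x, x) - dot(kx, Kinv * kx), 0.0))
    return μ, σ
end

# Hypothetical base policy g(μ, σ, θ): upper confidence bound.
g_ucb(μ, σ, θ) = μ + θ * σ

# α(x, θ) = g(μ_n(x), σ_n(x), θ)
function α(x, θ, X, y)
    μ, σ = posterior(x, X, y)
    return g_ucb(μ, σ, θ)
end

X = [0.0 1.0 2.0]    # d = 1, n = 3
y = [0.0, 1.0, 0.5]
α([1.5], 2.0, X, y)
```

Swapping `g_ucb` for another function of $(\mu, \sigma, \theta)$ changes the policy without touching the posterior code.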

We define an $h$-step rollout policy in terms of the user-provided function $g$ as choosing $x_0, \theta_0$ based on the anticipated behavior of $\alpha$ starting from $x_0, \theta_0$ and proceeding for $h$ steps. That is, we consider the iteration
$$
\DeclareMathOperator*{\argmax}{arg\,max}
x_r, \theta_r = \argmax_{x \in \chi, \theta \in \Theta}\; \alpha(x, \theta), \; 1 \leq r \leq h
$$
where the iterates $(x_0, \theta_0), \ldots, (x_h, \theta_h)$ form a trajectory consisting of $h+1$ decisions.

We'll need the derivatives of $\alpha$ w.r.t. $x$ and $\theta$. Subscripts after a comma denote partial derivatives in $x$ (e.g. $\mu_{,i} = \partial \mu / \partial x_i$), and a dot denotes differentiation with respect to the data $\mathcal{D}_n$:
$$\begin{aligned}
\alpha_{,i} &= \frac{\partial g}{\partial \mu}\mu_{,i} + \frac{\partial g}{\partial \sigma} \sigma_{,i}\\
\alpha_{,ij} &= \frac{\partial^2 g}{\partial\mu^2}\mu_{,i}\mu_{,j} + \frac{\partial^2 g}{\partial\mu\,\partial\sigma}\left(\mu_{,i}\sigma_{,j} + \sigma_{,i}\mu_{,j}\right) + \frac{\partial^2 g}{\partial\sigma^2}\sigma_{,i}\sigma_{,j} + \frac{\partial g}{\partial\mu}\mu_{,ij} + \frac{\partial g}{\partial\sigma}\sigma_{,ij}\\
\dot{\alpha}_{,i} &= 
    \frac{\partial^2 g}{\partial \mu \, \partial \mathcal{D}_n}\mu_{,i} + \frac{\partial g}{\partial \mu}\dot{\mu}_{,i} + \frac{\partial^2 g}{\partial \sigma \, \partial \mathcal{D}_n} \sigma_{,i} + \frac{\partial g}{\partial \sigma} \dot{\sigma}_{,i}\\
\frac{\partial \alpha}{\partial \theta} &= \frac{\partial g}{\partial \theta} \\
\frac{\partial^2 \alpha}{\partial \theta^2} &= \frac{\partial^2 g}{\partial \theta^2} \\
\frac{\partial^2 \alpha}{\partial x_i \, \partial\theta} &= \frac{\partial^2 g}{\partial \mu \, \partial \theta}\mu_{,i} + \frac{\partial^2 g}{\partial \sigma \, \partial \theta}\sigma_{,i} \\
\frac{\partial^2 \alpha}{\partial \theta \, \partial \mathcal{D}_n} &= \frac{\partial^2 g}{\partial \theta \, \partial \mathcal{D}_n}
\end{aligned}$$
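The chain rule for $\alpha = g(\mu(x), \sigma(x), \theta)$ is easy to get wrong at second order, so a numerical check is useful. The sketch below compares the full chain-rule Hessian (products of first derivatives, the cross term in $\mu$ and $\sigma$, and the second derivatives of $\mu$ and $\sigma$) and the mixed $x$-$\theta$ derivative against ForwardDiff applied to $\alpha$ directly. The particular `μ`, `σ`, and `g` are hypothetical smooth stand-ins, not the GP quantities themselves.

```julia
using ForwardDiff
using LinearAlgebra

# Hypothetical smooth stand-ins for the posterior mean and standard deviation.
μ(x) = sin(x[1]) + x[2]^2
σ(x) = 1 + exp(-dot(x, x) / 2)

# Hypothetical base policy g, written on p = [μ, σ] so ForwardDiff can differentiate it.
g(p, θ) = θ * p[1] * p[2] + p[1]^2 + θ * p[2]^2
α(x, θ) = g([μ(x), σ(x)], θ)

x, θ = [0.3, -0.7], 1.5
p = [μ(x), σ(x)]

∇μ, ∇σ = ForwardDiff.gradient(μ, x), ForwardDiff.gradient(σ, x)
Hμ, Hσ = ForwardDiff.hessian(μ, x), ForwardDiff.hessian(σ, x)
g1 = ForwardDiff.gradient(q -> g(q, θ), p)   # [∂g/∂μ, ∂g/∂σ]
g2 = ForwardDiff.hessian(q -> g(q, θ), p)    # second partials of g in (μ, σ)
# [∂²g/∂μ∂θ, ∂²g/∂σ∂θ]
g1θ = ForwardDiff.derivative(s -> ForwardDiff.gradient(q -> g(q, s), p), θ)

# Chain rule for α_{,ij} and for ∂²α/∂x_i∂θ.
H_chain = g2[1, 1] * (∇μ * ∇μ') + g2[1, 2] * (∇μ * ∇σ' + ∇σ * ∇μ') +
          g2[2, 2] * (∇σ * ∇σ') + g1[1] * Hμ + g1[2] * Hσ
mixed_chain = g1θ[1] * ∇μ + g1θ[2] * ∇σ

# Reference values from differentiating α directly.
H_ad = ForwardDiff.hessian(z -> α(z, θ), x)
mixed_ad = ForwardDiff.derivative(s -> ForwardDiff.gradient(z -> α(z, s), x), θ)

norm(H_chain - H_ad), norm(mixed_chain - mixed_ad)
```

Both differences should be at the level of floating-point roundoff; dropping the cross term or the $\mu_{,j}$, $\sigma_{,j}$ factors makes the Hessian check fail for this choice of `g`.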

## Rollout and Adjoint Mode Differentiation
Consider a rollout trajectory to time step $t$. We are interested in computing the derivative of the function value $y_t$ at time $t$ with respect to the initial condition $\bfx_0$. The basic setup for the forward computation is
$$\begin{aligned}
\bfr_j(\bfx_j; \bfx_0, y_0, \ldots, \bfx_{j-1}, y_{j-1}, \theta) &= 0 \\
f(\bfx_j) - y_j &= 0.
\end{aligned}$$
Here $\bfr_j$ should be thought of as the gradient of the acquisition function at step $j$ (i.e. we seek to maximize the acquisition function at each step). I will not write out all the details of this calculation here, as we have it elsewhere.

We can write the usual "forward mode" computation of the derivatives by differentiating each equation in turn and then doing forward substitution.
$$\begin{bmatrix}
  & f'(\bfx_0) & -1 \\
  \frac{\partial \bfr_1}{\partial \theta} & \frac{\partial \bfr_1}{\partial \bfx_0} & \frac{\partial \bfr_1}{\partial y_0} & \frac{\partial \bfr_1}{\partial \bfx_1} \\
  & & & f'(\bfx_1) & -1 \\
  \frac{\partial \bfr_2}{\partial \theta} & \frac{\partial \bfr_2}{\partial \bfx_0} & \frac{\partial \bfr_2}{\partial y_0} & \frac{\partial \bfr_2}{\partial \bfx_1} & \frac{\partial \bfr_2}{\partial y_1} & \frac{\partial \bfr_2}{\partial \bfx_2} \\
  & & & & & f'(\bfx_2) & -1 \\
  \vdots & & & & & & & \ddots \\
  \frac{\partial \bfr_t}{\partial \theta} & \frac{\partial \bfr_t}{\partial \bfx_0} & \frac{\partial \bfr_t}{\partial y_0} & \frac{\partial \bfr_t}{\partial \bfx_1} & \frac{\partial \bfr_t}{\partial y_1} & \frac{\partial \bfr_t}{\partial \bfx_2} & \frac{\partial \bfr_t}{\partial y_2} & \cdots & \frac{\partial \bfr_t}{\partial \bfx_t} \\
  & & & & & & & & f'(\bfx_t) & -1
\end{bmatrix}
\begin{bmatrix}
  \delta \theta \\ \delta \bfx_0 \\ \delta y_0 \\ \delta \bfx_1 \\ \delta y_1 \\ \delta \bfx_2 \\ \delta y_2 \\ \vdots \\ \delta \bfx_t \\ \delta y_t
\end{bmatrix} = 0.$$

We are interested in sensitivity with respect to $\delta \bfx_0$ and $\delta \theta$, so we rewrite this as
$$\begin{bmatrix}
  -1 \\
  \frac{\partial \bfr_1}{\partial y_0} & \frac{\partial \bfr_1}{\partial \bfx_1} \\
  & f'(\bfx_1) & -1 \\
  \frac{\partial \bfr_2}{\partial y_0} & \frac{\partial \bfr_2}{\partial \bfx_1} & \frac{\partial \bfr_2}{\partial y_1} & \frac{\partial \bfr_2}{\partial \bfx_2} \\
  & & & f'(\bfx_2) & -1 \\
  \vdots & & & & & \ddots \\
  \frac{\partial \bfr_t}{\partial y_0} & \frac{\partial \bfr_t}{\partial \bfx_1} & \frac{\partial \bfr_t}{\partial y_1} & \frac{\partial \bfr_t}{\partial \bfx_2} & \frac{\partial \bfr_t}{\partial y_2} & \cdots & \frac{\partial \bfr_t}{\partial \bfx_t} \\
  & & & & & & f'(\bfx_t) & -1
\end{bmatrix}
\begin{bmatrix}
  \delta y_0 \\ \delta \bfx_1 \\ \delta y_1 \\ \delta \bfx_2 \\ \delta y_2 \\ \vdots \\ \delta \bfx_t \\ \delta y_t
\end{bmatrix} =
-\begin{bmatrix}
  0 \\
  \frac{\partial \bfr_1}{\partial \theta} \\
  0 \\
  \frac{\partial \bfr_2}{\partial \theta} \\
  0 \\
  \vdots \\
  \frac{\partial \bfr_t}{\partial \theta} \\
  0
\end{bmatrix} \delta \theta
-\begin{bmatrix}
  f'(\bfx_0) \\
  \frac{\partial \bfr_1}{\partial \bfx_0} \\
  0 \\
  \frac{\partial \bfr_2}{\partial \bfx_0} \\
  0 \\
  \vdots \\
  \frac{\partial \bfr_t}{\partial \bfx_0} \\
  0
\end{bmatrix} \delta \bfx_0.$$

We can write this more concisely as $L v = -q \, \delta \theta - g \, \delta \bfx_0,$ where $L$ is a (block) lower triangular matrix, $v$ is the vector of variations, and $-q \, \delta \theta - g \, \delta \bfx_0$ is the right-hand side (here $g$ denotes the block column multiplying $\delta \bfx_0$, not the base-policy function above). Since $\delta y_t$ is the last entry of $v$, the derivatives with respect to $\bfx_0$ and $\theta$ are
$$\begin{aligned}
\frac{\partial y_t}{\partial \bfx_0} = -e_m^T L^{-1} g, \qquad
\frac{\partial y_t}{\partial \theta} = -e_m^T L^{-1} q,
\end{aligned}$$
where $e_m$ is the last column of the identity matrix.
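One way to read the "adjoint mode" in the section title: instead of solving with $L$ once per column of $g$ and $q$, solve $L^T w = e_m$ once and reuse $w$ for both sensitivities, since $e_m^T L^{-1} = w^T$. The sketch below checks that the two routes agree on a random lower triangular system. The matrices here are hypothetical stand-ins with made-up sizes, not the actual rollout Jacobians.

```julia
using LinearAlgebra
using Random

Random.seed!(1)
m, d, p = 9, 3, 2   # hypothetical sizes: rows of L, dimension of x_0, dimension of θ

# Hypothetical stand-ins: a well-conditioned lower triangular L and dense g, q.
L = LowerTriangular(randn(m, m) + 5I)
g = randn(m, d)
q = randn(m, p)
e_m = [zeros(m - 1); 1.0]

# Forward mode: one triangular solve per column of g and q, then read off the last row.
dy_dx0_fwd = -(L \ g)[end, :]
dy_dθ_fwd = -(L \ q)[end, :]

# Adjoint mode: a single solve with Lᵀ, reused for both sensitivities.
w = L' \ e_m
dy_dx0_adj = -(g' * w)
dy_dθ_adj = -(q' * w)

norm(dy_dx0_fwd - dy_dx0_adj), norm(dy_dθ_fwd - dy_dθ_adj)
```

The adjoint route costs one triangular solve regardless of the number of columns in $g$ and $q$, which is the usual reason to prefer it when only $\delta y_t$ is needed.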

## Inquiry

The derivation above assumes a single hyperparameter $\theta$ shared by every solve of our base policy. Here, we want to consider the case where each solve has its own hyperparameter $\theta_j$, i.e.
$$
\bfr_j(\bfx_j, \theta_j; \bfx_0, y_0, \ldots, \bfx_{j-1}, y_{j-1}) = 0
$$
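A tentative sketch of how the linear system changes, assuming the $\theta_j$ are treated as independent parameters (rather than unknowns chosen by the argmax) and that $\theta_j$ enters only through $\bfr_j$. Under those assumptions, $L$ and $g$ are unchanged and the single $\delta \theta$ term splits into one term per step:

$$\begin{aligned}
L v &= -\sum_{j=1}^{t} q_j \, \delta \theta_j - g \, \delta \bfx_0, \\
\frac{\partial y_t}{\partial \theta_j} &= -e_m^T L^{-1} q_j,
\end{aligned}$$

where $q_j$ is zero except for the block $\frac{\partial \bfr_j}{\partial \theta_j}$ in the row of $\bfr_j$. Setting $\theta_1 = \cdots = \theta_t = \theta$ recovers $q = \sum_j q_j$. If the $\theta_j$ are instead solved for jointly with $\bfx_j$, they become unknowns in $v$ and this sketch no longer applies.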

## Recommendations from David
Make a table of notations. Use macros when typesetting, both for convenience and so that the notation matches what I write in code. Keep the appendix and main document consistent via macros.

New projects?
* Variants on Knowledge Gradient
* Optimal Selection of Exploration Hyperparameters
    * Learning to Learn paper. Rollout on $\theta$
* Cost-Constrained Non-Myopic Bayesian Optimization
    * Stochastic Model for Cost-Function 
* Cloud Resource Management
* Sampling Different GPs with Different Model Hyperparameters
    * Cheap to do with respect to what we have now with rollout.

Universal Kriging. RNN for Scheduling Policy as a function of function characteristics. Maybe train a neural net on the rollout data and function characteristics. Some set of hypothesized function classes, not that it is a particular function. Now let me look at data and see which function class my current thing belongs to. Bayesian Information Criterion for model selection. Akaike Information Criterion. Minimum Description Length.

Adaptive experimentation. Assume a distribution over function families.

David and I are kicking around some things. Would love to chat about them. Is there a good time? 3 way if the schedules intersect.

Play around with something that is less Monte-Carlo. Maybe Laplace Approximation around the mean values of the rollout. This will give us something about the uncertainty (I think).

In [12]:
using ForwardDiff
using LinearAlgebra

In [13]:
# Centered finite-difference estimate of the directional derivative of f at u along du.
function centered_fd(f, u, du, h)
    (f(u+h*du)-f(u-h*du))/(2h)
end

centered_fd (generic function with 1 method)

In [31]:
f(x) = exp(x[1]) * x[2]^2                       # test function
g(x) = ForwardDiff.gradient(f, x)               # gradient of f
mixed_partials(x) = ForwardDiff.jacobian(g, x)  # Jacobian of the gradient, i.e. the Hessian of f

mixed_partials (generic function with 1 method)

In [27]:
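x = [1., 1.]
h = 1e-6
δx = rand(2)  # random direction

gx = dot(g(x), δx)                # directional derivative from automatic differentiation
gx_fd = centered_fd(f, x, δx, h)  # finite-difference estimate of the same quantity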

5.553753741782685

In [29]:
norm(gx - gx_fd) / (norm(gx))

6.62475504626509e-11

In [33]:
mixed_partials(x)

2×2 Matrix{Float64}:
 2.71828  5.43656
 5.43656  5.43656
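The first-derivative check can be repeated one order up: applying the centered difference to the gradient `g` should reproduce the Hessian-vector product `mixed_partials(x) * δx`. The sketch below restates the notebook definitions so it runs on its own; the fixed direction `δx` is an arbitrary choice made here so the result is reproducible.

```julia
using ForwardDiff
using LinearAlgebra

# Centered finite-difference estimate of the directional derivative of f at u along du.
centered_fd(f, u, du, h) = (f(u + h * du) - f(u - h * du)) / (2h)

f(x) = exp(x[1]) * x[2]^2
g(x) = ForwardDiff.gradient(f, x)
mixed_partials(x) = ForwardDiff.jacobian(g, x)

x = [1.0, 1.0]
δx = [0.3, -0.4]   # arbitrary fixed direction (an assumption, not from the notebook)
h = 1e-6

# Hessian-vector product from automatic differentiation vs. finite differences of g.
Hδx = mixed_partials(x) * δx
Hδx_fd = centered_fd(g, x, δx, h)
relerr = norm(Hδx - Hδx_fd) / norm(Hδx)

# Analytic Hessian of f at x: [e^{x1} x2^2, 2 e^{x1} x2; 2 e^{x1} x2, 2 e^{x1}].
e1 = exp(1.0)
H_exact = [e1 2e1; 2e1 2e1]
```

A small `relerr` here says the nested ForwardDiff calls and the finite-difference helper agree on second derivatives, and `H_exact` matches the matrix printed above.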
