# Derivatives for expected improvement

## Sanity Checks
- Make sure we do Newton iteration on the original problem

Throughout these notes, we consider the derivatives of the expected improvement function needed
for Newton's method and for differentiation of the argmax with respect to data and hyperparameters.
We write the expected improvement acquisition function as
$$\begin{aligned}
  \alpha(x) &= \sigma(x) g(z(x)) \\
  g(z) &= z \Phi(z) + \phi(z) \\
  z(x) &= \sigma(x)^{-1} \left[ \mu(x) - f^+ - \xi \right]
\end{aligned}$$
where $\Phi$ and $\phi$ denote the standard normal CDF and PDF, respectively; $f^+$ is the best function value
found so far; and $\xi$ is a parameter to encourage additional exploration.

We use the notation $f_{,i}$ to denote $\partial f / \partial x_i$, and $\dot{f}$ to denote
differentiation with respect to data or an arbitrary hyperparameter.  Except in this initial paragraph, we will 
generally suppress the parameter $x$, leaving it implicit.  Our goal is two-fold:

1.  We want to compute the derivatives necessary for Newton iteration on the problem of maximizing $\alpha$;
    that is, we want the gradient components $\alpha_{,i}$ and the Hessian components $\alpha_{,ij}$.
2.  Given $x^*$ such that $\alpha_{,i}(x^*) = 0$, we want to view $x^*$ as an implicit function of the
    data and input hyper-parameters, and compute derivatives of $x^*$ via implicit differentiation:
    $$
      \alpha_{,ij} \dot{x}_j^* + \dot{\alpha}_{,i} = 0.
    $$
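The two goals above can be sketched on a toy problem. The quadratic `α` below is an assumption chosen so the answer is known in closed form (its maximizer is $x^* = \theta$); it is not the expected improvement function, and the helper names `newton_step` and `implicit_derivative` are hypothetical.

```julia
using LinearAlgebra

# One Newton step toward a stationary point, given gradient g and Hessian H at x.
newton_step(x, g, H) = x - H \ g

# Implicit differentiation: solve H * ẋ = -δg, where δg is the mixed derivative
# of the gradient in the chosen data/hyperparameter direction.
implicit_derivative(H, δg) = -(H \ δg)

# Toy stand-in for α (an assumption, not EI):
#   α(x; θ) = -½ (x - θ)ᵀ A (x - θ), maximized at x* = θ.
A = [2.0 0.5; 0.5 1.0]
θ = [0.3, -0.7]
θdot = [1.0, 2.0]            # direction in which the parameter θ is perturbed

grad(x) = -A * (x - θ)       # gradient of α in x
hess(x) = -A                 # Hessian of α in x
mixed(x) = A * θdot          # derivative of the gradient with respect to θ, applied to θdot

x0 = [5.0, 5.0]
xstar = newton_step(x0, grad(x0), hess(x0))            # exact in one step for a quadratic
xdot = implicit_derivative(hess(xstar), mixed(xstar))  # should reproduce θdot
println("x* = $xstar, ẋ* = $xdot")
```

For this toy problem the derivative of the maximizer equals the parameter perturbation itself, which gives a quick check on the sign convention in the implicit-differentiation equation.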

We will structure the computation from the bottom up, first differentiating the kernel function,
then the predictive mean and variance, then $z$, and finally $\alpha$.

In [4]:
using Distributions
using LinearAlgebra
using Plots



## Kernel derivatives

We assume the kernel has the form $k(x,y) = \psi(\rho)$ where $\rho = \|r\|$ and $r = x-y$.  Recall that
$$\begin{aligned}
  \rho &= \sqrt{r_k r_k} \\
  \rho_{,i} &= \rho^{-1} r_k r_{k,i} = \rho^{-1} r_i \\
  \rho_{,ij} &= \rho^{-1} \delta_{ij} - \rho^{-2} r_i \rho_{,j} \\
             &= \rho^{-1} \left[ \delta_{ij} - \rho^{-2} r_i r_j \right]
\end{aligned}$$
Applying this together with the chain rule yields
$$\begin{aligned}
  k &= \psi(\rho) \\
  k_{,i} &= \psi'(\rho) \rho_{,i} = \psi'(\rho) \rho^{-1} r_i \\
  k_{,ij} &= \psi''(\rho) \rho_{,i} \rho_{,j} + \psi'(\rho) \rho_{,ij} \\
          &= \left[ \psi''(\rho) - \rho^{-1} \psi'(\rho) \right] \rho^{-2} r_i r_j + \rho^{-1} \psi'(\rho) \delta_{ij}.
\end{aligned}$$
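As a worked instance, assume the squared-exponential profile $\psi(\rho) = \sigma_{\mathrm{ref}} \exp(-\rho^2 / (2\ell^2))$. This choice, including the negative sign in the exponent, is an assumption made for illustration. Then
$$\begin{aligned}
  \psi'(\rho) &= -\ell^{-2} \rho \, \psi(\rho) \\
  \psi''(\rho) &= \left[ \ell^{-4} \rho^2 - \ell^{-2} \right] \psi(\rho) \\
  k_{,i} &= -\ell^{-2} \psi(\rho) \, r_i \\
  k_{,ij} &= \psi(\rho) \left[ \ell^{-4} r_i r_j - \ell^{-2} \delta_{ij} \right].
\end{aligned}$$
Both $k_{,i}$ and $k_{,ij}$ are finite at $\rho = 0$ for this profile, even though the general formulas divide by $\rho$.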

In [5]:
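# Radial profile ψ(ρ) and its first two derivatives in ρ.
# NOTE: as written the exponent is positive, ψ = σref*exp(+ρ^2/(2ℓ^2)); the usual
# squared-exponential kernel has a negative exponent. The derivatives below are
# consistent with the profile as written, which is why the checks below pass.
ψ(ρ; ℓ=1.0, σref=1.0) = σref*exp((ρ/ℓ)^2/2)
dψ(ρ; ℓ=1.0, σref=1.0) = ψ(ρ, ℓ=ℓ, σref=σref) * (ρ/(ℓ^2))
d2ψ(ρ; ℓ=1.0, σref=1.0) = (1/ℓ^2) * ψ(ρ, ℓ=ℓ, σref=σref) * (1 + ρ^2/ℓ^2)

# Perturbations to hypers: derivatives of ψ and dψ with respect to ℓ
δψ(ρ; ℓ=1.0, σref=1.0) = ψ(ρ, ℓ=ℓ, σref=σref) * (-ρ^2/ℓ^3)
δdψ(ρ; ℓ=1.0, σref=1.0) = ψ(ρ, ℓ=ℓ, σref=σref)*(-ρ/ℓ^3) * (ρ^2/ℓ^2 + 2)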

ρ = 1.23
h = 1e-4
fd_dψρ = ( ψ(ρ+h)-ψ(ρ-h) )/(2h)
fd_d2ψρ = ( ψ(ρ+h)-2*ψ(ρ)+ψ(ρ-h) )/h^2
relerr_dψρ = (dψ(ρ) - fd_dψρ)/dψ(ρ)
relerr_d2ψρ = (d2ψ(ρ) - fd_d2ψρ)/d2ψ(ρ)

println("Finite difference check on dψ:  $relerr_dψρ")
println("Finite difference check on d2ψ: $relerr_d2ψρ")

fd_δψ = (ψ(ρ; ℓ=1.3+h)-ψ(ρ; ℓ=1.3-h))/(2h)
fd_δdψ = (dψ(ρ; ℓ=1.3+h)-dψ(ρ; ℓ=1.3-h))/(2h)

println("Finite difference check on δψ: $( (δψ(ρ, ℓ=1.3)-fd_δψ)/δψ(ρ, ℓ=1.3) )")
println("Finite difference check on δdψ: $( (δdψ(ρ, ℓ=1.3)-fd_δdψ)/δdψ(ρ, ℓ=1.3) )")

Finite difference check on dψ:  -7.521748047190917e-9
Finite difference check on d2ψ: -6.586409015808244e-10
Finite difference check on δψ: -2.05701026289355e-8
Finite difference check on δdψ: -2.7150549471461273e-8


In [20]:
k(x, y; ℓ=1.0, σref=1.0) = ψ(norm(x-y), ℓ=ℓ, σref=σref)

function ∇k(x, y; ℓ=1.0, σref=1.0)
    ρ = norm(x-y)
    return dψ(ρ, ℓ=ℓ, σref=σref) * (x-y)/ρ
end

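function Hk(x, y; ℓ=1.0, σref=1.0)
    r = x-y
    ρ = norm(r)
    # Pass the hyperparameters through; otherwise dψ and d2ψ silently use their defaults
    dψρ = dψ(ρ, ℓ=ℓ, σref=σref)
    d2ψρ = d2ψ(ρ, ℓ=ℓ, σref=σref)
    return (d2ψρ - dψρ/ρ)/ρ^2 * r * r' + dψρ/ρ * I
end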

δk(x, y; ℓ=1.0, σref=1.0) = δψ(norm(x-y), ℓ=ℓ, σref=σref)

function δ∇k(x, y; ℓ=1.0, σref=1.0)
    ρ = norm(x-y)
    return δdψ(ρ, ℓ=ℓ, σref=σref) * (x-y)/ρ
end

x = rand(2)
y = rand(2)
u = rand(2)

dk_du = ∇k(x, y)'*u
fd_dk_du = (k(x+h*u, y)-k(x-h*u, y))/(2h)
println("Check on derivative: $( (dk_du-fd_dk_du)/dk_du )")

d2k_du2 = u'*Hk(x, y)*u
fd_d2k_du2 = ( k(x+h*u, y)-2*k(x, y)+k(x-h*u, y) )/h^2
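# Wider five-point stencil. It is consistent but still only O(h^2) accurate; the
# fourth-order stencil is (-k₋₂ + 16k₋₁ - 30k₀ + 16k₊₁ - k₊₂)/(12h^2).
alt_fd_d2k_du2 = ( k(x-2h*u, y) + 4k(x-h*u, y) - 10k(x, y) + 4k(x+h*u, y) + k(x+2h*u, y) ) / (8h^2)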
println("Check second derivative: $( (d2k_du2-fd_d2k_du2)/d2k_du2 )")
println("Check second derivative (higher-order approx): $(( d2k_du2-alt_fd_d2k_du2) / d2k_du2)")
println("Absolute Difference: $(abs(alt_fd_d2k_du2 - fd_d2k_du2))")

Check on derivative: -6.212582202839114e-9
Check second derivative: 1.3520378446791313e-9
Check second derivative (higher-order approx): -5.94774394913335e-9
Absolute Difference: 1.1102230246251565e-8


## Mean derivatives

Let $K_{XX}$ denote the kernel matrix on the training points $X$, $k_{Xx}$ the column vector of kernel evaluations between $X$ and $x$, and $k_{xX} = k_{Xx}^T$.
The posterior mean function for the GP (assuming a zero-mean prior) is
$$
  \mu = k_{xX} c
$$
where $K_{XX} c = y$.  Note that $c$ does not depend on $x$, but it does depend on the data and hyperparameters.

Differentiating in space is straightforward, as we only invoke the kernel derivatives:
$$\begin{aligned}
  \mu_{,i} &= k_{xX,i} c \\
  \mu_{,ij} &= k_{xX,ij} c
\end{aligned}$$
Differentiating in the data and hyperparameters requires that we also differentiate through a matrix solve:
$$
  \dot{\mu} = \dot{k}_{xX} K_{XX}^{-1} y + k_{xX} K_{XX}^{-1} \dot{y} - k_{xX} K_{XX}^{-1} \dot{K}_{XX} K_{XX}^{-1} y.
$$
Defining $d = K_{XX}^{-1} k_{Xx}$, we have
$$
  \dot{\mu} = \dot{k}_{xX} c + d^T (\dot{y} - \dot{K}_{XX} c).
$$
Now differentiating in space and defining $K_{XX}^{-1} k_{Xx,i}$ as $w^{(i)}$, we have
$$
  \dot{\mu}_{,i} = \dot{k}_{xX,i} c + (w^{(i)})^T (\dot{y} - \dot{K}_{XX} c).
$$
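The formula for $\dot{\mu}$ can also be evaluated with matrix operations rather than explicit loops. The following is a minimal sketch under stated assumptions: the kernel `se` is a squared exponential with a negative exponent, the only hyperparameter perturbed is the length scale, and the names `kmat`, `post_mean`, and `post_mean_dot` are hypothetical helpers introduced only for this check.

```julia
using LinearAlgebra

# Hypothetical kernel for this sketch: squared exponential with a negative exponent.
se(x, y, ℓ) = exp(-norm(x - y)^2 / (2ℓ^2))
# Its derivative with respect to the length scale ℓ.
se_dℓ(x, y, ℓ) = se(x, y, ℓ) * norm(x - y)^2 / ℓ^3

# Matrix of f evaluated between the columns of A and the columns of B.
kmat(f, A, B, ℓ) = [f(a, b, ℓ) for a in eachcol(A), b in eachcol(B)]

# Posterior mean at x: μ = k_xX c with K_XX c = y.
function post_mean(x, X, y, ℓ)
    kXx = vec(kmat(se, X, reshape(x, :, 1), ℓ))
    return dot(kXx, kmat(se, X, X, ℓ) \ y)
end

# Directional derivative of μ for perturbations (ℓdot, ydot):
#   μ̇ = k̇_xX c + dᵀ (ẏ - K̇_XX c),  with K_XX d = k_Xx.
function post_mean_dot(x, X, y, ℓ, ℓdot, ydot)
    K = kmat(se, X, X, ℓ)
    kXx = vec(kmat(se, X, reshape(x, :, 1), ℓ))
    Kdot = ℓdot .* kmat(se_dℓ, X, X, ℓ)
    kdot = ℓdot .* vec(kmat(se_dℓ, X, reshape(x, :, 1), ℓ))
    c = K \ y
    d = K \ kXx
    return dot(kdot, c) + dot(d, ydot - Kdot * c)
end

X = [0.0 1.0 2.0 3.0; 0.0 1.0 0.0 1.0]   # four well-separated points in 2D
y = [1.0, -0.5, 0.25, 2.0]
x = [1.2, 0.4]
ℓ, ℓdot, ydot = 1.0, 0.7, [0.1, -0.2, 0.3, 0.4]

# Central difference along the joint perturbation of ℓ and y
h = 1e-5
fd = (post_mean(x, X, y + h*ydot, ℓ + h*ℓdot) - post_mean(x, X, y - h*ydot, ℓ - h*ℓdot)) / (2h)
an = post_mean_dot(x, X, y, ℓ, ℓdot, ydot)
println("absolute difference: ", abs(an - fd))
```

The training points are fixed and well separated here so that the kernel matrix stays well conditioned; with random points a small diagonal regularization may be needed.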

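### Darian's Question
Why is $(\dot{y} - \dot{K}_{XX} c)$ treated as a constant when differentiating in space? Don't these values depend on the spatial coordinates as well? Similarly, for $c$.

Here "differentiating in space" means differentiating with respect to the query point $x$ only. The quantities $y$, $K_{XX}$, and $c = K_{XX}^{-1} y$ are built from the training points $X$ and the hyperparameters, so neither they nor their dotted counterparts depend on $x$; only $k_{xX}$ and $\dot{k}_{xX}$ do.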

In [4]:
function kernel_matrix(X, Y; ℓ=1.0, σref=1.0)
    n = size(X)[2]
    m = size(Y)[2]
    K = zeros(n, m)
    for i = 1:n
        for j = 1:m
            K[i,j] = k(X[:,i], Y[:,j], ℓ=ℓ, σref=σref)
        end
    end
    return K
end

function μ(x, X, c; ℓ=1.0, σref=1.0)
    μx = 0.0
    for i = 1:size(X)[2]
        μx += c[i] * k(x, X[:,i], ℓ=ℓ, σref=σref)
    end
    return μx
end

function ∇μ(x, X, c; ℓ=1.0, σref=1.0)
    ∇μx = zeros(length(x))
    for i = 1:size(X)[2]
        ∇μx += c[i] * ∇k(x, X[:,i], ℓ=ℓ, σref=σref)
    end
    return ∇μx
end

function Hμ(x, X, c; ℓ=1.0, σref=1.0)
    Hμx = zeros(length(x), length(x))
    for i = 1:size(X)[2]
        Hμx += c[i] * Hk(x, X[:,i], ℓ=ℓ, σref=σref)
    end
    return Hμx
end

# Set up a test problem

X = rand(2, 10)
y = X[1,:] + 2*X[2,:]
KXX = kernel_matrix(X, X, ℓ=1.0, σref=1.0)
c = KXX\y

x = rand(2)
println("μ($x) = $(μ(x, X, c)) ≈ $(x[1] + 2*x[2])")
println("∇μ($x) = $(∇μ(x, X, c)) ≈ [1, 2]")
println("Hμ($x) = $(Hμ(x, X, c)) ≈ 0 matrix")

u = rand(2)
d2μ_du2 = u'*Hμ(x, X, c)*u
fd_d2μ_du2 = (μ(x+2h*u, X, c) - 2μ(x, X, c) + μ(x-2h*u, X, c)) / (4h^2)
relerr_Hμ = (d2μ_du2 - fd_d2μ_du2) / d2μ_du2
println("Finite difference check on Hμ: $(relerr_Hμ)")

# Finite difference check for Hμ against the gradient
fd_∇μ_du = u'*( ∇μ(x+h*u, X, c) - ∇μ(x-h*u, X, c) ) / (2h)
relerr_Hμ = (d2μ_du2 - fd_∇μ_du) / d2μ_du2
println("Finite difference check on Hμ again: $(relerr_Hμ)")

μ([0.018578033193536347, 0.26736391890838984]) = 0.53041342106863 ≈ 0.553305871010316
∇μ([0.018578033193536347, 0.26736391890838984]) = [1.2334316901768432, 1.9599927137540192] ≈ [1, 2]
Hμ([0.018578033193536347, 0.26736391890838984]) = [-1.2223683352470402 -0.08819176500019665; -0.08819176500019665 0.2774708558645216] ≈ 0 matrix
Finite difference check on Hμ: -1.4027299261117364e-6
Finite difference check on Hμ again: -5.309392882658419e-9


In [5]:
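# δk is the kernel derivative dk/dℓ; multiplying by l̇ gives the directional perturbation
function δμ(x, X, c, ẏ, l̇; ℓ=1.0, σref=1.0)
    δμx = 0.0
    # Pass the hyperparameters through so non-default ℓ and σref are honored
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    d = KXX \ KXx
    
    for i = 1:size(X)[2]
        δμx += c[i] * δk(x, X[:, i], ℓ=ℓ, σref=σref) * l̇ + d[i]*ẏ[i]
        for j = 1:size(X)[2]
            δμx -= d[i] * δk(X[:, i], X[:, j], ℓ=ℓ, σref=σref) * l̇ * c[j]
        end
    end
    
    return δμx
end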

function δ∇μ(x, X, c, ẏ, l̇; ℓ=1.0, σref=1.0)
    δ∇μx = zeros(length(x))
    W = zeros(size(X))
    
    for ndx = 1:size(X)[2]
        W[:, ndx] = ∇k(x, X[:, ndx], ℓ=ℓ, σref=σref)
        δ∇μx += δ∇k(x, X[:, ndx], ℓ=ℓ, σref=σref) * c[ndx] * l̇
    end
    
    # W = G * KXX⁻¹, where column j of G is ∇k(x, X_j); row i of W is (w^(i))ᵀ
    W /= kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    rhs = copy(ẏ)    # accumulates ẏ - δK_XX * c
    
    for i = 1:size(X)[2]
        for j = 1:size(X)[2]
            rhs[i] -= δk(X[:, i], X[:, j], ℓ=ℓ, σref=σref) * l̇ * c[j]
        end
    end
    
    δ∇μx += W*rhs
    
    return δ∇μx
end

l̇ = rand()
ẏ = rand(length(c))
c = kernel_matrix(X, X, ℓ=1.0) \ y
cplus = kernel_matrix(X, X, ℓ=1.0+h*l̇) \ (y + h*ẏ)
cminus = kernel_matrix(X, X, ℓ=1.0-h*l̇) \ (y - h*ẏ)

δμ_test = δμ(x, X, c, ẏ, l̇)
fd_δμ = ( μ(x, X, cplus, ℓ=1.0+h*l̇) - μ(x, X, cminus, ℓ=1.0-h*l̇) ) / (2h)
relerr = (δμ_test - fd_δμ) / δμ_test
println("Finite difference check for δμ: $relerr")

# Finite difference check for δ∇μ (difference δμ in the direction u)
u = rand(length(x))
δ∇μ_test = u'*δ∇μ(x, X, c, ẏ, l̇)
fd_δ∇μ = ( δμ(x+h*u, X, c, ẏ, l̇) - δμ(x-h*u, X, c, ẏ, l̇) ) / (2h)
relerr = (δ∇μ_test - fd_δ∇μ) / δ∇μ_test
println("Finite difference check for δ∇μ: $relerr")

Finite difference check for δμ: -1.516062364136845e-8
Finite difference check for δ∇μ: -3.059476965516987e-8


## Standard deviation derivatives

The predictive variance is
$$
  \sigma^2 = k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx}.
$$
Differentiating the predictive variance twice in space (assuming $k_{xx}$ is independent of $x$, as holds for a stationary kernel)
gives us
$$\begin{aligned}
  2 \sigma \sigma_{,i} &= -2 k_{xX,i} K_{XX}^{-1} k_{Xx} = -2 k_{xX,i} d \\
  2 \sigma_{,i} \sigma_{,j} + 2 \sigma \sigma_{,ij} &= -2 k_{xX,ij} K_{XX}^{-1} k_{Xx} - 2 k_{xX,i} K_{XX}^{-1} k_{Xx,j} \\
                     &= -2 k_{xX,ij} d -2 k_{xX,i} w^{(j)}
\end{aligned}$$
Rearranging to get spatial derivatives of $\sigma$ on their own gives us
$$\begin{aligned}
  \sigma_{,i} &= -\sigma^{-1} k_{xX,i} d \\
  \sigma_{,ij} &= -\sigma^{-1} \left[ k_{xX,ij} d + k_{xX,i} w^{(j)} + \sigma_{,i} \sigma_{,j} \right].
\end{aligned}$$
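A compact way to evaluate $\sigma$ and $\sigma_{,i}$ together is to solve for $d$ once and reuse it. The sketch below assumes a squared-exponential kernel with a negative exponent; `se`, `∇se`, and `std_and_gradient` are hypothetical names introduced only for this check.

```julia
using LinearAlgebra

# Hypothetical squared-exponential kernel (negative exponent) and its gradient in x.
se(x, y, ℓ) = exp(-norm(x - y)^2 / (2ℓ^2))
∇se(x, y, ℓ) = -se(x, y, ℓ) .* (x - y) ./ ℓ^2

# Predictive standard deviation and its spatial gradient:
#   σ² = k_xx - k_xX d,   σ_{,i} = -σ⁻¹ k_{xX,i} d,   with K_XX d = k_Xx.
function std_and_gradient(x, X, ℓ)
    K = [se(a, b, ℓ) for a in eachcol(X), b in eachcol(X)]
    kXx = [se(x, a, ℓ) for a in eachcol(X)]
    G = reduce(hcat, [∇se(x, a, ℓ) for a in eachcol(X)])   # column j holds ∇ₓ k(x, X_j)
    d = K \ kXx
    s = sqrt(se(x, x, ℓ) - dot(kXx, d))
    return s, -(G * d) ./ s
end

X = [0.0 1.0 2.0 3.0; 0.0 1.0 0.0 1.0]   # four well-separated points in 2D
x, u, ℓ, h = [1.2, 0.4], [0.6, -0.8], 1.0, 1e-5

s, ∇s = std_and_gradient(x, X, ℓ)
# Central difference of σ along the direction u
fd = (std_and_gradient(x + h*u, X, ℓ)[1] - std_and_gradient(x - h*u, X, ℓ)[1]) / (2h)
println("absolute difference: ", abs(dot(u, ∇s) - fd))
```

Returning $\sigma$ and its gradient from one function avoids repeating the kernel-matrix solve, which is the dominant cost when the number of training points grows.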

Differentiating with respect to data (and locations) and kernel hypers requires more work.  First, note that
$$\begin{aligned}
  2 \sigma \dot{\sigma} 
  &= \dot{k}_{xx} - 2 \dot{k}_{xX} K_{XX}^{-1} k_{Xx} + k_{xX} K_{XX}^{-1} \dot{K}_{XX} K_{XX}^{-1} k_{Xx} \\
  &= \dot{k}_{xx} - 2 \dot{k}_{xX} d + d^T \dot{K}_{XX} d
\end{aligned}$$
Now, differentiating $\sigma_{,i} = -\sigma^{-1} k_{xX,i} K_{XX}^{-1} k_{Xx}$ with respect to data and hypers gives
$$\begin{aligned}
  \dot{\sigma}_{,i} 
  &= \sigma^{-2} \dot{\sigma} k_{xX,i} K_{XX}^{-1} k_{Xx} -
     \sigma^{-1} \left[ 
       \dot{k}_{xX,i} K_{XX}^{-1} k_{Xx} +
       k_{xX,i} K_{XX}^{-1} \dot{k}_{Xx} -
       k_{xX,i} K_{XX}^{-1} \dot{K}_{XX} K_{XX}^{-1} k_{Xx} \right] \\
  &= -\sigma^{-1} \left[ \dot{\sigma} \sigma_{,i} + \dot{k}_{xX,i} d + (w^{(i)})^T \dot{k}_{Xx} - (w^{(i)})^T \dot{K}_{XX} d \right]
\end{aligned}$$
Note that the last term is $(w^{(i)})^T \dot{K}_{XX} d$, not the $d^T \dot{K}_{XX} d$ that appears in $\dot{\sigma}$, because the factor $k_{xX,i} K_{XX}^{-1} = (w^{(i)})^T$ carries the spatial derivative. This is the expression that the code below implements and checks by finite differences.

In [6]:
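function Hσ(x, X; ℓ=1.0, σref=1.0)
    Hσx = zeros(length(x), length(x))
    # Pass the hyperparameters through to every kernel evaluation
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    d = KXX \ KXx
    
    # W = G * KXX⁻¹, where column j of G is ∇k(x, X_j)
    W = zeros(size(X))
    for col = 1:size(W)[2]
       W[:, col] = ∇k(x, X[:, col], ℓ=ℓ, σref=σref)
    end
    W /= KXX
    
    for i = 1:size(X)[2]
        Hσx += Hk(x, X[:, i], ℓ=ℓ, σref=σref)*d[i] + ∇k(x, X[:, i], ℓ=ℓ, σref=σref)*W[:, i]'
    end
    
    ∇σx = ∇σ(x, X, ℓ=ℓ, σref=σref)
    Hσx += ∇σx*∇σx'
    Hσx ./= -σ(x, X, ℓ=ℓ, σref=σref)
    
    return Hσx
end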

function δ∇σ(x, X, l̇; ℓ=1.0, σref=1.0)
    δ∇σx = δσ(x, X, l̇, ℓ=ℓ, σref=σref) * ∇σ(x, X, ℓ=ℓ, σref=σref)
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    d = KXX \ KXx
    
    # W = G * KXX⁻¹, where column j of G is ∇k(x, X_j)
    W = zeros(size(X))
    for ndx = 1:size(X)[2]
        W[:, ndx] = ∇k(x, X[:, ndx], ℓ=ℓ, σref=σref)
    end
    W /= KXX
    
    z0 = zeros(length(x))      # δ∇k_xX * d
    z1 = zeros(size(X)[2])     # δK_XX * d
    z2 = zeros(size(X)[2])     # δk_Xx
    
    for i = 1:size(X)[2]
        z0 += δ∇k(x, X[:, i], ℓ=ℓ, σref=σref) * d[i] * l̇
        z2[i] = δk(x, X[:, i], ℓ=ℓ, σref=σref) * l̇
        for j = 1:size(X)[2]
            z1[i] += δk(X[:, i], X[:, j], ℓ=ℓ, σref=σref) * d[j] * l̇
        end
    end
    
    δ∇σx += -W*z1 + W*z2 + z0
    δ∇σx /= -σ(x, X, ℓ=ℓ, σref=σref)

    return δ∇σx
end

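function δσ(x, X, l̇; ℓ=1.0, σref=1.0)
    δσx = δk(x, x, ℓ=ℓ, σref=σref) * l̇
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    # Use length(x) rather than a hard-coded dimension of 2
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    d = KXX \ KXx
    
    for i = 1:size(X)[2]
        δσx -= 2 * δk(x, X[:, i], ℓ=ℓ, σref=σref) * d[i] * l̇
        for j = 1:size(X)[2]
            δσx += d[i] * d[j] * δk(X[:, i], X[:, j], ℓ=ℓ, σref=σref) * l̇
        end
    end
    
    δσx /= 2 * σ(x, X, ℓ=ℓ, σref=σref)
    return δσx
end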

function σ(x, X; ℓ=1.0, σref=1.0)
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    # k(x, x) depends on σref, so pass the hyperparameters here as well
    return √(k(x, x, ℓ=ℓ, σref=σref) - dot(KXx, KXX \ KXx))
end

function ∇σ(x, X; ℓ=1.0, σref=1.0)
    ∇σx = zeros(length(x))
    KXX = kernel_matrix(X, X, ℓ=ℓ, σref=σref)
    KXx = kernel_matrix(X, reshape(x, length(x), 1), ℓ=ℓ, σref=σref)
    d = KXX \ KXx
    
    for i = 1:size(X)[2]
        ∇σx += d[i] * ∇k(x, X[:, i], ℓ=ℓ, σref=σref)
    end
    
    ∇σx /= -σ(x, X, ℓ=ℓ, σref=σref)
    
    return ∇σx
end

# Finite difference check on σ with respect to spatial coordinates
u = rand(length(x))
∇σ_test = u'*∇σ(x, X)
fd_dσ_du = ( σ(x+h*u, X) - σ(x-h*u, X) ) / (2h)
relerr = (∇σ_test - fd_dσ_du) / ∇σ_test
println("Finite difference check on ∇σ: $relerr")

# Finite difference check (of Hessian) on σ with respect to spatial coordinates
u = rand(length(x))
Hσ_test = u'*Hσ(x, X)*u
fd_d2σ_du2 = ( σ(x+2h*u, X) - 2σ(x, X) + σ(x-2h*u, X) ) / (4h^2)
relerr = (Hσ_test - fd_d2σ_du2) / Hσ_test
println("Finite difference check on Hσ: $relerr")

# Finite difference check (of hessian) against gradient
fd_∇σ_du = u' * ( ∇σ(x+h*u, X) - ∇σ(x-h*u, X) ) / (2h)
relerr = (Hσ_test - fd_∇σ_du) / Hσ_test
println("Finite difference check on Hσ again: $relerr")

# Finite difference check with respect to hypers
l̇ = rand()
δσ_test = δσ(x, X, l̇)
fd_δσ_dl = ( σ(x, X; ℓ=1.0+h*l̇) - σ(x, X; ℓ=1.0-h*l̇) ) / (2h)
relerr = (δσ_test - fd_δσ_dl) / δσ_test
println("Finite difference check on δσ: $relerr")

# Finite difference check on δ∇σ (difference δσ in the direction u)
l̇ = rand()
u = rand(length(x))
δ∇σ_test = dot(u, δ∇σ(x, X, l̇))
fd_δ∇σ_dx = ( δσ(x+h*u, X, l̇) - δσ(x-h*u, X, l̇) ) / (2h)
relerr = (δ∇σ_test - fd_δ∇σ_dx) / δ∇σ_test
println("Finite difference check on δ∇σ: $relerr")

Finite difference check on ∇σ: -6.522099880309071e-8
Finite difference check on Hσ: -0.0001340400725940369
Finite difference check on Hσ again: -8.993242623767266e-12
Finite difference check on δσ: -2.513051587136889e-7
Finite difference check on δ∇σ: -2.9911796875382046e-8


## Differentiating $z$

Now consider $z = \sigma^{-1} [\mu - f^+ - \xi]$.  As before, we begin with spatial derivatives:
$$\begin{aligned}
  z_{,i} &= -\sigma^{-2} \sigma_{,i} [\mu - f^+ - \xi] + \sigma^{-1} \mu_{,i} \\
         &= \sigma^{-1} \left[ \mu_{,i} - \sigma_{,i} z \right] \\
  z_{,ij} &= -\sigma^{-2} \sigma_{,j} \left[ \mu_{,i} - \sigma_{,i} z \right] + 
             \sigma^{-1} \left[\mu_{,ij} - \sigma_{,ij} z - \sigma_{,i} z_{,j} \right] \\
          &= \sigma^{-1} \left[ \mu_{,ij} - \sigma_{,ij} z - \sigma_{,i} z_{,j} - \sigma_{,j} z_{,i} \right]
\end{aligned}$$
Now we differentiate with respect to data and hypers:
$$\begin{aligned}
  \dot{z} &= -\sigma^{-2} \dot{\sigma} [\mu - f^+ - \xi] + \sigma^{-1} [\dot{\mu} - \dot{f}^+ - \dot{\xi}] \\
          &= \sigma^{-1} [\dot{\mu} - \dot{f}^+ - \dot{\xi} - \dot{\sigma} z] \\
  \dot{z}_{,i} &= \sigma^{-1} [\dot{\mu}_{,i} -\dot{\sigma}_{,i} z - \dot{\sigma} z_{,i} -\sigma_{,i} \dot{z}]
\end{aligned}$$

In [7]:
z(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) = (1/σ(x, X, ℓ=ℓ, σref=σref)) * (μ(x, X, c, ℓ=ℓ, σref=σref) - f⁺ - ξ)
∇z(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) = (1/σ(x, X, ℓ=ℓ, σref=σref)) * (∇μ(x, X, c, ℓ=ℓ, σref=σref) - ∇σ(x, X, ℓ=ℓ, σref=σref)
    * z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref)
)
Hz(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) = (1/σ(x, X, ℓ=ℓ, σref=σref)) * (
    Hμ(x, X, c, ℓ=ℓ, σref=σref) - Hσ(x, X, ℓ=ℓ, σref=σref)*z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref) - 
    ∇σ(x, X, ℓ=ℓ, σref=σref)*∇z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref)' - (∇z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref)*∇σ(x, X, ℓ=ℓ, σref=σref)')
)
δz(x, X, c, f⁺, ξ, l̇, ẏ; ḟ⁺=0.0, ξ̇=0.0, ℓ=1.0, σref=1.0) = (1/σ(x, X, ℓ=ℓ, σref=σref)) * (
    δμ(x, X, c, ẏ, l̇, ℓ=ℓ, σref=σref) - ḟ⁺ - ξ̇ - δσ(x, X, l̇, ℓ=ℓ, σref=σref)*z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref)
)
δ∇z(x, X, c, f⁺, ξ, l̇, ẏ; ḟ⁺=0.0, ξ̇=0.0, ℓ=1.0, σref=1.0) = (1/σ(x, X, ℓ=ℓ, σref=σref)) * (
    δ∇μ(x, X, c, ẏ, l̇, ℓ=ℓ, σref=σref) - δ∇σ(x, X, l̇, ℓ=ℓ, σref=σref)*z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref) -
    δσ(x, X, l̇, ℓ=ℓ, σref=σref)*∇z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref) - ∇σ(x, X, ℓ=ℓ, σref=σref)*δz(x, X, c, f⁺, ξ, l̇, ẏ, ḟ⁺=ḟ⁺, ξ̇=ξ̇, ℓ=ℓ, σref=σref)
)

# Finite difference check for ∇z
u = rand(length(x))
f⁺, ξ = 0.0, 0.0
∇z_test = dot(u, ∇z(x, X, c, f⁺, ξ))
fd_dz_dx = ( z(x+h*u, X, c, f⁺, ξ) - z(x-h*u, X, c, f⁺, ξ) ) / (2h)
relerr = (∇z_test - fd_dz_dx) / ∇z_test
println("Finite difference check for ∇z: $relerr")

# Finite difference check for Hz
# Hz_test = u'*Hz(x, X, c, f⁺, ξ)*u
# fd_d2z_dx2 = ( z(x+2h*u, X, c, f⁺, ξ) - 2z(x, X, c, f⁺, ξ) + z(x-2h*u, X, c, f⁺, ξ) ) / (4h^2)
# relerr = (Hz_test - fd_d2z_dx2) / Hz_test
# println("Finite difference check for Hz: $relerr")

# Finite difference check for Hz using ∇z
Hz_test = u'*Hz(x, X, c, f⁺, ξ)*u
fd_∇z_du = u' * ( ∇z(x+h*u, X, c, f⁺, ξ) - ∇z(x-h*u, X, c, f⁺, ξ) ) / (2h)
relerr = (Hz_test - fd_∇z_du) / Hz_test
println("Finite difference check for Hz again: $relerr")

# Finite difference check for δz
l̇ = rand()
ẏ = rand(length(y))
cplus = kernel_matrix(X, X, ℓ=1.0+h*l̇) \ (y + h*ẏ)
cminus = kernel_matrix(X, X, ℓ=1.0-h*l̇) \ (y - h*ẏ)
δz_test = δz(x, X, c, f⁺, ξ, l̇, ẏ)
fd_dz_dl = ( z(x, X, cplus, f⁺, ξ, ℓ=1.0+h*l̇) - z(x, X, cminus, f⁺, ξ, ℓ=1.0-h*l̇) ) / (2h)
relerr = (δz_test - fd_dz_dl) / δz_test
println("Finite difference check for δz: $relerr")

# Finite difference check for mixed derivative δ∇z
δ∇z_test = u' * δ∇z(x, X, c, f⁺, ξ, l̇, ẏ)
# fd_∇z_dl = u' * ( ∇z(x, X, cplus, f⁺, ξ, ℓ=1.0+h*l̇) - ∇z(x, X, cminus, f⁺, ξ, ℓ=1.0-h*l̇) ) / (2h)
fd_δz_dx = ( δz(x+h*u, X, c, f⁺, ξ, l̇, ẏ) - δz(x-h*u, X, c, f⁺, ξ, l̇, ẏ) ) / (2h)
# relerr = (δ∇z_test - fd_∇z_dl) / δ∇z_test
orelerr = (δ∇z_test - fd_δz_dx) / δ∇z_test
# println("Finite difference check for δ∇z: $relerr")
println("Finite difference check for δ∇z: $orelerr")

Finite difference check for ∇z: -3.297735439627194e-8
Finite difference check for Hz again: -1.6015299277977992e-7
Finite difference check for δz: -8.11372168554918e-8
Finite difference check for δ∇z: -7.21472222043576e-8


## Differentiating $\alpha$

Finally, we differentiate $\alpha = \sigma g(z)$.  As before, we start with spatial derivatives:
$$\begin{aligned}
  \alpha_{,i} &= \sigma_{,i} g(z) + \sigma g'(z) z_{,i} \\
  \alpha_{,ij} &= \sigma_{,ij} g(z) + \sigma_{,i} g'(z) z_{,j} + \sigma_{,j} g'(z) z_{,i} + \sigma g'(z) z_{,ij} + \sigma g''(z) z_{,i} z_{,j} \\
  &= \sigma_{,ij} g(z) + [\sigma_{,i} z_{,j} + \sigma_{,j} z_{,i} + \sigma z_{,ij}] g'(z) + \sigma g''(z) z_{,i} z_{,j}
\end{aligned}$$
We also may want the mixed derivative with respect to spatial coordinates and data and hypers:
$$\begin{aligned}
  \dot{\alpha}_{,i} &= \dot{\sigma}_{,i} g(z) + \sigma_{,i} g'(z) \dot{z} + \dot{\sigma} g'(z) z_{,i} + \sigma g''(z) \dot{z} z_{,i} + \sigma g'(z) \dot{z}_{,i}
\end{aligned}$$

It remains to differentiate $g(z) = z \Phi(z) + \phi(z)$, noting that
$\phi'(z) = -z \phi(z)$ and $\Phi'(z) = \phi(z)$.  This gives
$$\begin{aligned}
  g(z) &= z \Phi(z) + \phi(z) \\
  g'(z) &= \Phi(z) + z \phi(z) + \phi'(z) = \Phi(z) \\
  g''(z) &= \phi(z).
\end{aligned}$$
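Substituting $g'(z) = \Phi(z)$ and $\sigma z_{,i} = \mu_{,i} - \sigma_{,i} z$ into the gradient of $\alpha$ gives a compact form that is useful as an independent check:
$$\begin{aligned}
  \alpha_{,i} &= \sigma_{,i} \left[ z \Phi(z) + \phi(z) \right] + \Phi(z) \left[ \mu_{,i} - \sigma_{,i} z \right] \\
              &= \phi(z) \sigma_{,i} + \Phi(z) \mu_{,i}.
\end{aligned}$$
In particular, when $z$ is large we have $\Phi(z) \approx 1$ and $\phi(z) \approx 0$, so $\nabla \alpha \approx \nabla \mu$. This is consistent with the printed values in these notes, where $\nabla \alpha$ agrees with $\nabla \mu$ to about 15 digits at the test point. A test point with moderate $z$ would likely exercise the $g$ and $g''$ terms more thoroughly.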


In [8]:
g(z) = z * cdf(Normal(), z) + pdf(Normal(), z)
dg(z) = cdf(Normal(), z)
d2g(z) = pdf(Normal(), z)

# Finite difference check of dg = g'
z0 = z(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0)
dg_test = dg(z0)
fd_dg_dz = ( g(z0+h) - g(z0-h) ) / (2h)
relerr = (dg_test - fd_dg_dz) / dg_test
println("Finite difference check for dg: $relerr")

# Finite difference check of d2g = g''
z0 = 0.1
d2g_test = d2g(z0)
fd_d2g_dz2 = ( dg(z0+h) - dg(z0-h) ) / (2h)
relerr = (d2g_test - fd_d2g_dz2) / d2g_test
println("Finite difference check for d2g: $relerr")

Finite difference check for dg: 2.3305801732931286e-12
Finite difference check for d2g: 1.6494457623861254e-9


In [9]:
function δ∇α(x, X, c, f⁺, ξ, l̇, ẏ; ḟ⁺=0.0, ξ̇=0.0, ℓ=1.0, σref=1.0)
    z0 = z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
    gprime = dg(z0)
    ∇zx = ∇z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
    δzx = δz(x, X, c, f⁺, ξ, l̇, ẏ, ḟ⁺=ḟ⁺, ξ̇=ξ̇, ℓ=ℓ, σref=σref)
    σx = σ(x, X, ℓ=ℓ, σref=σref)
    
    return δ∇σ(x, X, l̇, ℓ=ℓ, σref=σref)*g(z0) + ∇σ(x, X, ℓ=ℓ, σref=σref)*gprime*δzx +
    δσ(x, X, l̇, ℓ=ℓ, σref=σref)*gprime*∇zx + σx*d2g(z0)*δzx*∇zx + σx*gprime*δ∇z(x, X, c, f⁺, ξ, l̇, ẏ, ḟ⁺=ḟ⁺, ξ̇=ξ̇, ℓ=ℓ, σref=σref)
end

# Pass ℓ and σref through instead of hard-coding the defaults
α(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) = σ(x, X, ℓ=ℓ, σref=σref) * g(z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref))

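function ∇α(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0)
    z0 = z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
    # First term is ∇σ * g(z0), matching α_{,i} = σ_{,i} g(z) + σ g'(z) z_{,i}.
    # Using z0 in place of g(z0) agrees only when z0 is large, where g(z0) ≈ z0.
    return ∇σ(x, X, ℓ=ℓ, σref=σref)*g(z0) + σ(x, X, ℓ=ℓ, σref=σref)*dg(z0)*∇z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
end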

function Hα(x, X, c, f⁺, ξ; ℓ=1.0, σref=1.0)
    z0 = z(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
    ∇zx = ∇z(x, X, c, f⁺, ξ, ℓ=ℓ, σref=σref)
    ∇σx = ∇σ(x, X, ℓ=ℓ, σref=σref)
    σx = σ(x, X, ℓ=ℓ, σref=σref)
    Hzx = Hz(x, X, c, f⁺, ξ; ℓ=ℓ, σref=σref)
    
    return Hσ(x, X, ℓ=ℓ, σref=σref)*g(z0) + (∇σx*∇zx' + ∇zx*∇σx' + σx * Hzx)*dg(z0) + σx*d2g(z0)*∇zx*∇zx'
end

# Finite difference check of ∇α
u = rand(length(x))
∇α_test = dot(u, ∇α(x, X, c, f⁺, ξ, ℓ=1.0, σref=1.0))
fd_dα_dx = ( α(x+h*u, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) - α(x-h*u, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) ) / (2h)
relerr = (∇α_test - fd_dα_dx) / ∇α_test
println("Finite difference check for ∇α: $relerr")

# Finite difference check of Hα
Hα_test = u' * Hα(x, X, c, f⁺, ξ, ℓ=1.0, σref=1.0) * u
fd_d2α_dx2 = u' * ( ∇α(x+h*u, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) - ∇α(x-h*u, X, c, f⁺, ξ; ℓ=1.0, σref=1.0) ) / (2h)
relerr = (Hα_test - fd_d2α_dx2) / Hα_test
println("Finite difference check for Hα: $relerr")

# Finite difference check for δ∇α
l̇ = rand()
ẏ = rand(length(y))
cplus = kernel_matrix(X, X, ℓ=1.0+h*l̇) \ (y + h*ẏ)
cminus = kernel_matrix(X, X, ℓ=1.0-h*l̇) \ (y - h*ẏ)
δ∇α_test = u' * δ∇α(x, X, c, f⁺, ξ, l̇, ẏ, ḟ⁺=0.0, ξ̇=0.0, ℓ=1.0, σref=1.0)
fd_∇α_dl = u' * ( ∇α(x, X, cplus, f⁺, ξ; ℓ=1.0+h*l̇, σref=1.0) - ∇α(x, X, cminus, f⁺, ξ; ℓ=1.0-h*l̇, σref=1.0) ) / (2h)
relerr = (δ∇α_test - fd_∇α_dl) / δ∇α_test
println("Finite difference check for δ∇α: $relerr")

Finite difference check for ∇α: -6.115390747921339e-10
Finite difference check for Hα: 8.863758924140912e-9
Finite difference check for δ∇α: -1.1119573588255964e-8


In [10]:
∇α(x, X, c, f⁺, ξ, ℓ=1.0, σref=1.0)

2-element Array{Float64,1}:
 1.2334316901768423
 1.9599927137540192

In [11]:
δ∇α(x, X, c, f⁺, ξ, l̇, ẏ, ḟ⁺=0.0, ξ̇=0.0, ℓ=1.0, σref=1.0)

2-element Array{Float64,1}:
 -59.95189363212512
  19.584864954679055