(LaTeX macro definitions)
$$
\newcommand{\lpa}{\left(}
\newcommand{\rpa}{\right)}
\newcommand{\lbr}{\left\lbrace}
\newcommand{\rbr}{\right\rbrace}
\newcommand{\lsb}{\left[}
\newcommand{\rsb}{\right]}
\newcommand{\dr}{\mathrm{d}}
\newcommand{\td}[2]{\frac{\dr #1}{\dr #2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pdd}[3]{\frac{\partial^2 #1}{\partial #2 \partial #3}}
\newcommand{\vct}[1]{\boldsymbol{#1}}
\newcommand{\mtx}[1]{\mathbf{#1}}
\newcommand{\tr}{^\mathrm{T}}
\newcommand{\onevct}{\vct{\mathit{1}}}
\newcommand{\zerovct}{\vct{\mathit{0}}}
\newcommand{\onemtx}{\mtx{1}}
\DeclareMathOperator{\diag}{diag}
\DeclareMathOperator{\Tr}{tr}
\newcommand{\map}[1]{\mathbf{#1}}
\newcommand{\set}[1]{\mathcal{#1}}
\newcommand{\fset}[1]{\lbr #1 \rbr}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\naturals}{\mathbb{N}}
\newcommand{\ind}[1]{\mathbbm{1}\lsb #1 \rsb}
\newcommand{\bigo}[1]{\mathcal{O}\lpa #1\rpa}
\newcommand{\defas}{\overset{\underset{\mathrm{def}}{}}{=}}
\DeclareMathOperator{\realpart}{Re}
\DeclareMathOperator{\imagpart}{Im}
\newcommand{\prob}[1]{\mathbb{P}\lsb #1 \rsb}
\newcommand{\pdf}[2]{p_{#1}(#2)}
\newcommand{\rvar}[1]{\mathsf{#1}}
\newcommand{\rvct}[1]{\boldsymbol{\rvar{#1}}}
\newcommand{\nrm}[1]{\mathcal{N}\lpa #1 \rpa}
\newcommand{\gvn}{\,|\,}
\newcommand{\rng}[2]{\lbr #1 \dots #2 \rbr}
\newcommand{\expc}[2]{\mathbb{E}_{#1}\lsb #2 \rsb}
\newcommand{\var}[1]{\mathbb{V}\lsb #1 \rsb}
\newcommand{\asymvar}[2]{\sigma^2 \lsb #2 ; #1\rsb}
\newcommand{\ivdstsym}{\pi}
\newcommand{\ivdst}[2]{\ivdstsym_{#1} \lsb #2 \rsb}
$$

# Hamiltonian Monte Carlo

## A brief(ish) introduction

## Motivation

### Problem definition

Given some probability distribution defined on a real vector space $\reals^N$ by the (potentially unnormalised) density function

$$ 
  \ivdst{\rvct{x}}{\vct{x}} \propto
  \exp \lbr - \phi(\vct{x}) \rbr
$$

generate a set of samples $\fset{\vct{x}^{(i)}}_{i=1}^M$ from a Markov chain which has the distribution defined by $\ivdstsym_{\rvct{x}}$ as its unique invariant measure, so that they can be used to compute Monte Carlo approximations to expectations with respect to this distribution

$$
  \expc{\ivdstsym_{\rvct{x}}}{f} \approx \frac{1}{M} \sum_{i=1}^M \lbr f\lpa \vct{x}^{(i)} \rpa \rbr
$$

### Assumptions

  * Support of distribution is full vector space: $\ivdst{\rvct{x}}{\vct{x}} > 0 ~~\forall \vct{x} \in \reals^N$
    * If support is some bounded subset can sometimes transform to equivalent unconstrained space using variable transform.
  * Density function (and energy) is everywhere differentiable with respect to $\vct{x}$ and the gradients can be tractably computed.
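
As a minimal sketch of the variable transform mentioned above, assume a positive variable $x > 0$ mapped to $u = \log x$. The function name `unconstrained_energy` and the Exponential(1) target are hypothetical choices made only for illustration; the log-Jacobian of the transform has to be included in the transformed energy.

```python
import numpy as np

def unconstrained_energy(phi, u):
    """Energy of u = log(x) when x > 0 has density proportional to exp(-phi(x))."""
    # Change of variables: p_u(u) = p_x(exp(u)) * |dx/du| with dx/du = exp(u),
    # so the log-Jacobian u is subtracted from the energy.
    return phi(np.exp(u)) - u

# Hypothetical example: x ~ Exponential(1), so phi(x) = x on x > 0
u_grid = np.linspace(-30.0, 10.0, 400001)
density_u = np.exp(-unconstrained_energy(lambda x: x, u_grid))
# Riemann sum check that the transformed density still integrates to one
total_mass = np.sum(density_u) * (u_grid[1] - u_grid[0])
```

The transformed density has full support on the real line, so the first assumption holds in the $u$ coordinates.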

### Metropolis-Hastings - quick recap

Define a proposal density $q(\vct{x}' \gvn \vct{x})$ we can tractably sample from, generate a sample from it given the current state and then accept the proposal with probability

$$
  a_{\rvct{x}^{(t+1)} \gvn \rvct{x}^{(t)}} \lsb \vct{x}' \gvn \vct{x} \rsb =
  \min \lbr 1, 
    \frac{q(\vct{x} \gvn \vct{x}') \ivdst{\rvct{x}}{\vct{x}'}} 
         {q(\vct{x}' \gvn \vct{x}) \ivdst{\rvct{x}}{\vct{x}}} \left|\pd{\vct{x}'}{\vct{x}}\right| \rbr
$$

For a derivation and explanation of the Jacobian term in the acceptance ratio see [Green (1995)](http://biomet.oxfordjournals.org/content/82/4/711.short) or [Lan et al. (2012)](http://arxiv.org/abs/1211.3759).

If the proposal density is symmetric, $q(\vct{x}' \gvn \vct{x}) = q(\vct{x} \gvn \vct{x}') ~~\forall \vct{x},~ \vct{x}' \in \reals^N$, and $\ivdstsym_{\rvct{x}}$ is as defined above, then the acceptance probability reduces to

$$
  a_{\rvct{x}^{(t+1)} \gvn \rvct{x}^{(t)}} \lsb \vct{x}' \gvn \vct{x} \rsb =
  \min \lbr 1, \exp\lsb \phi(\vct{x}) - \phi(\vct{x}') \rsb \left|\pd{\vct{x}'}{\vct{x}}\right| \rbr
$$

The key problem is finding a proposal density that proposes 'large' moves with a high probability of acceptance.
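
The symmetric-proposal case can be sketched as a random-walk Metropolis sampler. This is an illustrative sketch only: the function name, the Gaussian proposal, the step size and the standard normal target are all assumptions, and the Jacobian factor equals one for this additive proposal.

```python
import numpy as np

def random_walk_metropolis(phi, x0, n_samples, step_size, rng):
    """Sample from a density proportional to exp(-phi(x)) with a symmetric proposal."""
    x = np.array(x0, dtype=float)
    phi_x = phi(x)
    samples = np.empty((n_samples, x.size))
    n_accept = 0
    for i in range(n_samples):
        # Symmetric proposal: q(x' | x) = N(x'; x, step_size^2 I)
        x_prop = x + step_size * rng.standard_normal(x.size)
        phi_prop = phi(x_prop)
        # Accept with probability min(1, exp(phi(x) - phi(x')))
        if np.log(rng.uniform()) < phi_x - phi_prop:
            x, phi_x = x_prop, phi_prop
            n_accept += 1
        samples[i] = x  # on rejection the current state is repeated
    return samples, n_accept / n_samples

# Hypothetical target: standard normal in two dimensions, phi(x) = x.x / 2
rng = np.random.default_rng(0)
samples, accept_rate = random_walk_metropolis(
    lambda x: 0.5 * x @ x, np.zeros(2), 20000, 1.0, rng)
mean_estimate = samples.mean(axis=0)  # Monte Carlo estimate of E[x]
```

Increasing `step_size` proposes larger moves but lowers the acceptance rate, which is the trade-off HMC is designed to avoid.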

## Abstract description of Hamiltonian Monte Carlo

### Augment state space

Augment state space with a vector 'momentum' variable $\rvct{p} \in \reals^N$ and a signed 'time direction' variable $\rvar{d} \in \fset{-1,+1}$ with conditional invariant distribution densities

$$
  \ivdst{\rvct{p}\gvn\rvct{x}}{\vct{p}\gvn\vct{x}} \propto \exp\lbr - \tau(\vct{x}, \vct{p}) \rbr
  \text{  and  }
  \ivdst{\rvar{d}\gvn\rvct{x},\rvct{p}}{d \gvn \vct{x}, \vct{p}} = \frac{1}{2}
$$

giving joint invariant density

\begin{align}
  \ivdst{\rvct{x},\,\rvct{p},\,\rvar{d}}{\vct{x},\,\vct{p},\,d} &\propto 
  \exp\lbr - \phi(\vct{x}) - \tau(\vct{x}, \vct{p}) \rbr \\
  &=
  \exp\lbr - H(\vct{x}, \vct{p}) \rbr
\end{align}

### Hamiltonian dynamic in augmented state space

If $\mtx{S} = -\mtx{S}\tr \in \reals^{2N\times 2N}$ is a constant non-singular skew-symmetric matrix then we can define a Hamiltonian dynamic on the joint system $\vct{z} = \lsb \vct{x}\tr ~~ \vct{p}\tr \rsb\tr$ by

$$
  \td{\vct{z}}{t} = d \, \mtx{S} \pd{H}{\vct{z}} ~~\Leftrightarrow~~
  \left[
    \begin{array}{c}
      \td{\vct{x}}{t} \\ 
      \td{\vct{p}}{t}
    \end{array}
  \right]
  =
  d \, \mtx{S}
  \left[
    \begin{array}{c}
      \pd{H}{\vct{x}} \\ 
      \pd{H}{\vct{p}}
    \end{array}
  \right]
$$

with flow map $\Psi_{T,d}\lbr \vct{z}_0 \rbr = \vct{z}(T)$ defined by the solution $\vct{z}(T)$ to the initial value problem

$$
  \td{\vct{z}}{t} = d \, \mtx{S} \pd{H}{\vct{z}}, ~~
  \vct{z}(0) = \vct{z}_0, ~~
  t \in [0, T].
$$

Typically $
  \mtx{S} = 
  \left[
    \begin{array}{cc}
      \mtx{0} & \mtx{I} \\ 
      -\mtx{I} & \mtx{0}
    \end{array}
  \right]
$ in which case the dynamic is *canonical*.

### Properties of dynamic

This dynamic

  * exactly preserves the Hamiltonian $H(\vct{z})$
  
  $$
    \td{H}{t} = \pd{H}{\vct{z}}\tr \td{\vct{z}}{t} = d \pd{H}{\vct{z}}\tr \mtx{S} \pd{H}{\vct{z}} = 0
  $$
  
  * preserves volume as flow is divergence-free  
  (Liouville's theorem)
  
  $$
    \lpa \pd{}{\vct{z}} \rpa\tr \td{\vct{z}}{t} = d \, \Tr \lsb \mtx{S} \pdd{H}{\vct{z}}{\vct{z}\tr} \rsb = 0
    ~~\Rightarrow~~
    \left| \pd{\Psi_{T,d}}{\vct{z}} \right| = 1
  $$

  * is reversible under negation of $\rvar{d}$
  
  $$
    \text{If } \vct{z}' = 
    \Psi_{T,+1} \lbr \vct{z} \rbr
    \text{ then } \vct{z} = 
    \Psi_{T,-1} \lbr \vct{z}' \rbr.
  $$

The dynamic also has the further property of being a *symplectic map* with respect to the *structure matrix* $\mtx{S}$

$$
    \pd{\Psi_{T,d}\lbr\vct{z}\rbr}{\vct{z}}\tr \mtx{S}^{-1} 
    \pd{\Psi_{T,d}\lbr\vct{z}\rbr}{\vct{z}} = \mtx{S}^{-1}
$$

see for example [Leimkuhler and Reich (2005)](http://ebooks.cambridge.org/ebook.jsf?bid=CBO9780511614118) for proof and more details. 

Symplecticness implies volume preservation but is a more stringent requirement for $N > 1$.
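
These properties can be checked numerically in a case where the flow map is known in closed form. The sketch below assumes $N = 1$, the canonical $\mtx{S}$ and the hypothetical quadratic Hamiltonian $H(x, p) = \frac{1}{2}(x^2 + p^2)$, for which the exact flow is a rotation in the $(x, p)$ plane. The function names are illustrative.

```python
import numpy as np

S = np.array([[0.0, 1.0], [-1.0, 0.0]])  # canonical structure matrix, N = 1

def hamiltonian(z):
    # Hypothetical example: H(x, p) = (x^2 + p^2) / 2
    return 0.5 * z @ z

def flow_jacobian(T, d):
    """Jacobian of the exact flow map: a rotation by angle d * T."""
    c, s = np.cos(d * T), np.sin(d * T)
    return np.array([[c, s], [-s, c]])

def exact_flow(z, T, d):
    """Exact solution of dz/dt = d S dH/dz for the quadratic H above."""
    return flow_jacobian(T, d) @ z

z0 = np.array([1.5, -0.3])
z1 = exact_flow(z0, 2.0, +1)
jac = flow_jacobian(2.0, +1)

energy_change = hamiltonian(z1) - hamiltonian(z0)   # Hamiltonian preserved
volume_factor = abs(np.linalg.det(jac))             # volume preserved
z_back = exact_flow(z1, 2.0, -1)                    # reversible under d -> -d
S_inv = np.linalg.inv(S)
symplectic_residual = np.abs(jac.T @ S_inv @ jac - S_inv).max()
```

For $N = 1$ the symplectic condition and area preservation coincide, so this example cannot show the gap between the two; that needs $N > 1$.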

### Exact Hamiltonian Monte Carlo

If we therefore define a proposal density

$$
  q(\vct{z}', d' \gvn \vct{z}, d) = \delta \lpa \Psi_{T,d}\lbr \vct{z} \rbr - \vct{z}' \rpa \delta \lsb d - (-d') \rsb
$$

i.e. propose a new state by deterministically running the Hamiltonian dynamics forward $T$ units of time and then reversing the time direction, then our acceptance probability will be unity

$$
  \min \lbr 1, \exp\lsb H(\vct{z}) - H(\vct{z}') \rsb \left|\pd{\vct{z}'}{\vct{z}}\right| \rbr = 1.
$$

We can then deterministically flip $\rvar{d}$ (so on the next proposal we won't go back to our previous point) as 

$$
  \ivdst{\rvar{d}\gvn\rvct{x},\rvct{p}}{+1 | \vct{x}, \vct{p}} = 
  \ivdst{\rvar{d}\gvn\rvct{x},\rvct{p}}{-1 | \vct{x}, \vct{p}} = \frac{1}{2}
$$ 

and so this move also leaves the joint density invariant. 

This composition of transitions will *not* be ergodic however in joint state space as we remain confined to the same constant Hamiltonian manifold.

## Moving to a concrete implementation

### Simulating Hamiltonian dynamics in practice

  * In reality for most systems of interest we cannot compute the flow map $\Psi_{T,d}$ exactly and so have to resort to discretisation and numerical integration.
  
  * Importantly there are numerical integration schemes which define an approximate flow map $\tilde{\Psi}_{T,d}$ which conserve the volume-preservation and reversibility properties of the exact dynamic $\Psi_{T,d}$.

  * In general the Hamiltonian is no longer exactly conserved under discretisation, so there will be some rejections.
  * There is a class of integrators which also preserve the symplectic map property of the exact dynamic.
 

  * Symplectic integrators have a very desirable further invariance property: 
    * providing the discretised dynamic is stable they exactly integrate the dynamic of some alternative 'nearby' Hamiltonian
    * this is bounded to be within a fixed distance (depending on $\delta t$ the discretisation time step) of the original Hamiltonian.
    * therefore still possible to integrate dynamics over long time periods with high probability of acceptance.
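
The bounded energy error of a symplectic scheme can be illustrated by comparing it with a non-symplectic one over a long trajectory. The sketch below assumes the hypothetical Hamiltonian $H(x, p) = \frac{1}{2}(x^2 + p^2)$ with $N = 1$ and compares explicit Euler against the leapfrog scheme; step size and trajectory length are arbitrary illustrative choices.

```python
def energy(x, p):
    # Hypothetical example: H(x, p) = (x^2 + p^2) / 2
    return 0.5 * (x**2 + p**2)

def euler_step(x, p, dt):
    """Explicit Euler: not symplectic, energy grows by (1 + dt^2) each step."""
    return x + dt * p, p - dt * x

def leapfrog_step(x, p, dt):
    """Leapfrog: symplectic, conserves a nearby Hamiltonian exactly."""
    p = p - 0.5 * dt * x
    x = x + dt * p
    p = p - 0.5 * dt * x
    return x, p

def max_energy_error(step, n_steps, dt):
    """Largest |H - H_0| seen along a trajectory started at (x, p) = (1, 0)."""
    x, p = 1.0, 0.0
    h0 = energy(x, p)
    worst = 0.0
    for _ in range(n_steps):
        x, p = step(x, p, dt)
        worst = max(worst, abs(energy(x, p) - h0))
    return worst

euler_error = max_energy_error(euler_step, 10000, 0.1)
leapfrog_error = max_energy_error(leapfrog_step, 10000, 0.1)
```

The leapfrog error stays of order $\delta t^2$ however long the trajectory, while the Euler error grows without bound.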

### Standard (‘Euclidean manifold’) HMC

The standard (and original) implementation of HMC augments the system with variables which are independent of the original state and have a Gaussian conditional / marginal

$$
  \ivdst{\rvct{p}\gvn\rvct{x}}{\vct{p}\gvn\vct{x}} =
  \ivdst{\rvct{p}}{\vct{p}} \propto
  \exp \lbr - \underbrace{ \frac{1}{2}\vct{p}\tr\mtx{M}^{-1}\vct{p}}_{\tau(\vct{p})} \rbr
$$

The derivative $\pd{\tau}{\vct{p}} = \mtx{M}^{-1}\vct{p}$ is now just a linear transform of the $\vct{p}$ variables which can be considered in analogy to Newtonian mechanics momentum variables with $\mtx{M}$ a mass matrix.

$$
    \mtx{M} \td{^2\vct{x}}{t^2} = \vct{f}(\vct{x}) ~\Leftrightarrow~
    \td{\vct{x}}{t} = \mtx{M}^{-1} \vct{p} ~~~
    \td{\vct{p}}{t} = \vct{f}(\vct{x}) = -\pd{\phi}{\vct{x}}
$$

In this case, as the distribution on $\rvct{p}$ is symmetric, there is no need to add a further binary direction variable $\rvar{d}$, as reversibility can be achieved by negating the momentum variables (which leaves their density invariant).

$$ \tau(\vct{p}) = \tau(-\vct{p}) = \frac{1}{2}\vct{p}\tr\mtx{M}^{-1}\vct{p} $$

Further as $\ivdst{\rvct{p}\gvn\rvct{x}}{\vct{p}\gvn\vct{x}}$ is Gaussian we can easily resample the momentum variables between dynamic proposal updates to alter the energy of the system and ensure ergodicity.
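
A minimal sketch of the Gaussian momentum distribution, assuming a dense mass matrix handled through its Cholesky factor. The function names and the $2 \times 2$ mass matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def kinetic_energy(p, mass_matrix):
    """tau(p) = p^T M^{-1} p / 2."""
    return 0.5 * p @ np.linalg.solve(mass_matrix, p)

def sample_momentum(mass_matrix, rng):
    """Draw p ~ N(0, M) using a Cholesky factor of M."""
    chol = np.linalg.cholesky(mass_matrix)  # M = L L^T
    return chol @ rng.standard_normal(mass_matrix.shape[0])

# Hypothetical 2x2 mass matrix, chosen only for illustration
mass_matrix = np.array([[2.0, 0.5], [0.5, 1.0]])
rng = np.random.default_rng(1)
p = sample_momentum(mass_matrix, rng)

# Empirical covariance of many draws should be close to M
momenta = np.array([sample_momentum(mass_matrix, rng) for _ in range(20000)])
empirical_cov = momenta.T @ momenta / len(momenta)
```

Since `kinetic_energy(p, M)` equals `kinetic_energy(-p, M)`, negating the momentum leaves its density unchanged.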

### Leapfrog updates
  
For standard HMC the Hamiltonian is *separable* (no terms coupling $\vct{x}$ and $\vct{p}$) for which under a canonical dynamic
  
$$
  \left[
    \begin{array}{c}
      \td{\vct{x}}{t} \\ 
      \td{\vct{p}}{t}
    \end{array}
  \right]
  =
  \left[
    \begin{array}{cc}
      \mtx{0} & \mtx{I} \\ 
      -\mtx{I} & \mtx{0}
    \end{array}
  \right]
  \left[
    \begin{array}{c}
      \pd{H}{\vct{x}} \\ 
      \pd{H}{\vct{p}}
    \end{array}
  \right]
  =
    \left[
    \begin{array}{c}
      \mtx{M}^{-1}\vct{p} \\ 
      -\pd{\phi}{\vct{x}}
    \end{array}
  \right]
$$

there is a particularly efficient symplectic integration scheme called the leapfrog method, composed of two step types

$$
  \left[
    \begin{array}{c}
      \vct{x}^\star \\ 
      \vct{p}^\star
    \end{array}
  \right]
  =
  \Phi^{A}_{\delta t}
  \left[
    \begin{array}{c}
      \vct{x} \\ 
      \vct{p}
    \end{array}
  \right]
  =
  \left[
    \begin{array}{c}
      \vct{x} \\
      \vct{p} - \delta t \pd{\phi}{\vct{x}}
    \end{array}
  \right]
  \\
  \left[
    \begin{array}{c}
      \vct{x}' \\ 
      \vct{p}'
    \end{array}
  \right]
  =
  \Phi^{B}_{\delta t}
  \left[
    \begin{array}{c}
      \vct{x} \\ 
      \vct{p}
    \end{array}
  \right]
  =
  \left[
    \begin{array}{c}
      \vct{x} + \delta t \mtx{M}^{-1} \vct{p} \\
      \vct{p} \\
    \end{array}
  \right]
$$

Individually each of these steps is volume preserving

$$
  \left| \pd{\Phi^A_{\delta t}}{\vct{z}} \right| =
  \left|
    \begin{array}{cc}
      \pd{\vct{x}^\star}{\vct{x}} & \pd{\vct{x}^\star}{\vct{p}} \\ 
      \pd{\vct{p}^\star}{\vct{x}} & \pd{\vct{p}^\star}{\vct{p}} \\ 
    \end{array}
  \right|
  =
  \left|
    \begin{array}{cc}
      \mtx{I} & \mtx{0} \\ 
      -\delta t\pdd{\phi}{\vct{x}}{\vct{x}\tr} & \mtx{I}
    \end{array}
  \right| = 1
  \\
  \left| \pd{\Phi^B_{\delta t}}{\vct{z}} \right| =
  \left|
    \begin{array}{cc}
      \pd{\vct{x}'}{\vct{x}} & \pd{\vct{x}'}{\vct{p}} \\ 
      \pd{\vct{p}'}{\vct{x}} & \pd{\vct{p}'}{\vct{p}} \\ 
    \end{array}
  \right|
  =
  \left|
    \begin{array}{cc}
      \mtx{I} & \delta t \mtx{M}^{-1} \\ 
      \mtx{0} & \mtx{I}
    \end{array}
  \right| = 1
$$

and therefore any composition of them is also. In particular symmetric compositions of the form

$$ \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} $$

are also time reversible and symplectic (therefore 'nearby' Hamiltonian exactly conserved).
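
These claims can be checked numerically for a non-quadratic energy. The sketch below assumes a unit mass matrix, $N = 2$ and the hypothetical energy $\phi(\vct{x}) = \sum_i (x_i^4 / 4 + x_i^2 / 2)$; the Jacobian is approximated by central finite differences, so the checks hold only up to discretisation error.

```python
import numpy as np

def grad_phi(x):
    # Hypothetical non-Gaussian energy phi(x) = sum(x^4 / 4 + x^2 / 2)
    return x**3 + x

def step_a(z, dt):
    """Phi^A: momentum update at fixed position (unit mass matrix assumed)."""
    x, p = np.split(z, 2)
    return np.concatenate([x, p - dt * grad_phi(x)])

def step_b(z, dt):
    """Phi^B: position update at fixed momentum (unit mass matrix assumed)."""
    x, p = np.split(z, 2)
    return np.concatenate([x + dt * p, p])

def leapfrog(z, dt):
    """Symmetric composition Phi^A_{dt/2} o Phi^B_{dt} o Phi^A_{dt/2}."""
    return step_a(step_b(step_a(z, 0.5 * dt), dt), 0.5 * dt)

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z."""
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(z.size)]
    return np.stack(cols, axis=1)

z = np.array([0.7, -1.2, 0.4, 0.9])  # [x, p] with N = 2
jac = numerical_jacobian(lambda v: leapfrog(v, 0.2), z)
jac_det = np.linalg.det(jac)                      # volume preservation
S = np.block([[np.zeros((2, 2)), np.eye(2)], [-np.eye(2), np.zeros((2, 2))]])
S_inv = np.linalg.inv(S)
symplectic_residual = np.abs(jac.T @ S_inv @ jac - S_inv).max()
z_back = leapfrog(leapfrog(z, 0.2), -0.2)         # time reversibility
```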

Overall this gives a single step of leapfrog dynamics as

\begin{align}
  \vct{p}^\star &= \vct{p} - \frac{1}{2} \delta t \left. \pd{\phi}{\vct{x}} \right|_{\vct{x}}\\
  \vct{x}' &= \vct{x} + \delta t \, \mtx{M}^{-1} \vct{p}^\star \\
  \vct{p}' &= \vct{p}^\star - \frac{1}{2} \delta t \left. \pd{\phi}{\vct{x}} \right|_{\vct{x}'}\\
\end{align}
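
The three updates above translate directly into code. A minimal sketch, assuming the caller supplies a gradient function `grad_phi` and the inverse mass matrix `mass_inv` (both names are illustrative), with a hypothetical standard normal target:

```python
import numpy as np

def leapfrog_step(x, p, grad_phi, dt, mass_inv):
    """One leapfrog step for H(x, p) = phi(x) + p^T M^{-1} p / 2."""
    p_half = p - 0.5 * dt * grad_phi(x)          # half step in momentum
    x_new = x + dt * mass_inv @ p_half           # full step in position
    p_new = p_half - 0.5 * dt * grad_phi(x_new)  # half step in momentum
    return x_new, p_new

def hamiltonian(x, p):
    # Hypothetical target: phi(x) = x.x / 2 with identity mass matrix
    return 0.5 * x @ x + 0.5 * p @ p

grad_phi = lambda x: x
mass_inv = np.eye(2)
x0, p0 = np.array([1.0, -0.5]), np.array([0.3, 0.8])
x1, p1 = leapfrog_step(x0, p0, grad_phi, 0.1, mass_inv)
energy_error = abs(hamiltonian(x1, p1) - hamiltonian(x0, p0))

# Reversibility: negate the momentum, step again, negate back
x_back, p_back = leapfrog_step(x1, -p1, grad_phi, 0.1, mass_inv)
p_back = -p_back
```

The energy error is small but non-zero, which is why a Metropolis accept step is still needed.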

In practice we tend to combine adjacent half steps after the initial one

$$
  \lbr \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \rbr \circ
  \lbr \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \rbr \circ
  \dots \circ
  \lbr \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t} \rbr 
  =\\
  \Phi^{A}_{\frac{1}{2}\delta t} \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\delta t}
  \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\delta t}
  \dots
  \circ \Phi^{B}_{\delta t} \circ \Phi^{A}_{\frac{1}{2}\delta t}
$$
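
A sketch of the merged form, compared against naive repetition of single leapfrog steps. The function names, the quartic energy and the diagonal inverse mass matrix are hypothetical choices for illustration; `n_steps` is assumed to be at least one.

```python
import numpy as np

def leapfrog_step(x, p, grad_phi, dt, mass_inv):
    """Single leapfrog step: half momentum, full position, half momentum."""
    p = p - 0.5 * dt * grad_phi(x)
    x = x + dt * mass_inv @ p
    p = p - 0.5 * dt * grad_phi(x)
    return x, p

def leapfrog_trajectory(x, p, grad_phi, dt, n_steps, mass_inv):
    """n_steps >= 1 leapfrog steps with interior momentum half steps merged."""
    p = p - 0.5 * dt * grad_phi(x)        # initial half step in momentum
    for step in range(n_steps):
        x = x + dt * mass_inv @ p         # full step in position
        if step < n_steps - 1:
            p = p - dt * grad_phi(x)      # two merged half steps
    p = p - 0.5 * dt * grad_phi(x)        # final half step in momentum
    return x, p

# Hypothetical energy phi(x) = sum(x^4 / 4), chosen only for illustration
grad_phi = lambda x: x**3
mass_inv = np.diag([1.0, 0.5])
x0, p0 = np.array([0.5, -1.0]), np.array([1.0, 0.2])

x_merged, p_merged = leapfrog_trajectory(x0, p0, grad_phi, 0.05, 20, mass_inv)
x_naive, p_naive = x0, p0
for _ in range(20):
    x_naive, p_naive = leapfrog_step(x_naive, p_naive, grad_phi, 0.05, mass_inv)
```

The merged version needs `n_steps + 1` gradient evaluations rather than `2 * n_steps`, which matters when the gradient dominates the cost.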


### Resampling momenta to ensure ergodicity

Metropolis-Hastings updates with Hamiltonian dynamics proposals alone will not generally ensure ergodicity, as the chain remains constrained to a near-constant Hamiltonian surface.

Overcome by alternating with a different Markov transition operator which leaves joint distribution invariant.

In particular we can use any transition which leaves the conditional on the momenta given the positions invariant (c.f. Gibbs sampling). In case of standard HMC, momenta are independent of positions therefore resample independently from Gaussian distribution.
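
Putting the pieces together, a standard HMC transition alternates independent momentum resampling with a leapfrog proposal and a Metropolis accept step. The following is a sketch under stated assumptions, not a reference implementation: identity mass matrix, fixed step size and trajectory length, and a hypothetical correlated Gaussian target defined by `precision`.

```python
import numpy as np

def hmc_sample(phi, grad_phi, x0, n_samples, dt, n_steps, rng):
    """Standard HMC with identity mass matrix and full momentum resampling."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    n_accept = 0
    for i in range(n_samples):
        p = rng.standard_normal(x.size)       # resample p ~ N(0, I)
        h_current = phi(x) + 0.5 * p @ p
        # Leapfrog trajectory with merged interior half steps
        x_new = x
        p_new = p - 0.5 * dt * grad_phi(x_new)
        for step in range(n_steps):
            x_new = x_new + dt * p_new
            scale = 1.0 if step < n_steps - 1 else 0.5
            p_new = p_new - scale * dt * grad_phi(x_new)
        # Momentum negation is skipped: tau is symmetric in p and the
        # momentum is resampled at the start of the next iteration.
        h_proposed = phi(x_new) + 0.5 * p_new @ p_new
        # Accept with probability min(1, exp(H(z) - H(z')))
        if np.log(rng.uniform()) < h_current - h_proposed:
            x = x_new
            n_accept += 1
        samples[i] = x
    return samples, n_accept / n_samples

# Hypothetical target: zero-mean Gaussian with the precision matrix below
precision = np.array([[2.0, 0.6], [0.6, 1.0]])
phi = lambda x: 0.5 * x @ precision @ x
grad_phi = lambda x: precision @ x
rng = np.random.default_rng(2)
samples, accept_rate = hmc_sample(phi, grad_phi, np.zeros(2), 10000, 0.2, 5, rng)
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples.T)
```

For this target the sample covariance can be compared with `np.linalg.inv(precision)` as a sanity check.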

More general update with partial momentum refreshment from Horowitz (1991):

$$
    \vct{p}' = \cos\theta ~ \vct{n} + \sin\theta ~ \vct{p}
    \qquad
    \text{with }
    \vct{n} \sim \nrm{\cdot; \vct{0}, \mtx{M}}
$$
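
A minimal sketch of this partial refreshment, assuming the mass matrix is supplied as a Cholesky factor `mass_chol` (an illustrative name). With the convention above, $\theta = 0$ gives full resampling and $\theta = \pi/2$ leaves the momentum unchanged. The mass matrix and angle below are hypothetical.

```python
import numpy as np

def partial_momentum_refresh(p, theta, mass_chol, rng):
    """Return cos(theta) * n + sin(theta) * p with n ~ N(0, M), M = L L^T."""
    n = mass_chol @ rng.standard_normal(p.size)
    return np.cos(theta) * n + np.sin(theta) * p

# Hypothetical mass matrix and mixing angle, chosen only for illustration
mass_matrix = np.array([[2.0, 0.5], [0.5, 1.0]])
mass_chol = np.linalg.cholesky(mass_matrix)
rng = np.random.default_rng(3)

# Start from momenta distributed as N(0, M) and apply one partial refresh
momenta = (mass_chol @ rng.standard_normal((2, 20000))).T
refreshed = np.array(
    [partial_momentum_refresh(p, 0.3, mass_chol, rng) for p in momenta])
refreshed_cov = refreshed.T @ refreshed / len(refreshed)
```

Because $\cos^2\theta + \sin^2\theta = 1$, the refreshed momenta keep covariance $\mtx{M}$, so the update leaves the Gaussian momentum distribution invariant.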

## References and further reading

  * [**Hybrid Monte Carlo**, Duane et al. (1987)](http://www.sciencedirect.com/science/article/pii/037026938791197X)  
    *Paper which introduced HMC (with its original name)*
  * [**A Generalized Guided Hybrid Monte Carlo Algorithm**, Horowitz (1991)](http://www.sciencedirect.com/science/article/pii/0370269391908125)  
    *Original description of partial momentum refreshing*
  * [**Simulating Hamiltonian Dynamics**, Leimkuhler and Reich (2005)](http://ebooks.cambridge.org/ebook.jsf?bid=CBO9780511614118)  
    *Good reference textbook for details of properties of Hamiltonian dynamics.*
  * [**MCMC Using Hamiltonian Dynamics**, Neal (2012)](http://arxiv.org/abs/1206.1901)  
    *Extensive review of Hamiltonian Monte Carlo and various practical implementation issues.*