# Holographic Pixel Graph Neural Network

In [2]:
%run "model.py"

## Motivation

Generative models have been applied to solve Statistical Mechanics problems by learning to model the Boltzmann distribution of microscopic configurations given the energy functional. This project aims to explore the combination of both **flow-based** and **autoregressive** models together with ideas from **renormalization group** and **holography** to tackle **critical** (scale-free) spin systems.

### Technical Goals

* Develop **hierarchical flow-based and autoregressive models**, combining them in the hyperbolic space. (For hierarchical flow, see [NeuralRG](https://arxiv.org/pdf/1802.02840.pdf) and related works.)
  * Hierarchical flow learns the holographic mapping (i.e. wavelet transformation) that brings spin configurations to their wavelet encodings (like Haar wavelet encodings) in the holographic bulk.
  * Autoregressive model learns the base distribution of wavelet encodings in the holographic bulk instead of the spin configurations on the holographic boundary.

* Generalize [discrete flow](https://arxiv.org/pdf/1905.10347.pdf) and [integer flow](https://arxiv.org/pdf/1905.07376.pdf) to generic **non-Abelian discrete groups** (the previously proposed XOR and Mod-K transforms correspond to the $\mathbb{Z}_2$ and $\mathbb{Z}_K$ groups). Some future ideas [*Not implemented yet, the current project will use a fixed transformation (e.g. Haar wavelet)*]:
  * Forward computation allows two group elements to perform a controlled transform, such as $(g_1,g_2)\to(g_1,g_1 g_2)$ (can be left- or right-multiplication, inversion, conjugation ...). Non-Abelian groups may provide sufficient scrambling. Abelian groups can be embedded in non-Abelian groups.
  * Fuzzy group: focus on the distribution of group elements, which is continuous and can flow. Similar to the idea of Gumbel-softmax.
  
* Develop a more flexible autoregressive model based on message passing on **directed causal graphs** in the holographic space using graph convolutional network (GCN) techniques. This will allow us to apply the autoregressive model in the holographic bulk, where the causal relations depend on the flow-based model and can be quite involved. GCN gives us the flexibility to determine the neural network connectivity dynamically.

### Scientific Goals

* **Analyze the RG Flow (Down Sampling)**: the autoregressive model is good at marginalizing causal dependents, which correspond to the UV degrees of freedom. This can be used to analyze the RG flow after training.
* **Analyze the Scaling Behavior (Up Sampling)**: If we impose parameter sharing across RG scales, the model could learn a scale-invariant transformation rule that generalizes to larger systems. This will allow us to perform finite-size scaling analysis by up sampling.
* **Probe Operator Scaling Dimension**: using the wormhole idea in the holographic space by resampling UV latent variables. This provides a novel approach to obtain scaling dimensions without a notion of spacetime.
* **Speed up Monte Carlo(?)**: Need to measure the dynamic exponent to see.
* **Application - $S_N$ Models**: The $S_N$ spin model has $N!$ spin states (corresponding to $S_N$ group elements) on each site, a number that grows quickly with $N$. Local updates will be inefficient in this case; will a hierarchical autoregressive model be more efficient? These models will be important for understanding **entanglement transitions** in random quantum circuits.

## Model Design

### Architecture Overview

The model consists of the following parts:

<img src="./image/model.png" alt="model" width="360"/>

* A generative model $p(x)$ consisting of
  * A base model $p(z)$ realized as an **autoregressive model**, which uses graph convolutional network techniques to compute conditional distributions on a directed causal graph.
  * A stack of transformations containing
     * A **bijective encoding** (between one-hot and categorical)
     * A **renormalization group (RG) transformation** realized as a flow model (but currently fixed to be Haar wavelet transformation in this project).
* An **energy model** $E(x)$ must be provided as input to drive the training.
* All these modules are based on information provided by the infrastructure layer which contains:
  * A **group model** to provide basic functions of group operation and group function evaluation.
  * A **lattice model** to provide indexing of nodes and to construct the causal graph in the holographic bulk.
  
Finally, the model is trained to minimize the variational free energy.

### Lattice

`Lattice` represents a $d$-dimensional regular grid lattice of size $L$ (containing $L^d$ sites in total).

**Parameters:**
- `size` (int) - lattice size $L$
- `dimension` (int) - lattice dimension $d$

**Properties:**
- `sites` (int) - number of lattice sites, equal to $L^d$.
- `nodes` (list) - containing *latent* and *physical* nodes in the lattice system.
  - **Latent nodes**: $L^d$  nodes in the holographic bulk, hosting latent variables.
  - **Physical nodes**: $L^d$ nodes on the holographic boundary, hosting physical variables.
  - Each node is represented by a `Node` object.

Example: create a 2D lattice

In [2]:
latt = Lattice(4, 2)

Lattice(4x4 grid)

The `Lattice` object hosts a list of nodes ($2L^d$ nodes altogether).

In [3]:
len(latt.nodes)

32

The renormalization group transformation will be organized on a binary tree, which coarse grains along one dimension in each step. The coarse graining direction cycles through all dimensions repeatedly. The information flow forms an [H-tree](https://en.wikipedia.org/wiki/H_tree) fractal. 
- *Internal* (branch) nodes of the binary tree = *latent* nodes.
- *External* (leaf) nodes of the binary tree = *physical* nodes. Each physical node also carries a physical site index.

<img src="./image/htree.png" alt="index_systems" width="180"/>

#### Node

`Node` represents a single node in the lattice. 

**Properties:**
- `type` (str) - node type
  - `'lat'` - latent node,
  - `'phy'` - physical node.
- `ind` (int) - node index.
- `center` (Tensor) - $(x,y)$ coordinate projected to the boundary coordinate system.
- `generation` (int) - the generation of node.
- `parent` (Node) - points to the parent node (except for node 0).
- `children` (list of Nodes) - (*for latent node only*) a pair of children nodes [ch1, ch2].
- `site` - (*for physical node only*) site index of the physical node.

Example: node properties

In [5]:
latt = Lattice(4, 2)
print('ind type gen  children parent site')
for node in latt.nodes:
    print('{:3d} {:>4s} {:3d} {:>9s} {:>6s} {:>4s}'.format(
        node.ind, node.type, node.generation, 
        ','.join(str(c.ind) for c in node.children if c is not None), 
        str(node.parent.ind) if node.parent is not None else '', str(node.site) if node.site is not None else ''))

ind type gen  children parent site
  0  lat   0         1            
  1  lat   0       2,3      0     
  2  lat   1       4,6      1     
  3  lat   1       5,7      1     
  4  lat   2      8,12      2     
  5  lat   2      9,13      3     
  6  lat   2     10,14      2     
  7  lat   2     11,15      3     
  8  lat   3     16,24      4     
  9  lat   3     17,25      5     
 10  lat   3     18,26      6     
 11  lat   3     19,27      7     
 12  lat   3     20,28      4     
 13  lat   3     21,29      5     
 14  lat   3     22,30      6     
 15  lat   3     23,31      7     
 16  phy   4                8    0
 17  phy   4                9    8
 18  phy   4               10    2
 19  phy   4               11   10
 20  phy   4               12    4
 21  phy   4               13   12
 22  phy   4               14    6
 23  phy   4               15   14
 24  phy   4                8    1
 25  phy   4                9    9
 26  phy   4               10    3
 27  phy   4        
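The child indices printed above follow a regular bit pattern. The sketch below reproduces it in plain Python. It is inferred from the table only (not taken from `model.py`), and the helper names `generation`, `children` and `parent` are hypothetical: for a node `i >= 1` of generation `g` (so that `2**g <= i < 2**(g+1)`), the two children appear at `i + 2**g` and `i + 2**(g+1)`.

```python
def generation(i):
    """Generation of node i >= 1 as printed in the table (1 -> 0, 2..3 -> 1, 4..7 -> 2, ...)."""
    return i.bit_length() - 1

def children(i):
    """Child indices of latent node i >= 1, following the pattern seen in the table."""
    step = 1 << generation(i)          # 2**g
    return (i + step, i + 2 * step)

def parent(i):
    """Inverse of children() for i >= 2: undo whichever step created node i."""
    step = 1 << (generation(i) - 1)    # 2**g of the parent
    return i - step if i < 3 * step else i - 2 * step

# Rows of the table above: node 4 -> children 8,12; node 15 -> children 23,31.
print(children(4), children(15))
# Every child of the latent nodes 1..15 maps back to its parent.
print(all(parent(c) == i for i in range(1, 16) for c in children(i)))
```

Node 0 is special (it has the single child 1) and is not covered by these helpers.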

#### Haar Wavelet

On the lattice structure, we can define:
- The **shadow** of a node = the collection of physical nodes reachable from the node. It can be constructed recursively by combining the shadows of the children nodes.
$$S(i) =\left\{\begin{array}{cc}\bigcup_{j\in \text{ch}(i)} S(j) & \text{ch}(i)\neq \{\}, \\
\{i\} & \text{ch}(i)= \{\}, \\\end{array}\right.$$
- The **action range** of a latent node = the shadow of the 2nd child of the node, i.e. $A(i) = S(\text{ch}_2(i))$.

Each latent node $i$ labels a **Haar wavelet**, whose waveform is the indicator (0/1) encoding of the action range $A(i)$ of the node $i$. 
Define
- **Decoding matrix** $D$, such that $D_{ij}=\delta(i\in A(j))$. $D$ is a matrix of $0,1$. $D$ is also unimodular (i.e. $\det D =\pm 1$).
- **Encoding matrix** $E$, $E=D^{-1}$. Because $D$ is unimodular, $E$ is also an integer unimodular matrix. In fact, it is a matrix of $0,\pm1$.

With these matrices, the *Haar wavelet* **encoding** and **decoding** map can be defined as
$$\text{encode: }z_a=\prod_i x_i^{E_{a i}},\quad\text{decode: } x_i=\prod_a z_a^{D_{i a}}.$$

Example: action sites.

In [6]:
latt = Lattice(2, 2)
print('ind    action_sites')
for node in latt.nodes:
    if node.type == 'lat':
        print('{:3d} -> {}'.format(node.ind, node.action_sites()))

ind    action_sites
  0 -> [0, 1, 2, 3]
  1 -> [2, 3]
  2 -> [1]
  3 -> [3]


`wavelet_maps()` method provides the encoding and decoding maps (i.e. the $E$ and $D$ matrices).

In [7]:
latt.wavelet_maps()

(tensor([[ 1,  0,  0,  0],
         [-1,  0,  1,  0],
         [-1,  1,  0,  0],
         [ 0,  0, -1,  1]]),
 tensor([[1, 0, 0, 0],
         [1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 1]]))
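As a cross-check, the shadow and action-range definitions can be evaluated by hand for `Lattice(2, 2)`. The sketch below is plain Python, independent of `model.py`. The tree and the site assignment are assumptions read off from the printed `action_sites` output (node 0 has the single child 1 and acts on all sites), and the function names are hypothetical. It rebuilds $D$ from the action ranges and checks that the printed $E$ is its inverse.

```python
# Tree of Lattice(2, 2), inferred from the outputs above (an assumption):
# latent nodes 0..3, physical nodes 4..7.
children = {0: [1], 1: [2, 3], 2: [4, 6], 3: [5, 7]}
site = {4: 0, 5: 2, 6: 1, 7: 3}      # physical node -> site index

def shadow(i):
    """Sites of all physical nodes reachable from node i."""
    if i in site:
        return {site[i]}
    return set().union(*(shadow(c) for c in children[i]))

def action_sites(i):
    """Shadow of the 2nd child; the root, having one child, acts on every site."""
    kids = children[i]
    return sorted(shadow(kids[1] if len(kids) == 2 else kids[0]))

n = 4
# D[i][a] = 1 if site i lies in the action range of latent node a.
D = [[int(i in action_sites(a)) for a in range(n)] for i in range(n)]
# Encoding matrix as printed by wavelet_maps().
E = [[1, 0, 0, 0], [-1, 0, 1, 0], [-1, 1, 0, 0], [0, 0, -1, 1]]
ED = [[sum(E[a][i] * D[i][b] for i in range(n)) for b in range(n)] for a in range(n)]

print(D)    # matches the second tensor printed above
print(ED)   # identity matrix, so E = D^{-1}
```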

#### Causal Graph

The autoregressive model uses conditional probabilities to model the joint probability. Each conditional probability entails the underlying causal influence that a random variable will receive from its conditional variables. The causal relations form a *directed graph*, called the **causal graph**. Since we have mapped the RG transform to a binary tree universally, we only need to analyze the causal relations on the tree.

<img src="./image/causal_graph.png" alt="causal_graph" width="400"/>

On the lattice structure, we define:
- The **center** coordinate $\vec{r}_i$ of node $i$: projection of the node to the holographic boundary.
- The **generation** $g_i$ of node $i$: the graph distance of the node on the binary tree from the tree root. The root node has generation $g_i=0$, its children are of generation $g_i=1$, and so on.
- The **vicinity set** $V_\rho(i)$ of a given node $i$ includes the nodes whose centers are within a scaled radius from the given node. (When calculating distance, the periodic boundary condition of the physical space is assumed.)
$$V_\rho(i)=\{j|\Vert\vec{r}_j-\vec{r}_i\Vert /L < \rho 2^{-g_i/d}\},$$
  The dimensionless parameter $\rho$ controls the (relative) radius of the vicinity set.
- The **relevant nodes** $R_\rho(i)$ is the union of ancestors of all nodes in $V_\rho(i)$, 
$$R_\rho(i) = \bigcup_{j\in V_\rho(i)}\text{anc}(j).$$
  Physically, the relevant nodes are the nodes in the past light-cone of the vicinity set of a node.
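The membership test for the vicinity set can be sketched in a few lines. This is a minimal illustration, assuming the periodic distance is the usual minimum-image Euclidean distance; the function names and the sample centers are hypothetical and not taken from `model.py`.

```python
import math

def periodic_distance(r_j, r_i, L):
    """Euclidean distance on an L-periodic grid (minimum-image convention, assumed)."""
    total = 0.0
    for a, b in zip(r_j, r_i):
        delta = abs(a - b) % L
        delta = min(delta, L - delta)   # wrap around the boundary if shorter
        total += delta * delta
    return math.sqrt(total)

def in_vicinity(r_j, r_i, g_i, L, d, rho):
    """True if a node centered at r_j lies in V_rho(i), for node i at r_i with generation g_i."""
    return periodic_distance(r_j, r_i, L) / L < rho * 2 ** (-g_i / d)

# On a 4x4 lattice (L=4, d=2), a node of generation 2 sees a neighbor one site away.
print(in_vicinity((0, 0), (3, 0), g_i=2, L=4, d=2, rho=1.0))
# A deeper node (generation 4) has a smaller radius and does not see a far node.
print(in_vicinity((0, 0), (2, 2), g_i=4, L=4, d=2, rho=1.0))
```

The radius shrinks by a factor $2^{-1/d}$ per generation, so deeper (more UV) nodes condition on a smaller neighborhood.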


In [8]:
latt = Lattice(4, 2)
print('ind relevant_nodes')
for node in latt.nodes[1:]:
    if node.type == 'lat':
        print('{:3d} {}'.format(node.ind, [rel.ind for rel in latt.relevant_nodes(node, 1.5)]))

ind relevant_nodes
  1 []
  2 [1]
  3 [2, 1]
  4 [2, 1, 3]
  5 [2, 1, 3, 4]
  6 [5, 2, 3, 4, 1]
  7 [5, 2, 6, 3, 4, 1]
  8 [5, 2, 6, 3, 4, 1]
  9 [8, 7, 5, 2, 3, 4, 1]
 10 [8, 7, 2, 6, 3, 4, 1]
 11 [7, 5, 2, 9, 10, 6, 3, 1]
 12 [8, 5, 2, 9, 6, 3, 4, 1]
 13 [8, 7, 5, 2, 9, 3, 4, 12, 1]
 14 [7, 2, 10, 6, 11, 3, 4, 12, 1]
 15 [7, 5, 2, 10, 6, 11, 3, 13, 1, 14]


The **causal graph** is a graph of all latent nodes. For every pair of nodes $(j,i)$ with $j\in R_\rho(i)$, a *directed edge* $j\to i$ is assigned from $j$ to $i$. Each edge $j\to i$ is classified by the relationship between $j$ and $i$, which is characterized by the pair of integers $(g_{j}-g_{(j,i)},g_{i}-g_{(j,i)})$, where $g_{(j,i)}=\max_{k\in \text{anc}(j)\cap\text{anc}(i)}g_k$ is the generation of the deepest common ancestor of $j$ and $i$ (each node counting as its own ancestor here, so that a parent-child edge has type $(0,1)$).

`causal_graph(radius=1.)` method returns the causal graph as a Graph object.

In [9]:
graph = latt.causal_graph()
graph.type_dict, graph.adjacency_matrix().to_dense()

({1: (0, 1), 2: (0, 2), 3: (0, 3), 4: (1, 1), 5: (1, 3), 6: (2, 3), 7: (3, 3)},
 tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 3, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 3, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 3, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 3, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 3, 2, 5, 1, 6, 0, 0, 4, 7, 0, 0, 0, 0, 0, 0],
         [0, 3, 5, 2, 6, 1, 0, 0, 7, 4, 0, 0, 0, 0, 0, 0],
         [0, 3, 2, 5, 0, 0, 1, 6, 0, 0, 4, 7, 0, 0, 0, 0],
         [0, 3, 5, 2, 0, 0, 6, 1, 0
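The edge classification can be cross-checked against `type_dict` using only the parent column of the node table. The sketch below reads $g_{(j,i)}$ as the generation of the deepest common ancestor, with each node counted as its own ancestor. This reading is an assumption, chosen because it reproduces the printed edge types, and the helper names are hypothetical.

```python
# Parent of each latent node of Lattice(4, 2), copied from the node table.
parent = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3, 6: 2, 7: 3,
          8: 4, 9: 5, 10: 6, 11: 7, 12: 4, 13: 5, 14: 6, 15: 7}

def generation(i):
    """Generation as printed in the node table (1 -> 0, 2..3 -> 1, 4..7 -> 2, 8..15 -> 3)."""
    return i.bit_length() - 1

def lineage(i):
    """Node i followed by its ancestors down to node 1 (node 0 carries no edges)."""
    chain = [i]
    while chain[-1] > 1:
        chain.append(parent[chain[-1]])
    return chain

def relation(j, i):
    """Pair (g_j - g, g_i - g), with g the generation of the deepest common ancestor."""
    common = set(lineage(j)) & set(lineage(i))
    g = max(generation(k) for k in common)
    return (generation(j) - g, generation(i) - g)

type_dict = {1: (0, 1), 2: (0, 2), 3: (0, 3), 4: (1, 1), 5: (1, 3), 6: (2, 3), 7: (3, 3)}
# Nonzero entries of row 12 of the printed adjacency matrix: source node -> edge type.
row_12 = {1: 3, 2: 2, 3: 5, 4: 1, 5: 6, 8: 4, 9: 7}
matches = all(relation(j, 12) == type_dict[e] for j, e in row_12.items())
print(matches)
```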

Increasing the radius adds more edges to the causal graph.

In [10]:
latt.causal_graph(1.), latt.causal_graph(1.5), latt.causal_graph(2.)

(Graph(16x16, 51 edges of 7 types),
 Graph(16x16, 85 edges of 9 types),
 Graph(16x16, 101 edges of 9 types))

### Graph

`Graph` represents a directed graph and provides methods to extend nodes and generate weighted adjacency matrices in sparse form. The graph contains 
- $N_t$ number of target nodes, 
- $N_s$ number of source nodes,
- $N_e$ number of edges from source nodes to target nodes.

The edge set $\{j_k\xrightarrow{e_k}i_k|k=0:N_e\}$ can be encoded by the indices $[[i_0,i_1,\cdots],[j_0,j_1\cdots]]$ and the edge types $[e_0,e_1,\cdots]$.

**Parameters:**

- `dims` (tuple) - $(N_t, N_s)$ record the numbers of target and source nodes.
- `indices` (LongTensor) - $2\times N_e$ dimensional tensor specifying the indices of the target nodes (in row-0) and the source nodes (in row-1).
- `edge_types` (LongTensor) - $N_e$ dimensional tensor specifying the type of the corresponding edge.
- `source_depths` (LongTensor, optional) - $N_s$ dimensional tensor specifying depth assignment of source nodes.

**Properties:**

- `edge_depths` (LongTensor) - $N_e$ dimensional tensor specifying depth assignment of edges, which follows the depth assignment of the source node of each edge.
- `max_depth` (int) - maximal depth.
- `max_edge_type` (int) - maximal edge type.

Example: construct a graph and show the adjacency matrix.

In [11]:
graph = Graph((4,4), torch.tensor([[2,3,3],[1,1,2]]), torch.tensor([1,1,2]))
graph.adjacency_matrix().to_dense()

tensor([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 2, 0]])

#### Depths

If `source_depths` is not given, it will be calculated. Note that source depth can only be calculated when $N_t=N_s$.

**Algorithm:**

Let $A$ be the adjacency matrix. $u$ is a binary (0/1) vector marking the active vertices, initially set to all ones. $d$ is the depth vector, initially set to all zeros. Repeat the following steps:
  - $u'= \text{bool}(A u > 0)$ gives the target vertices under the adjacency map,
  - if $\Vert u'\Vert_1=\Vert u\Vert_1$ stop, otherwise update $d=d+u'$ and $u=u'$.
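The iteration can be written out in plain Python as a sanity check. This is a sketch of the algorithm as described here, not the `Graph` implementation; it assumes the active vector is replaced by $u'$ after each step, and it stores edges as `(target, source)` pairs like the `indices` tensor. The function name is hypothetical.

```python
def source_depths(n, edges):
    """Depth of each vertex of an n-vertex directed graph given as (target, source) pairs."""
    active = [1] * n      # u: every vertex starts active
    depth = [0] * n       # d: every depth starts at zero
    while True:
        # u' = bool(A u > 0): vertices reached in one step from an active source
        reached = [0] * n
        for target, source in edges:
            if active[source]:
                reached[target] = 1
        if sum(reached) == sum(active):
            break
        depth = [d + u for d, u in zip(depth, reached)]
        active = reached
    return depth

edges = [(2, 1), (3, 1), (3, 2)]             # the example graph: 1->2, 1->3, 2->3
depths = source_depths(4, edges)
edge_depths = [depths[source] for _, source in edges]
print(depths)        # [0, 0, 1, 2]
print(edge_depths)   # [0, 0, 1]
```

The active set can only shrink from one step to the next, so the loop terminates.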

In [12]:
graph.source_depths

tensor([0, 0, 1, 2])

from which the maximal depth is

In [13]:
graph.max_depth

2

`edge_depths` can be inferred from `source_depths` directly given `indices`. Edge depth = depth of the source node of the edge.

In [14]:
graph.edge_depths

tensor([0, 0, 1])

#### Add Self-Loop

`add_self_loops(start=0)` method prepends self loops to the graph: the self loops take edge type 1 and the types of the existing edges are shifted up by one. It returns a new Graph object.

In [15]:
graph_sl = graph.add_self_loops()
graph_sl.adjacency_matrix().to_dense()

tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 2, 1, 0],
        [0, 2, 3, 1]])

The starting node for the self loop to be added can be specified by the `start` parameter. This can be used to avoid adding self loops to the first node (as the first latent node corresponds to the global symmetry and is a special node).

In [16]:
graph_sl = graph.add_self_loops(start=1)
graph_sl.adjacency_matrix().to_dense()

tensor([[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 2, 1, 0],
        [0, 2, 3, 1]])

#### Expand

`expand(target_dim, source_dim)` method extends the graph by the specified target $d_t$ and source $d_s$ dimensions, such that $(N_t, N_s)\to(N_t d_t, N_s d_s)$ and $N_e\to N_e d_t d_s$. It will return a new Graph object. The extended edges are indexed by $(k,a,b)$ with $k=0:N_e$, $a=0:d_t$, $b=0:d_s$,
$$\begin{split}&(j,b)\xrightarrow{(e,a,b)} (i,a)\\ \Rightarrow & d_s j + b\xrightarrow{((e-1)d_t + a)d_s + b +1} d_t i + a.\end{split}$$

The source depth assignments will simply be repeated. The edge depth assignment will be reconstructed according to the new indices and source depth assignments.

In [17]:
graph_ext = graph.expand(2, 3)
graph_ext.source_depths, graph_ext.adjacency_matrix().to_dense()

(tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]),
 tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  1,  2,  3,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  4,  5,  6,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  1,  2,  3,  7,  8,  9,  0,  0,  0],
         [ 0,  0,  0,  4,  5,  6, 10, 11, 12,  0,  0,  0]]))
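The index rule for `expand` can be checked against this output with a few lines of plain Python. The sketch stores each edge as a `(target, source, type)` triple, which is a representation chosen for illustration rather than the one used by `Graph`; the function names are hypothetical.

```python
def expand(edges, d_t, d_s):
    """Expand (target i, source j, type e) edges by feature dimensions d_t and d_s."""
    out = []
    for i, j, e in edges:
        for a in range(d_t):
            for b in range(d_s):
                new_type = ((e - 1) * d_t + a) * d_s + b + 1
                out.append((d_t * i + a, d_s * j + b, new_type))
    return out

def dense(n_t, n_s, edges):
    """Dense n_t x n_s matrix holding the edge type at [target][source]."""
    m = [[0] * n_s for _ in range(n_t)]
    for i, j, e in edges:
        m[i][j] = e
    return m

base = [(2, 1, 1), (3, 1, 1), (3, 2, 2)]      # the example graph with its edge types
big = dense(8, 12, expand(base, 2, 3))
for row in big[4:]:
    print(row)
```

The last four rows reproduce the nonzero rows of the matrix printed above.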

Self loops can be added before expanding.

In [18]:
graph_ext = graph.add_self_loops(start=1).expand(2, 3)
graph_ext.source_depths, graph_ext.adjacency_matrix().to_dense()

(tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]),
 tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  1,  2,  3,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  4,  5,  6,  0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  7,  8,  9,  1,  2,  3,  0,  0,  0],
         [ 0,  0,  0, 10, 11, 12,  4,  5,  6,  0,  0,  0],
         [ 0,  0,  0,  7,  8,  9, 13, 14, 15,  1,  2,  3],
         [ 0,  0,  0, 10, 11, 12, 16, 17, 18,  4,  5,  6]]))

#### Sparse Matrix

`sparse_matrix(vector, depth=None)` scatters the vector elements to the adjacency matrix according to the edge type. The sparse matrix will be on the same device as the given vector.

In [19]:
vector = (torch.rand(graph_ext.max_edge_type)*100).round().to(device)

tensor([ 5., 56., 27., 84., 67., 84., 73., 60., 39., 86., 47., 56., 50., 84.,
        22., 30., 69., 95.], device='cuda:0')

Without specific depth.

In [20]:
graph_ext.sparse_matrix(vector).to_dense()

tensor([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  5., 56., 27.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 84., 67., 84.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 73., 60., 39.,  5., 56., 27.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 86., 47., 56., 84., 67., 84.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 73., 60., 39., 50., 84., 22.,  5., 56., 27.],
        [ 0.,  0.,  0., 86., 47., 56., 30., 69., 95., 84., 67., 84.]],
       device='cuda:0')

With a specific depth.

In [21]:
graph_ext.sparse_matrix(vector, 0).to_dense()

tensor([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  5., 56., 27.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 84., 67., 84.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 73., 60., 39.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 86., 47., 56.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 73., 60., 39.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 86., 47., 56.,  0.,  0.,  0.,  0.,  0.,  0.]],
       device='cuda:0')
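The scatter rule behind these outputs is that an edge of type $e$ receives the vector element at position $e-1$, and a depth argument keeps only the edges whose source node has that depth. A minimal dense sketch in plain Python follows; the function name, the triple-based edge list and the sample vector are hypothetical, and the real method returns a sparse tensor instead.

```python
def scatter(n_t, n_s, edges, vector, source_depths=None, depth=None):
    """Dense matrix with vector[e - 1] on every (target, source, type e) edge.

    If depth is given, only edges whose source node has that depth are kept.
    """
    m = [[0.0] * n_s for _ in range(n_t)]
    for i, j, e in edges:
        if depth is None or source_depths[j] == depth:
            m[i][j] = vector[e - 1]
    return m

edges = [(2, 1, 1), (3, 1, 1), (3, 2, 2)]   # the unexpanded example graph
depths = [0, 0, 1, 2]                       # its source depths
vector = [10.0, 20.0]                       # one value per edge type (made-up numbers)

full = scatter(4, 4, edges, vector)
only_depth_1 = scatter(4, 4, edges, vector, depths, depth=1)
print(full[3])           # [0.0, 10.0, 20.0, 0.0]
print(only_depth_1[3])   # [0.0, 0.0, 20.0, 0.0]
```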

### Group

`Group` represents a discrete group $G$ specified by the multiplication table. Group elements are labeled by integers ranging from $0$ to $|G|-1$. The element 0 is always treated as the identity element of the group. 

**Properties:**
- `order` (int) - group order $|G|$ = number of group elements.
- `mul_table` (Parameter) - group multiplication table, as a $|G|\times|G|$ dimensional matrix.
- `inv_table` (Parameter) - group inversion table, as a $|G|$ dimensional vector.

#### Group Algebra

Create an $S_3$ group.

In [23]:
G = SymmetricGroup(3)

Group(order=6)

Group multiplication and inversion tables.

In [24]:
G.mul_table, G.inv_table

(tensor([[0, 1, 2, 3, 4, 5],
         [1, 0, 4, 5, 2, 3],
         [2, 3, 0, 1, 5, 4],
         [3, 2, 5, 4, 0, 1],
         [4, 5, 1, 0, 3, 2],
         [5, 4, 3, 2, 1, 0]]),
 tensor([0, 1, 2, 4, 3, 5]))

Multiplying two tensors element-wise following the group multiplication rule.

In [25]:
a = torch.tensor([[0, 1, 2], [3, 4, 5]]).to(device)
b = torch.tensor([[5, 4, 3], [2, 1, 0]]).to(device)
G.mul(a, b)

tensor([[5, 2, 1],
        [5, 5, 5]], device='cuda:0')

Product of a tensor along a given axis. Axis specified by `dim`. Keep dimension can be set by `keepdim`.

In [26]:
G.prod(a, dim=0)

tensor([3, 2, 4], device='cuda:0')

In [27]:
G.prod(a, dim=-1, keepdim=True)

tensor([[4],
        [5]], device='cuda:0')

Group inversion of a tensor (element-wise).

In [28]:
G.inv(a)

tensor([[0, 1, 2],
        [4, 3, 5]], device='cuda:0')

The inversion can alternatively be implemented by negating the group elements and translating the "negative" elements back to the legitimate range.

In [29]:
G.mod(-a)

tensor([[0, 1, 2],
        [4, 3, 5]], device='cuda:0')
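The table lookups behind `mul`, `prod` and `inv` can be reproduced without torch. The sketch below copies the $S_3$ tables printed above and uses plain nested lists; the function names are hypothetical. The lookup convention `MUL[g][h]` $= g\,h$ (row index as the left factor) is consistent with the `G.mul(a, b)` output above.

```python
from functools import reduce

# Multiplication and inversion tables of S_3, copied from the output above.
MUL = [[0, 1, 2, 3, 4, 5],
       [1, 0, 4, 5, 2, 3],
       [2, 3, 0, 1, 5, 4],
       [3, 2, 5, 4, 0, 1],
       [4, 5, 1, 0, 3, 2],
       [5, 4, 3, 2, 1, 0]]
INV = [0, 1, 2, 4, 3, 5]

def mul(g, h):
    """Group product g h by table lookup."""
    return MUL[g][h]

def prod(elements):
    """Ordered product g_0 g_1 ... of a sequence, starting from the identity 0."""
    return reduce(mul, elements, 0)

a = [[0, 1, 2], [3, 4, 5]]
b = [[5, 4, 3], [2, 1, 0]]
elementwise = [[mul(g, h) for g, h in zip(ra, rb)] for ra, rb in zip(a, b)]
column_prod = [prod(col) for col in zip(*a)]   # analogue of G.prod(a, dim=0)
row_prod = [prod(row) for row in a]            # analogue of G.prod(a, dim=-1)
print(elementwise)   # [[5, 2, 1], [5, 5, 5]]
print(column_prod)   # [3, 2, 4]
print(row_prod)      # [4, 5]
print(mul(1, 2), mul(2, 1))   # 4 3: the group is non-Abelian
```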

#### Group Function

`forward(input, val_table)` method performs element-wise evaluation of the *group function* $f:G\to \mathbb{R}$ specified by a value table. The default value table for symmetric group corresponds to the cycle counting function.

In [30]:
G(a, G.default_val_table())

tensor([[3., 2., 2.],
        [1., 1., 2.]], device='cuda:0')

In [31]:
G(a, val_table=torch.tensor([1.,0.,0.,-0.5,-0.5,0.]))

tensor([[ 1.0000,  0.0000,  0.0000],
        [-0.5000, -0.5000,  0.0000]], device='cuda:0')

### Energy Model

`Model` represents an energy model that describes the statistical mechanical system. It provides a function to evaluate the energy of a configuration.

**Parameters:**
- `lattice` (Lattice) - lattice on which the model is defined.
- `group` (Group) - group of the on-site degree of freedom.

**Properties:**
- `energy` (EnergyTerms) - a list of energy terms. It provides a forward method to evaluate the energy of a given configuration, given the lattice and the group.

#### Construction

Consider a 2D Ising model on a square lattice
$$H= -J \sum_{i}(\sigma_i\sigma_{i+\hat{x}} + \sigma_i\sigma_{i+\hat{y}}).$$
The Hamiltonian can be typed in as (see the following subsection for an explanation of the notation)

In [32]:
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]));

The Hamiltonian at this point is just an abstract notation. It must be combined with the specific `Lattice` and `Group` setup to form a concrete energy model. The energy model itself is a torch module (without any trainable parameters), which can be used to evaluate the energy of any spin configuration.

In [33]:
model = Model(H(0.5), Lattice(4, 2), SymmetricGroup(2)).to(device)

Model(
  (lattice): Lattice(4x4 grid)
  (group): Group(order=2)
  (energy): EnergyTerms(
    (0): TwoBody(G -> [-0.5, 0.5] across [1, 0])
    (1): TwoBody(G -> [-0.5, 0.5] across [0, 1])
  )
)

Let us generate some spin configurations.

In [34]:
x = torch.randint(2, (2, 4, 4)).to(device)

tensor([[[0, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 0, 0, 1],
         [0, 0, 0, 1]],

        [[1, 0, 1, 1],
         [1, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 1, 1]]], device='cuda:0')

Evaluate the energy of these spin configurations with the energy model.

In [35]:
model(x)

tensor([-4., -4.], device='cuda:0')
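These two energies can be reproduced by hand. The sketch below evaluates the same Ising Hamiltonian in plain Python, assuming periodic boundary conditions and the $\mathbb{Z}_2$ convention that $g_i^{-1}g_j$ is the XOR of the two labels. The function names are hypothetical and this is not how `Model` is implemented.

```python
def two_body_energy(x, shift, values):
    """Sum of values[g_i^{-1} g_j] over all sites i, with j = i + shift (periodic)."""
    L = len(x)
    total = 0.0
    for r in range(L):
        for c in range(L):
            g_i = x[r][c]
            g_j = x[(r + shift[0]) % L][(c + shift[1]) % L]
            total += values[g_i ^ g_j]   # in Z_2, g_i^{-1} g_j is XOR
    return total

def ising_energy(x, J):
    """H = -J * sum_i (s_i s_{i+x} + s_i s_{i+y}) with value table [1, -1]."""
    table = [1.0, -1.0]
    return -J * (two_body_energy(x, (1, 0), table) + two_body_energy(x, (0, 1), table))

# The two configurations generated above.
x0 = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
x1 = [[1, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
print(ising_energy(x0, 0.5), ising_energy(x1, 0.5))   # -4.0 -4.0
```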

#### Hamiltonian Scripting System

To make Hamiltonians easy to formulate, we introduce a scripting system. Physical Hamiltonians are always sums of local energy terms. Each energy term is a subclass of `nn.Module` and each Hamiltonian is a subclass of `nn.ModuleList` (which contains the collection of energy terms). In this way, the evaluation of the total energy of the Hamiltonian can be distributed to each energy term in parallel.

We introduce two kinds of energy terms
* `OnSite`: on-site energy term $E_1(g_i)$,
* `TwoBody`: two-body interaction term $E_2(g_i,g_j)$.

More types of interaction terms can be introduced under this framework if necessary. These energy terms are group functions: $E_1:G\to\mathbb{R}$, $E_2:G\times G\to\mathbb{R}$. These group functions can be specified by value tables, which enumerate the value that each group element maps to. For example, for the $\mathbb{Z}_2=\{0,1\}$ group ($0$-identity, $1$-flip), if we want to specify
$$E_1(0)=+1, E_1(1)=-1,$$
the value table is $[+1,-1]$. Such a term can be created as follows

In [36]:
OnSite([1,-1])

OnSite(G -> [1.0, -1.0])

We assume the two-body term always takes the form of
$$E_2(g_i,g_j)=E_2(g_i^{-1}g_j),$$
such that we only need a single-variable group function, using the same value table representation. For example,

In [37]:
TwoBody([1,0],[1,-1])

TwoBody(G -> [1.0, -1.0] across [1, 0])

The first argument of the two-body term specifies the relative direction from site-$i$ to site-$j$. If the value table is not specified, the default group function will be used:
* For generic `Group`, the default group function is the delta function (like Potts model), which maps the identity element to 1 and the others to 0.
* For `SymmetricGroup`, the default group function is the cycle counting function (count the number of permutation cycles).

We can add, subtract, scalar multiply and negate the energy terms. Energy terms added together are represented as a collection of terms in a list (`nn.ModuleList`), which corresponds to a Hamiltonian. For example

In [38]:
-2.8 * OnSite() + 5.2 * (TwoBody([1,0]) + TwoBody([0,1]))

EnergyTerms(
  (0): TwoBody(5.2 across [1, 0])
  (1): TwoBody(5.2 across [0, 1])
  (2): OnSite(-2.8)
)

### Haar Transformation

`HaarTransform` is a bijective map between the spin configuration and the Haar wavelet encoding. It is used to realize a fixed version of the *invertible RG transform* (or the *holographic mapping*). In the future, it could be replaced by a trainable *discrete flow* model.
- `_call(z)`: Decoding Map (Generation Flow) takes the wavelet component $z$ to the spin configuration $x$ following 
$$x_i = \prod_a z_a^{D_{i a}}.$$
- `_inverse(x)`: Encoding Map (Renormalization Flow) takes the spin configuration $x$ to the wavelet component $z$ following 
$$z_a = \prod_i x_i^{E_{a i}}.$$
Note that $E$ contains $-1$ elements, which should be treated as group inversion.

**Parameters:**
- `lattice` (Lattice) - lattice on which the model is defined.
- `group` (Group) - group of the on-site degree of freedom.

**Properties:**
- `encoding_mat` (LongTensor) - the encoding matrix $E$.
- `decoding_mat` (LongTensor) - the decoding matrix $D$.

#### Example

Consider an $S_3$ group. Generate some wavelet (latent) configuration

In [39]:
G = SymmetricGroup(3).to(device)
z = torch.randint(G.order, (16,)).to(device)

tensor([2, 0, 5, 4, 4, 0, 5, 4, 5, 5, 4, 2, 5, 4, 3, 0], device='cuda:0')

Transform the 16 wavelet components to the spin configuration on a 4 x 4 lattice.

In [40]:
ht = HaarTransform(Lattice(4, 2), G)
x = ht(z)

tensor([[2, 4, 4, 3],
        [5, 0, 2, 1],
        [2, 4, 5, 3],
        [2, 5, 1, 1]], device='cuda:0')

Transform back and verify that the encoder and decoder are inverse to each other.

In [41]:
ht.inv(x)

tensor([2, 0, 5, 4, 4, 0, 5, 4, 5, 5, 4, 2, 5, 4, 3, 0], device='cuda:0')
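The same round trip can be checked exhaustively on the smaller `Lattice(2, 2)`, using the $E$ and $D$ matrices and the $S_3$ tables printed earlier. The sketch below is plain Python and not the `HaarTransform` implementation. It assumes that the products are taken in order of increasing index, which matters for a non-Abelian group, and the function names are hypothetical.

```python
from itertools import product

# S_3 tables and the 2x2 wavelet maps, copied from the outputs above.
MUL = [[0, 1, 2, 3, 4, 5], [1, 0, 4, 5, 2, 3], [2, 3, 0, 1, 5, 4],
       [3, 2, 5, 4, 0, 1], [4, 5, 1, 0, 3, 2], [5, 4, 3, 2, 1, 0]]
INV = [0, 1, 2, 4, 3, 5]
D = [[1, 0, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]]
E = [[1, 0, 0, 0], [-1, 0, 1, 0], [-1, 1, 0, 0], [0, 0, -1, 1]]

def power(g, k):
    """g**k for k in {-1, 0, 1}: the inverse, the identity, or g itself."""
    if k == 1:
        return g
    return INV[g] if k == -1 else 0

def apply(mat, v):
    """out[r] = ordered product over c of v[c] ** mat[r][c] (increasing c, assumed)."""
    out = []
    for row in mat:
        acc = 0
        for g, k in zip(v, row):
            acc = MUL[acc][power(g, k)]
        out.append(acc)
    return out

def decode(z):
    """Wavelet components z -> spin configuration x (flattened over sites)."""
    return apply(D, z)

def encode(x):
    """Spin configuration x -> wavelet components z; -1 entries of E act as inversion."""
    return apply(E, x)

round_trip_ok = all(encode(decode(list(z))) == list(z)
                    for z in product(range(6), repeat=4))
print(decode([2, 0, 5, 4]), round_trip_ok)
```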

#### Bulk Effective Energy

Because the Haar wavelet transform is a fixed transformation, the latent variables are bijectively related to the physical variables, so anything that we want to measure in the bulk (polarization, correlation ...) can in principle be measured in Monte Carlo by transforming the data set. In fact, we even know the bulk energy function. For a 4x4 lattice, it looks like

$$\begin{split}
E&= \tau_4 + \tau_5 + \tau_6 +\cdots +\tau_{15}\\
&+\tau_1\tau_4+\tau_1\tau_6+\tau_2\tau_8+\tau_2\tau_{10}+\cdots\\
&+\tau_4\tau_8\tau_9+\tau_5\tau_{10}\tau_{11}+\cdots\\
&+\tau_2\tau_4\tau_5\tau_9+\cdots\\
&+\tau_1\tau_2\tau_3\tau_5\tau_{11}\tau_{14}+\cdots
\end{split}$$

The longest multi-spin interaction is of the order $\sim\ln L$ where $L$ is the system size. Such terms arise from the two-body interaction of physical spins across two big trees. Our model should be deep enough to resolve the nontrivial conditional distribution due to these multi-spin interactions in the latent space. 



### One-Hot Categorical Transformation

`OneHotCategoricalTransform` is a **bijective embedding** that converts between the group elements and their one-hot embeddings. This serves as an interface between the RG transformation (which works with group elements for efficiency) and the autoregressive model (which works with one-hot embeddings for training performance).

#### Example

In [42]:
z_cat = torch.randint(6, (2, 3)).to(device)
z_cat

tensor([[1, 5, 0],
        [1, 1, 0]], device='cuda:0')

In [43]:
oc = OneHotCategoricalTransform(6)
z_emb = oc.inv(z_cat)
z_emb

tensor([[[0., 1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1.],
         [1., 0., 0., 0., 0., 0.]],

        [[0., 1., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0.]]], device='cuda:0')

In [44]:
oc(z_emb)

tensor([[1, 5, 0],
        [1, 1, 0]], device='cuda:0')
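The two directions of this map are simple enough to spell out in plain Python. The sketch below uses nested lists instead of tensors; the function names are hypothetical and it is not the `OneHotCategoricalTransform` implementation.

```python
def to_one_hot(z_cat, num_classes):
    """Category labels -> one-hot rows (the direction of oc.inv in the example)."""
    return [[[1.0 if k == g else 0.0 for k in range(num_classes)] for g in row]
            for row in z_cat]

def to_categorical(z_emb):
    """One-hot rows -> category labels, via the position of the largest entry."""
    return [[max(range(len(v)), key=v.__getitem__) for v in row] for row in z_emb]

z_cat = [[1, 5, 0], [1, 1, 0]]          # the labels from the example above
z_emb = to_one_hot(z_cat, 6)
print(z_emb[0][1])                      # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(to_categorical(z_emb) == z_cat)   # True
```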

### Graph Convolution Layer

`GraphConvLayer` provides a graph convolution layer, given the graph structure. It performs separate linear maps for different types of edges, and propagates messages along the edge direction

$$y_a=\sum_{s\to a}\left(w_{e(s\to a)}x_s+ b_{e(s\to a)}\right),$$

where $e(s\to a)$ denotes the type of the edge $s\to a$ and the summation goes through all edges of the directed graph that end at node $a$. $w$ and $b$ are trainable weights and biases that depend on the edge type. 
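The propagation rule can be illustrated with scalar features in plain Python. This is a deliberately reduced sketch: the actual layer carries `in_features` by `out_features` weights per edge type and uses sparse matrix multiplication, while here each node holds a single number. It assumes every edge contributes its own bias, in line with the bias being realized as a bias matrix acting on an all-ones vector. The function name and the numeric values are hypothetical.

```python
def graph_conv(edges, weight, bias, x, n_targets):
    """y[target] = sum over incoming edges of weight[e] * x[source] + bias[e]."""
    y = [0.0] * n_targets
    for target, source, e in edges:     # e is the edge type
        y[target] += weight[e] * x[source] + bias[e]
    return y

edges = [(2, 1, 1), (3, 1, 1), (3, 2, 2)]   # (target, source, edge type)
weight = {1: 0.5, 2: -1.0}                  # one weight per edge type (made-up values)
bias = {1: 0.0, 2: 1.0}                     # one bias per edge type (made-up values)
x = [1.0, 2.0, 3.0, 4.0]                    # one scalar feature per node

y = graph_conv(edges, weight, bias, x, 4)
print(y)   # [0.0, 0.0, 1.0, -1.0]
```

Edges of the same type share parameters, so the two type-1 edges out of node 1 send the same message to nodes 2 and 3.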

**Parameters:**
- `graph` (Graph) - causal graph (the unexpanded base graph) specifying the directed edges and edge types.
- `in_features` (int) - number of input features (source dimension)
- `out_features` (int) - number of output features (target dimension)
- `bias` (bool) - whether or not to include bias
- `self_loop` (bool or int) - if True/False: whether to add self-loops to the graph; if set to an integer: add self-loops starting from the node index given by the integer.

**Properties:**
- `weight_graph` - extended graph for weight matrix.
- `bias_graph` (optional) - extended graph for bias matrix (bias is realized as the bias matrix multiplying an all-ones vector).

`forward(x, depth=None)` method implements the linear map using sparse matrix multiplication (`y = weight_matrix @ x + bias_matrix @ unit`). 
- The input data `x` must be reshaped to a matrix, with all features collected to the 0-dimension, and all batch samples collected to the 1-dimension. The layer will not transpose and untranspose the data. Data interface should be realized at higher level (in the parent class `GraphConvNet`).
- If `depth` is None, input will be forwarded through all edges in the causal graph. If `depth` is specified, input will only be forwarded through edges of that depth (those that originate from source vertices of that depth).

Example:

In [45]:
gcl = GraphConvLayer(Lattice(2, 2).causal_graph(), 3, 2).to(device)

GraphConvLayer(
  in_features=3, out_features=2, bias=True, self_loop=True
  Graph(4x4, 7 edges of 3 types)
)

Create some input of shape `[num_vertices * in_features, batch_size]`

In [46]:
x = torch.randn(gcl.source_dim,5).to(device)

tensor([[ 5.9447e-01,  5.2489e-01,  7.3775e-01,  4.2769e-01, -3.1429e-01],
        [ 2.5169e-01, -3.5346e-01,  5.9166e-01, -2.0530e-01, -4.5000e-01],
        [-9.7972e-01,  5.5345e-01, -4.1122e-01, -8.0314e-01,  2.3169e+00],
        [-2.0738e-02, -4.3500e-01, -6.1127e-01,  1.0509e-01, -6.3991e-01],
        [ 5.9683e-01,  1.3351e+00,  1.8012e-01, -2.8274e-01, -6.0659e-01],
        [ 1.1868e+00,  4.4157e-01,  1.2528e+00, -1.3956e+00,  2.0113e-01],
        [-2.0581e+00, -9.1185e-01,  1.1170e+00,  1.6426e+00, -1.0980e-01],
        [ 5.8024e-01, -5.5019e-04,  1.4261e+00,  6.4957e-01, -2.4456e-01],
        [ 7.2026e-01,  1.3925e+00,  4.2983e-01, -1.3020e+00, -3.1377e-01],
        [-5.9194e-01,  6.0635e-02,  6.2187e-01, -3.5552e-01, -1.2573e+00],
        [-1.4416e-02,  6.1286e-01,  3.0148e-02,  1.6762e+00,  1.6216e-02],
        [-4.3150e-02, -5.4225e-01, -6.2300e-01,  2.2091e-01,  1.5071e+00]],
       device='cuda:0')

Forward the input all at once.

In [47]:
gcl(x)

tensor([[ 0.6215,  0.3852,  0.7122,  0.4704,  0.2697],
        [-0.1533, -0.4902, -0.3725, -0.1220, -0.8606],
        [ 0.6359,  0.8707,  0.4858,  0.4544,  0.2782],
        [-0.7204, -0.5583, -0.6109,  0.0922, -0.2144],
        [-0.0677, -0.4301,  0.2895,  0.6176,  0.0414],
        [-0.5260, -0.7050, -0.6042, -0.1023,  0.2385],
        [-0.6395, -1.1409, -1.1408,  0.7629, -0.1346],
        [-0.1931, -0.0516,  0.4576, -0.6535, -0.4288]], device='cuda:0',
       grad_fn=<AddBackward0>)

Forward the input from a specific depth.

In [48]:
gcl(x, 1)

tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.5741,  0.4155,  0.9535,  0.7903,  0.4239],
        [-0.3347, -0.5885, -0.7607, -0.2467, -0.1851],
        [-0.4663, -0.9943, -1.0202, -0.0491, -0.1571],
        [ 0.2278,  0.3193,  0.5278, -0.2465, -0.2732]], device='cuda:0',
       grad_fn=<AddBackward0>)

Gradients can backpropagate to the parameters.

In [49]:
y = gcl(x).sum()
y.backward()
[p.grad for p in gcl.parameters()]

[tensor([-1.4736,  5.7891,  3.8095, -1.4736,  5.7891,  3.8095, -3.2036,  2.4454,
          3.3735, -3.2036,  2.4454,  3.3735, -0.3201,  2.4108,  0.9268, -0.3201,
          2.4108,  0.9268], device='cuda:0'),
 tensor([20., 20., 10., 10.,  5.,  5.], device='cuda:0')]

### Graph Convolution Network

`GraphConvNet` provides generic **non-linear causal maps** (maps that respect causality) by stacking multiple graph convolution layers (on the causal graph) and non-linear activation layers

$$\begin{split}
\zeta_a^{(0)} &= \textsf{onehot}(z_a),\\
\zeta_a^{(k+1)} &= \sum_{b\to a} \phi\big(w_{e(b\to a)}\zeta_b^{(k)}+b_{e(b\to a)}\big)\\
\end{split}$$

In principle, with enough iterations causal influences propagate throughout the entire system. After $n$ iterations, the output $\zeta_a^{(n)}$ depends causally on all variables $z_b$ that can affect $z_a$ within $n$ steps of causal influence.

$$\zeta_a^{(n)} = f(\{z_b\}_{b\to\cdots\to a}).$$

Given that $\zeta_a^{(0)}$ is the one-hot encoding of the input configuration, if we treat $\zeta_a^{(n)}$ as a score function, we can create a model for the conditional distribution

$$\ln p(z_a|\{z_b\}_{b\to\cdots\to a}) = \zeta_a^{(0)}\cdot\ln\text{softmax}[\zeta_a^{(n)}(\{z_b\}_{b\to\cdots\to a})],$$

which can be combined to establish an autoregressive model

$$p(z)=\prod_{a}p(z_a|\{z_b\}_{b\to\cdots\to a}).$$
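The two formulas above can be sketched in a few lines of plain Python. The score vectors below are fixed, made-up numbers standing in for $\zeta_a^{(n)}$; in the actual model they would be produced by the graph convolution network from the parents of each node.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def log_prob(z, scores):
    """ln p(z) = sum_a onehot(z_a) . ln softmax(zeta_a).

    z      : list of class indices z_a
    scores : list of per-node score vectors (stand-ins for zeta_a^(n))
    """
    total = 0.0
    for z_a, zeta_a in zip(z, scores):
        # Dotting with the one-hot vector just picks out entry z_a.
        total += log_softmax(zeta_a)[z_a]
    return total

scores = [[0.0, 0.0], [1.0, -1.0], [0.3, 0.3]]   # hypothetical scores for 3 nodes
z = [0, 1, 0]
lp = log_prob(z, scores)
print(lp)
```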

**Parameters:**
- `graph` (Graph) - causal graph (the unexpanded base graph) specifying the directed edges and edge types.
- `features` (list of int) - number of features from input to hidden to output.
- `bias` (bool) - whether or not to include bias
- `nonlinearity` (str) - the nonlinear activation layer to use, specified by the layer name in `torch.nn`.

**Properties:**
- `layers` (ModuleList) - a list hosting `GraphConvLayer` and nonlinear activation layers.

*Comments*:
* Note that the first layer should not have self-connections (otherwise the causal relation is no longer directed), but all subsequent layers are allowed to have self-connections.
* Node 0 is somewhat special in that it is always sampled independently from a uniform distribution (one can see that it corresponds to the global symmetry of the spin model). It also has no causal relations with other nodes (a consequence of the Goldstone theorem: the order parameter should have zero excitation energy, thus it cannot interact with other modes and hence cannot establish any causal relation).

`forward(x, depth=None, cache=None)` method implements the nonlinear multilayer graph convolution.
- `forward(x)` without depth specification forwards the input through all depths at once.
- `forward(x, depth, cache)` forwards at a specific depth. For depth-specific forward, intermediate data is cached. If the cache is not provided, the method creates a cache and returns it together with the output; the cache should then be passed to the subsequent depth-specific forward.

<img src="./image/forward.png" alt="forward" width="450"/>

Example: create a graph convolution network.

In [50]:
gcn = GraphConvNet(Lattice(2, 2).causal_graph(), [2, 4, 3]).to(device)

GraphConvNet(
  (layers): ModuleList(
    (0): GraphConvLayer(
      in_features=2, out_features=4, bias=True, self_loop=False
      Graph(4x4, 3 edges of 2 types)
    )
    (1): Tanh()
    (2): GraphConvLayer(
      in_features=4, out_features=3, bias=True, self_loop=1
      Graph(4x4, 6 edges of 3 types)
    )
  )
)

Forward together.

In [51]:
x = torch.rand(1,4,2).to(device)
gcn(x)

tensor([[[ 0.0000,  0.0000,  0.0000],
         [-0.2935, -0.2522,  0.2858],
         [ 0.0701, -0.7526, -0.0573],
         [ 0.0932, -0.5417, -0.3421]]], device='cuda:0',
       grad_fn=<ViewBackward>)

Forward by a specific depth. For depth-specific forward, the cache will be returned together.

In [52]:
y, cache = gcn(x, 0)
y, [c.shape for c in cache]

(tensor([[[ 0.0000,  0.0000,  0.0000],
          [-0.2935, -0.2522,  0.2858],
          [ 0.4987, -0.4220, -0.3123],
          [ 0.4987, -0.4220, -0.3123]]], device='cuda:0',
        grad_fn=<ViewBackward>),
 [torch.Size([8, 1]),
  torch.Size([16, 1]),
  torch.Size([16, 1]),
  torch.Size([12, 1])])

Cache should be used for later depth-specific forward. 

In [53]:
y, cache = gcn(x, 1, cache)
y, [c.shape for c in cache]

(tensor([[[ 0.0000,  0.0000,  0.0000],
          [-0.2935, -0.2522,  0.2858],
          [ 0.0701, -0.7526, -0.0573],
          [ 0.4581, -0.3964, -0.6755]]], device='cuda:0',
        grad_fn=<ViewBackward>),
 [torch.Size([8, 1]),
  torch.Size([16, 1]),
  torch.Size([16, 1]),
  torch.Size([12, 1])])

Put together, one can forward by depth iteratively.

In [54]:
cache = None
for depth in range(gcn.graph.max_depth+1):
    y, cache = gcn(x, depth, cache)
    print('---- depth: {} ----'.format(depth))
    print(y)

---- depth: 0 ----
tensor([[[ 0.0000,  0.0000,  0.0000],
         [-0.2935, -0.2522,  0.2858],
         [ 0.4987, -0.4220, -0.3123],
         [ 0.4987, -0.4220, -0.3123]]], device='cuda:0',
       grad_fn=<ViewBackward>)
---- depth: 1 ----
tensor([[[ 0.0000,  0.0000,  0.0000],
         [-0.2935, -0.2522,  0.2858],
         [ 0.0701, -0.7526, -0.0573],
         [ 0.4581, -0.3964, -0.6755]]], device='cuda:0',
       grad_fn=<ViewBackward>)
---- depth: 2 ----
tensor([[[ 0.0000,  0.0000,  0.0000],
         [-0.2935, -0.2522,  0.2858],
         [ 0.0701, -0.7526, -0.0573],
         [ 0.0932, -0.5417, -0.3421]]], device='cuda:0',
       grad_fn=<ViewBackward>)


Check that the depth-wise forward and the one-shot forward produce the same output (up to roundoff error).

In [55]:
gcn(x) - y

tensor([[[ 0.0000e+00,  0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  3.7253e-09],
         [ 0.0000e+00,  0.0000e+00, -2.9802e-08]]], device='cuda:0',
       grad_fn=<SubBackward0>)

Gradients can backpropagate to the parameters.

In [56]:
y.sum().backward()
[p.grad for p in gcn.parameters()]

[tensor([ 0.6074,  2.2038,  0.0089,  0.0322,  0.4353,  1.5795,  0.0107,  0.0390,
          0.6330,  0.4696,  0.2397,  0.1778,  0.5139,  0.3813, -0.2388, -0.1772],
        device='cuda:0'),
 tensor([ 2.2446,  0.0328,  1.6087,  0.0397,  0.9287,  0.3516,  0.7539, -0.3503],
        device='cuda:0'),
 tensor([ 0.1142,  0.1749, -0.1665,  0.5369,  0.1142,  0.1749, -0.1665,  0.5369,
          0.1142,  0.1749, -0.1665,  0.5369,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         -0.0132, -0.1192, -0.1378,  0.2089, -0.0132, -0.1192, -0.1378,  0.2089,
         -0.0132, -0.1192, -0.1378,  0.2089], device='cuda:0'),
 tensor([3., 3., 3., 2., 2., 2., 1., 1., 1.], device='cuda:0')]

### Autoregressive Model

Autoregressive models are usually named "pixel-..." because of their element-wise sampling approach, where the "..." is the neural network architecture used to model the conditional distribution. Since we model the conditional distribution with a graph convolutional network (GCN), it seems fair to call the resulting autoregressive model a pixel-GCN.

`Autoregressive` combines lattice data with a graph convolution network to create the autoregressive model. It provides methods to sample configurations and evaluate log probabilities.

**Parameters:**
- `lattice` (Lattice) - lattice system.
- `num_classes` (int) - number of classes = group order.
- `hidden_features` (list of int) - number of features for hidden layers.
- `bias` (bool) - whether or not to include bias
- `nonlinearity` (str) - the nonlinear activation layer to use, specified by the layer name in `torch.nn`.
- `radius` (float) - the radius to specify the causal graph.

Example: create an autoregressive model.

In [57]:
ar = Autoregressive(Lattice(2, 2), SymmetricGroup(2).order, [4]).to(device)

Autoregressive(
  (gcn): GraphConvNet(
    (layers): ModuleList(
      (0): GraphConvLayer(
        in_features=2, out_features=4, bias=True, self_loop=False
        Graph(4x4, 3 edges of 2 types)
      )
      (1): Tanh()
      (2): GraphConvLayer(
        in_features=4, out_features=2, bias=True, self_loop=1
        Graph(4x4, 6 edges of 3 types)
      )
    )
  )
)

`sample(sample_size)` method generates samples. Gradient can not propagate back through sampling, because in-place update of the sample is involved in the generation process.

In [58]:
x = ar.sample(2)

tensor([[[0., 1.],
         [0., 1.],
         [1., 0.],
         [1., 0.]],

        [[1., 0.],
         [1., 0.],
         [0., 1.],
         [0., 1.]]], device='cuda:0')
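The element-wise sampling can be pictured as ancestral sampling over the causal graph: nodes are filled in depth by depth, each drawn from a conditional that only sees nodes already sampled. The following is a minimal sketch with a hypothetical three-node causal structure and a toy conditional in place of the GCN; it is not the `sample` implementation in `model.py`.

```python
import math
import random

# Hypothetical causal structure: each node lists its parents,
# and nodes are grouped by depth (one node per depth here).
parents = {0: [], 1: [0], 2: [0, 1]}
depth_groups = [[0], [1], [2]]

def conditional(a, z):
    """Toy stand-in for softmax(zeta_a): depends only on the parents of a."""
    s = sum(1.0 if z[b] == 1 else -1.0 for b in parents[a])
    scores = [-0.5 * s, 0.5 * s]
    norm = sum(math.exp(v) for v in scores)
    return [math.exp(v) / norm for v in scores]

def sample(rng):
    """Ancestral sampling: fill in z depth by depth, accumulating ln p(z)."""
    z, logp = {}, 0.0
    for group in depth_groups:
        for a in group:
            p = conditional(a, z)
            z[a] = 1 if rng.random() < p[1] else 0
            logp += math.log(p[z[a]])
    return z, logp

def log_prob(z):
    """Re-evaluate ln p(z) for a complete configuration in a single pass."""
    return sum(math.log(conditional(a, z)[z[a]]) for a in parents)

z, logp = sample(random.Random(0))
print(z, logp, log_prob(z))
```

The log probability accumulated during sampling agrees with the one recomputed from the finished configuration, which is the same consistency that a separate `log_prob` pass relies on.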

`log_prob(sample)` calculates the log probability of the sample by forwarding the sample once again.

In [59]:
log_prob = ar.log_prob(x)

tensor([-2.7041, -2.8296], device='cuda:0', grad_fn=<SumBackward1>)

The log probability supports gradient backpropagation.

In [60]:
log_prob.sum().backward()
[p.grad for p in ar.parameters()]

[tensor([ 0.1791,  0.0250,  0.2089, -0.1092, -0.2546,  0.4728, -0.0006,  0.0406,
         -0.0385,  0.1674, -0.0315,  0.0738,  0.0504, -0.0440, -0.0632,  0.0261],
        device='cuda:0'),
 tensor([ 0.2041,  0.0997,  0.2182,  0.0400,  0.1289,  0.0424,  0.0064, -0.0371],
        device='cuda:0'),
 tensor([-0.9185,  0.8317,  0.0832, -0.5207,  0.9185, -0.8317, -0.0832,  0.5207,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         -0.5009,  0.3460,  0.1353, -0.4177,  0.5009, -0.3460, -0.1353,  0.4177],
        device='cuda:0'),
 tensor([ 0.1453, -0.1453,  0.0298, -0.0298,  0.1758, -0.1758], device='cuda:0')]

#### Advantages [*Yet to be demonstrated*]

The major innovation is to put the pixel-GCN in the holographic bulk and use it to model the distribution of the Haar wavelet encodings. What could be the advantage of this approach?

* **Resolve the criticality**: the holographic mapping brings a scale-free system to a local system (with an emergent scale set by the hyperbolic radius and the critical exponent). This can be seen from the correlation function of two spins at distance $r$ on the holographic boundary
$$C(r)\sim r^{-\alpha} \sim e^{-d/\xi},$$
where $d=R\ln r$ is the geodesic distance through the bulk and $\xi=R/\alpha$ is an emergent length scale. The complexity of modeling correlations at all scales is reduced to modeling correlations locally in the bulk. This argument justifies our assumption that only a limited number of local causal relations need to be considered. 

* **Shorten the causal chain**: conventional approaches like pixel-CNN have unnatural causal structures (why must a single pixel causally depend on its upper half-plane?). The natural way to think about generating an image is to start by painting the outline first and then add the details. In this way, the scale itself becomes the emergent time of the generation process, which imposes a natural causal structure in the holographic bulk. A remarkable feature is that the holographic bulk has a hyperbolic (tree-like) geometry, such that **time is short**, i.e. the causal chain has length at most $\sim\log L$ (logarithmic in system size), and the causal cone has limited width (like the past light cone in an expanding universe, where light cannot keep up with the collapse of the universe when we look backwards). This greatly reduces the model complexity for large systems and enables more efficient sampling and evaluation.
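Both bullet points can be checked with a few lines of arithmetic. The sketch below uses illustrative values for $\alpha$ and $R$ (they are not fitted to anything in this notebook) and assumes that each level of the hierarchy halves the linear size; the `max_depth` actually reported by `causal_graph` also depends on the chosen radius.

```python
import math

alpha = 0.25   # illustrative exponent (1/4 is the 2D Ising spin-spin value)
R = 1.0        # illustrative hyperbolic radius
xi = R / alpha # emergent length scale in the bulk

rows = []
for r in [2, 8, 32, 128]:
    d = R * math.log(r)   # geodesic distance through the bulk
    rows.append((r, r ** (-alpha), math.exp(-d / xi)))

for r, power_law, bulk_decay in rows:
    print(f"r = {r:4d}: r^-alpha = {power_law:.6f}, exp(-d/xi) = {bulk_decay:.6f}")

# Length of the causal chain if every level halves the linear size (an assumption).
chain_lengths = {L: int(math.log2(L)) for L in [4, 16, 64, 256]}
print(chain_lengths)
```

The power law on the boundary and the exponential decay in the bulk are the same numbers, and the chain length grows only logarithmically with the system size.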

Compared to the dense version of the autoregressive model, some connections outside the causal cone are removed. For example, the groups {8,9,10,11} and {12,13,14,15} have no direct mutual connections (their correlations are mediated by other nodes). More precisely speaking, they do have mutual information but they do not have *conditional* mutual information. This weakens the model, but that could be the price to pay for efficiency.

| model     | connections     | parameters    |
|-----------|-----------------|---------------|
| dense     | $\sim N^2$      | $\sim N^2$    |
| pixel-GCN | $\sim N \log N$ | $\sim \log N$ |
| pixel-CNN | $\sim N$        | $\sim 1$      |
| pixel-RNN | $\sim N$        | $\sim 1$      |

Their performance should be further compared in the future.

### Holographic Pixel GNN

`HolographicPixelGNN` puts all components together to create the interface module.

**Parameters:**
- `model` (Model) - energy model containing information of lattice, group and Hamiltonian.
- `hidden_features` (list of int) - number of features for hidden layers.
- `bias` (bool) - whether or not to include bias
- `nonlinearity` (str) - the nonlinear activation layer to use, specified by the layer name in `torch.nn`.
- `radius` (float) - the radius to specify the causal graph.

Create a holographic pixel-GNN model. It has the following components.

In [61]:
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.5), Lattice(2, 2), SymmetricGroup(2)), hidden_features = [3]).to(device)

HolographicPixelGNN(
  (model): Model(
    (lattice): Lattice(2x2 grid)
    (group): Group(order=2)
    (energy): EnergyTerms(
      (0): TwoBody(G -> [-0.5, 0.5] across [1, 0])
      (1): TwoBody(G -> [-0.5, 0.5] across [0, 1])
    )
  )
  (generator): Autoregressive(
    (gcn): GraphConvNet(
      (layers): ModuleList(
        (0): GraphConvLayer(
          in_features=2, out_features=3, bias=True, self_loop=False
          Graph(4x4, 3 edges of 2 types)
        )
        (1): Tanh()
        (2): GraphConvLayer(
          in_features=3, out_features=2, bias=True, self_loop=1
          Graph(4x4, 6 edges of 3 types)
        )
      )
    )
  )
)

Draw samples from the model.

In [62]:
x = model.sample(2)
x

tensor([[[1, 1],
         [1, 0]],

        [[0, 1],
         [0, 0]]], device='cuda:0')

Evaluate log probabilities of the samples.

In [63]:
model.log_prob(x)

tensor([-3.8696, -2.3435], device='cuda:0', grad_fn=<SumBackward1>)

Evaluate energies of the samples.

In [64]:
model.energy(x)

tensor([0., 0.], device='cuda:0')

Transform the samples to Haar wavelet configurations.

In [65]:
model.haar.inv(x)

tensor([[1, 0, 0, 1],
        [0, 0, 1, 0]], device='cuda:0')

`loss(sample_size)` method calculates the loss function over a set of samples specified by the sample size.

In [66]:
loss = model.loss(100)

tensor(0.3887, device='cuda:0', grad_fn=<MeanBackward0>)

Gradient can propagate back to parameters.

In [67]:
loss.backward()
[p.grad for p in model.parameters()]

[tensor([ 0.3909,  0.2479,  0.0942,  0.0597, -0.4486, -0.4112,  0.0793,  0.1341,
          0.0144,  0.0255, -0.0770, -0.2184], device='cuda:0'),
 tensor([ 0.6388,  0.1539, -0.8598,  0.2134,  0.0398, -0.2954], device='cuda:0'),
 tensor([ 0.1589,  0.2060, -0.0349, -0.1589, -0.2060,  0.0349,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0396,  0.0401,  0.0323, -0.0396,
         -0.0401, -0.0323], device='cuda:0'),
 tensor([-1.5194,  1.5194, -1.0137,  1.0137, -0.3542,  0.3542], device='cuda:0')]

#### Loss Function

**Reverse KL with log-trick**. The goal is to minimize the difference between the model distribution $q_\theta(x)$ and the target distribution $p(x) \propto e^{-E(x)}$ by minimizing the reverse KL divergence (see [Wu, Wang, Zhang 2019](https://arxiv.org/pdf/1809.10606.pdf) for more details)

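$$\begin{split}\mathcal{L}&=\mathsf{KL}(q_\theta||p)\\
&=\sum_{x} q_\theta(x) \ln \frac{q_\theta(x)}{p(x)}\\
&=\sum_{x}q_\theta(x)(E(x)+\ln q_\theta(x))+\ln Z, 
\end{split}$$

where $Z=\sum_x e^{-E(x)}$ normalizes the target distribution. Since $\ln Z$ does not depend on $\theta$, it is dropped from the loss from here on. All the remaining parameter dependence is in the model distribution $q_\theta$. The gradient of the loss function with respect to the parameters is given by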
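Because $p(x)=e^{-E(x)}/Z$ with $Z=\sum_x e^{-E(x)}$, the sum $\sum_x q_\theta(x)(E(x)+\ln q_\theta(x))$ equals the KL divergence minus $\ln Z$, so it is an upper bound on the exact free energy $-\ln Z$. A small numerical check on a made-up four-state system (energies and $q$ chosen arbitrarily for illustration):

```python
import math

E = [0.0, 1.0, 1.0, 3.0]    # toy energies for four configurations
q = [0.4, 0.3, 0.2, 0.1]    # an arbitrary normalized model distribution

Z = sum(math.exp(-e) for e in E)
p = [math.exp(-e) / Z for e in E]

kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
variational_free_energy = sum(qi * (ei + math.log(qi)) for qi, ei in zip(q, E))
exact_free_energy = -math.log(Z)

print(kl, variational_free_energy, exact_free_energy)
```

The gap between the variational and the exact free energy is exactly the KL divergence, which is why the free energy reported during training can be compared directly against the exact value.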

$$\begin{split}\partial_\theta\mathcal{L}&= \partial_\theta \sum_{x}q_\theta(x)(E(x)+\ln q_\theta(x))\\
&= \sum_{x}[(\partial_\theta q_\theta(x))(E(x)+\ln q_\theta(x))+q_\theta(x)\partial_\theta \ln q_\theta(x)]\\
\end{split}$$

The last term can be dropped because

$$\sum_x q_\theta(x)\partial_\theta \ln q_\theta(x) = \sum_x \partial_\theta q_\theta(x)=\partial_\theta\sum_x q_\theta(x)=\partial_\theta 1 = 0,$$

the remaining term reads

$$\begin{split}\partial_\theta\mathcal{L}&= \sum_{x}(\partial_\theta q_\theta(x))(E(x)+\ln q_\theta(x))\\
&= \sum_{x}(\partial_\theta q_\theta(x))R(x)\\
&= \mathbb{E}_{x\sim q_\theta}\big[(\partial_\theta \ln q_\theta(x))R(x)\big]\\
\end{split}$$

with $R(x)=E(x)+\ln q_\theta(x)$ playing the role of a cost signal (a negative reward) in the language of reinforcement learning. The gradient signal $\partial_\theta \ln q_\theta(x)$ is weighted by $R(x)$, so that when $R(x)$ is large for a configuration $x$, gradient descent decreases the log-likelihood $\ln q_\theta(x)$ of that configuration; the optimization thereby lowers the variational free energy.

However, for finite batches the last term does not vanish exactly, so we should not simply drop it. Instead, we introduce a Lagrange multiplier to cancel the unphysical gradient signal that tries to change the normalization of $q_\theta$. This amounts to subtracting from $R(x)$ a baseline value $b=\mathbb{E}_{x\sim q_\theta} R(x)$, which can be estimated within each batch. The baseline subtraction helps reduce the variance of the gradient.
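To make the derivation concrete, the following sketch evaluates the score-function gradient exactly (by enumeration rather than by sampling) for a toy softmax distribution over four states. The energies and parameters are made up, and the softmax parametrization is an assumption chosen because its score $\partial_{\theta_k}\ln q(x)=\delta_{xk}-q_k$ is known in closed form. It checks that the expected gradient does not depend on the baseline and that it matches a finite-difference gradient of the loss.

```python
import math

E = [0.0, 1.0, 1.0, 3.0]   # toy energies, one per configuration

def softmax(theta):
    m = max(theta)
    w = [math.exp(t - m) for t in theta]
    s = sum(w)
    return [wi / s for wi in w]

def loss(theta):
    q = softmax(theta)
    return sum(qi * (Ei + math.log(qi)) for qi, Ei in zip(q, E))

def score_gradient(theta, baseline):
    """Exact E_{x~q}[ d ln q(x)/d theta_k * (R(x) - baseline) ]."""
    q = softmax(theta)
    R = [Ei + math.log(qi) for qi, Ei in zip(q, E)]
    n = len(theta)
    # For a softmax parametrization, d ln q(x) / d theta_k = delta_{xk} - q_k.
    return [sum(q[x] * ((x == k) - q[k]) * (R[x] - baseline) for x in range(n))
            for k in range(n)]

def numerical_gradient(theta, eps=1e-6):
    """Central finite differences of the loss."""
    grad = []
    for k in range(len(theta)):
        up = list(theta); up[k] += eps
        dn = list(theta); dn[k] -= eps
        grad.append((loss(up) - loss(dn)) / (2 * eps))
    return grad

theta = [0.3, -0.2, 0.5, 0.1]
q = softmax(theta)
b = sum(qi * (Ei + math.log(qi)) for qi, Ei in zip(q, E))   # baseline E_q[R]
g_plain = score_gradient(theta, 0.0)
g_baseline = score_gradient(theta, b)
g_numeric = numerical_gradient(theta)
print(g_plain, g_baseline, g_numeric)
```

In expectation the baseline makes no difference; its benefit only shows up in the sampled estimator, where it reduces the fluctuations from batch to batch.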

## Model Training

### Training

Set up a model and link to an optimizer.

In [73]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [8, 8], radius = 1.5).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

Start training

In [74]:
batch_size = 100
echo = 100
cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.
for k in range(2000):
    loss, meanfree, stdfree = model.loss(batch_size, return_statistics=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cum_loss += loss.item()
    cum_meanfree += meanfree
    cum_stdfree += stdfree
    if (k+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f} ±{:8.4f}'.format(k+1, cum_loss/echo, cum_meanfree/echo, cum_stdfree/echo))
        cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.

  100 loss:  -0.2314, free energy: -14.7981 ±  0.7171
  200 loss:   0.0825, free energy: -15.1021 ±  0.5016
  300 loss:   0.0591, free energy: -15.2293 ±  0.4914
  400 loss:   0.0678, free energy: -15.3115 ±  0.4409
  500 loss:   0.0574, free energy: -15.3446 ±  0.4095
  600 loss:   0.0264, free energy: -15.3584 ±  0.3960
  700 loss:   0.0453, free energy: -15.3710 ±  0.3856
  800 loss:   0.0482, free energy: -15.3783 ±  0.3955
  900 loss:   0.0433, free energy: -15.3670 ±  0.3856
 1000 loss:   0.0432, free energy: -15.3751 ±  0.3675
 1100 loss:   0.0210, free energy: -15.3802 ±  0.3589
 1200 loss:   0.0582, free energy: -15.3893 ±  0.3818
 1300 loss:   0.0410, free energy: -15.4078 ±  0.3516
 1400 loss:   0.0033, free energy: -15.4208 ±  0.3344
 1500 loss:   0.0652, free energy: -15.4009 ±  0.3858
 1600 loss:   0.0442, free energy: -15.4287 ±  0.3289
 1700 loss:   0.0191, free energy: -15.4189 ±  0.3643
 1800 loss:   0.0288, free energy: -15.4320 ±  0.3271
 1900 loss:   0.0473, free e

The model converges to a free energy of -15.42, while the exact value is -15.52, a relative error of about 0.6%. This appears to saturate the representation power of the pixel-GCN. What causes the mismatch, and how can it be improved?
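For a system this small, a reference free energy $F=-\ln Z$ can be obtained by brute-force enumeration of all $2^{16}$ configurations. The sketch below assumes a $4\times4$ lattice with periodic boundaries, spins $\pm1$, and one bond term per site in each of the two directions; whether this matches the conventions of `Model` and `TwoBody` in `model.py` exactly is an assumption, so the printed number should be compared against the quoted exact value rather than taken as its source.

```python
import math

def exact_free_energy(L=4, J=0.440686793):
    """F = -ln Z for an Ising model on an L x L periodic lattice (beta absorbed in J)."""
    N = L * L
    # Each site contributes one bond to its right and one to its lower neighbor.
    bonds = []
    for i in range(L):
        for j in range(L):
            s = i * L + j
            bonds.append((s, i * L + (j + 1) % L))
            bonds.append((s, ((i + 1) % L) * L + j))
    log_weights = []
    for config in range(2 ** N):
        spins = [1 if (config >> k) & 1 else -1 for k in range(N)]
        energy = -J * sum(spins[a] * spins[b] for a, b in bonds)
        log_weights.append(-energy)
    # Log-sum-exp for numerical stability.
    m = max(log_weights)
    return -(m + math.log(sum(math.exp(t - m) for t in log_weights)))

F = exact_free_energy()
print(F)
```

The enumeration takes on the order of a second in pure Python and does not scale beyond very small lattices, but it is enough to validate the training curves above.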

### Attempts to Improve

#### Extend the Causal Graph

One conjecture is that the loss is due to the missing causal connections. We can complete the causal connections and retrain. The causal graph scales with radius as:

In [80]:
latt = Lattice(4, 2)
for r in [1.,1.2,1.5,2.,3.,5.]:
    graph = latt.causal_graph(r)
    print('r = {}: {}, max_depth = {}'.format(r, graph, graph.max_depth))

r = 1.0: Graph(16x16, 51 edges of 7 types), max_depth = 5
r = 1.2: Graph(16x16, 67 edges of 9 types), max_depth = 7
r = 1.5: Graph(16x16, 85 edges of 9 types), max_depth = 10
r = 2.0: Graph(16x16, 101 edges of 9 types), max_depth = 12
r = 3.0: Graph(16x16, 105 edges of 9 types), max_depth = 14
r = 5.0: Graph(16x16, 105 edges of 9 types), max_depth = 14


With $r\geq 3$ the causal relation is complete.

In [81]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [8, 8], radius = 3.).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

In [82]:
batch_size = 100
echo = 100
cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.
for k in range(2000):
    loss, meanfree, stdfree = model.loss(batch_size, return_statistics=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cum_loss += loss.item()
    cum_meanfree += meanfree
    cum_stdfree += stdfree
    if (k+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f} ±{:8.4f}'.format(k+1, cum_loss/echo, cum_meanfree/echo, cum_stdfree/echo))
        cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.

  100 loss:  -0.4596, free energy: -14.6877 ±  0.8562
  200 loss:   0.0217, free energy: -15.1392 ±  0.4337
  300 loss:   0.0867, free energy: -15.1900 ±  0.3865
  400 loss:   0.0949, free energy: -15.1332 ±  0.4544
  500 loss:   0.1035, free energy: -15.2369 ±  0.4765
  600 loss:   0.0407, free energy: -15.3104 ±  0.4476
  700 loss:   0.0558, free energy: -15.3249 ±  0.4105
  800 loss:   0.0513, free energy: -15.3621 ±  0.3938
  900 loss:   0.0345, free energy: -15.3556 ±  0.4020
 1000 loss:   0.0569, free energy: -15.3631 ±  0.4116
 1100 loss:   0.0722, free energy: -15.3677 ±  0.4034
 1200 loss:   0.0523, free energy: -15.3754 ±  0.3992
 1300 loss:   0.0702, free energy: -15.3546 ±  0.4190
 1400 loss:   0.0262, free energy: -15.3819 ±  0.3673
 1500 loss:   0.0572, free energy: -15.3878 ±  0.3633
 1600 loss:   0.0976, free energy: -15.3410 ±  0.4291
 1700 loss:   0.0478, free energy: -15.3749 ±  0.4134
 1800 loss:   0.0572, free energy: -15.3956 ±  0.3828
 1900 loss:   0.0428, free e

The performance is even worse: the model trains slowly and gets trapped in a local minimum. The causal connections are probably not the issue.

#### Expand the Features

Another possibility is to increase the number of hidden features.

In [90]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [12, 12], radius = 1.5).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

In [93]:
batch_size = 100
echo = 100
cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.
for k in range(2000):
    loss, meanfree, stdfree = model.loss(batch_size, return_statistics=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cum_loss += loss.item()
    cum_meanfree += meanfree
    cum_stdfree += stdfree
    if (k+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f} ±{:8.4f}'.format(k+1, cum_loss/echo, cum_meanfree/echo, cum_stdfree/echo))
        cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.

  100 loss:   0.0523, free energy: -15.4052 ±  0.4019
  200 loss:   0.0535, free energy: -15.4066 ±  0.3529
  300 loss:   0.0543, free energy: -15.4080 ±  0.3700
  400 loss:   0.0546, free energy: -15.4276 ±  0.3403
  500 loss:   0.0726, free energy: -15.4121 ±  0.3731
  600 loss:   0.0598, free energy: -15.3932 ±  0.3839
  700 loss:   0.0798, free energy: -15.3742 ±  0.4184
  800 loss:   0.0671, free energy: -15.4210 ±  0.3666
  900 loss:   0.0499, free energy: -15.4031 ±  0.3581
 1000 loss:   0.0849, free energy: -15.3191 ±  0.5003
 1100 loss:   0.0813, free energy: -15.3735 ±  0.4123
 1200 loss:   0.0528, free energy: -15.3977 ±  0.4044
 1300 loss:   0.1155, free energy: -15.3743 ±  0.4518
 1400 loss:   0.0703, free energy: -15.4000 ±  0.3780
 1500 loss:   0.0869, free energy: -15.4088 ±  0.4005
 1600 loss:   0.0436, free energy: -15.4161 ±  0.3595
 1700 loss:   0.0326, free energy: -15.4157 ±  0.3538
 1800 loss:   0.0654, free energy: -15.4120 ±  0.3888
 1900 loss:   0.0463, free e

There is almost no improvement. Moreover, the model gets trapped in local minima more easily and converges more slowly. If we further increase the number of features, the performance could get even worse.

#### Increase the Batch Size

In [94]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [8, 8], radius = 1.5).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

In [95]:
batch_size = 500
echo = 100
cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.
for k in range(2000):
    loss, meanfree, stdfree = model.loss(batch_size, return_statistics=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cum_loss += loss.item()
    cum_meanfree += meanfree
    cum_stdfree += stdfree
    if (k+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f} ±{:8.4f}'.format(k+1, cum_loss/echo, cum_meanfree/echo, cum_stdfree/echo))
        cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.

  100 loss:  -0.1804, free energy: -14.9131 ±  0.7236
  200 loss:   0.0854, free energy: -15.3049 ±  0.4273
  300 loss:   0.0138, free energy: -15.3876 ±  0.3390
  400 loss:   0.0017, free energy: -15.4102 ±  0.3071
  500 loss:   0.0004, free energy: -15.4262 ±  0.3081
  600 loss:   0.0254, free energy: -15.4250 ±  0.3149
  700 loss:   0.0082, free energy: -15.4388 ±  0.2856
  800 loss:   0.0130, free energy: -15.4397 ±  0.3010
  900 loss:   0.0179, free energy: -15.4424 ±  0.3159
 1000 loss:   0.0087, free energy: -15.4515 ±  0.2902
 1100 loss:   0.0230, free energy: -15.4520 ±  0.3067
 1200 loss:   0.0058, free energy: -15.4580 ±  0.3007
 1300 loss:   0.0154, free energy: -15.4627 ±  0.3004
 1400 loss:   0.0119, free energy: -15.4672 ±  0.2909
 1500 loss:   0.0126, free energy: -15.4681 ±  0.2767
 1600 loss:   0.0053, free energy: -15.4713 ±  0.2700
 1700 loss:   0.0069, free energy: -15.4740 ±  0.2747
 1800 loss:   0.0061, free energy: -15.4748 ±  0.2642
 1900 loss:   0.0118, free e

Increasing the batch size improves the result slightly further. But this is a bit of a cheat, as we have allowed the machine to sample more configurations. 

#### Increase the Depth

In [96]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [8, 8, 8, 8], radius = 1.5).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

In [97]:
batch_size = 100
echo = 100
cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.
for k in range(2000):
    loss, meanfree, stdfree = model.loss(batch_size, return_statistics=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    cum_loss += loss.item()
    cum_meanfree += meanfree
    cum_stdfree += stdfree
    if (k+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f} ±{:8.4f}'.format(k+1, cum_loss/echo, cum_meanfree/echo, cum_stdfree/echo))
        cum_loss, cum_meanfree,cum_stdfree = 0., 0., 0.

  100 loss:  -0.0993, free energy: -14.8069 ±  0.5581
  200 loss:   0.0333, free energy: -15.0457 ±  0.2860
  300 loss:   0.0182, free energy: -15.0470 ±  0.2752
  400 loss:   0.0444, free energy: -15.0451 ±  0.2793
  500 loss:   0.0230, free energy: -15.0531 ±  0.2365
  600 loss:   0.0616, free energy: -15.0444 ±  0.3277
  700 loss:   0.0586, free energy: -15.0440 ±  0.2984
  800 loss:   0.0244, free energy: -15.0547 ±  0.2376
  900 loss:   0.0130, free energy: -15.0530 ±  0.2245
 1000 loss:   0.0399, free energy: -15.0602 ±  0.2846
 1100 loss:   0.0413, free energy: -15.0463 ±  0.2617
 1200 loss:   0.0755, free energy: -15.0242 ±  0.3260
 1300 loss:   0.0364, free energy: -15.0533 ±  0.2478
 1400 loss:   0.0334, free energy: -15.0512 ±  0.2704
 1500 loss:   0.0471, free energy: -15.0344 ±  0.2870
 1600 loss:   0.0297, free energy: -15.0531 ±  0.2666
 1700 loss:   0.0432, free energy: -14.8294 ±  0.1105
 1800 loss:   0.0746, free energy: -14.9274 ±  0.2421
 1900 loss:   0.0463, free e

The model gets trapped in a local minimum and the performance gets worse.

### Profiling

Profile a single training iteration with snakeviz. Sampling takes the most time. How can that be improved?

In [3]:
%load_ext snakeviz

In [4]:
%run "model.py"
H = lambda J: -J*(TwoBody([1,0],[1,-1]) + TwoBody([0,1],[1,-1]))
model = HolographicPixelGNN(Model(H(0.440686793), Lattice(4, 2), SymmetricGroup(2)), 
                            hidden_features = [8, 8], radius = 1.5).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.02);

In [5]:
%%snakeviz
batch_size = 100
loss = model.loss(batch_size)
optimizer.zero_grad()
loss.backward()
optimizer.step()

 


### Correlation Function

Generate samples

In [48]:
x = model.sample(1000000)

(1,0)-correlation

In [49]:
dims = tuple(range(-model.energy.lattice.dimension,0))
rolled = model.energy.group.inv(x.roll((-1,0), dims))
coupled = model.energy.group.mul(rolled, x)
corr = model.energy.group.val(coupled, torch.tensor([1.,-1.]))
corr.mean(0)

tensor([[0.8378, 0.8263, 0.8380, 0.8232],
        [0.8351, 0.7778, 0.8122, 0.7820],
        [0.8344, 0.8210, 0.8385, 0.8146],
        [0.8455, 0.7819, 0.8168, 0.7844]])

(0,1)-correlation

In [50]:
dims = tuple(range(-model.energy.lattice.dimension,0))
rolled = model.energy.group.inv(x.roll((0,-1), dims))
coupled = model.energy.group.mul(rolled, x)
corr = model.energy.group.val(coupled, torch.tensor([1.,-1.]))
corr.mean(0)

tensor([[0.8518, 0.8516, 0.8547, 0.8490],
        [0.8358, 0.8342, 0.8493, 0.8247],
        [0.8424, 0.8389, 0.8473, 0.8344],
        [0.8323, 0.8283, 0.8395, 0.8255]])

One can see that the correlation is not translationally invariant. This might be the main source of error.
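The size of the asymmetry can be read off the two tables printed above. The sketch below copies those numbers and compares the site-to-site spread with the statistical noise expected from $10^6$ samples; the noise estimate assumes independent samples and uses the fact that each bond correlation is a $\pm1$-valued variable.

```python
import math

# Values copied from the printed (1,0)- and (0,1)-correlation tables.
corr_10 = [[0.8378, 0.8263, 0.8380, 0.8232],
           [0.8351, 0.7778, 0.8122, 0.7820],
           [0.8344, 0.8210, 0.8385, 0.8146],
           [0.8455, 0.7819, 0.8168, 0.7844]]
corr_01 = [[0.8518, 0.8516, 0.8547, 0.8490],
           [0.8358, 0.8342, 0.8493, 0.8247],
           [0.8424, 0.8389, 0.8473, 0.8344],
           [0.8323, 0.8283, 0.8395, 0.8255]]
n_samples = 1_000_000

def summarize(table):
    values = [v for row in table for v in row]
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    # Standard error of the mean of a +/-1 variable with expectation `mean`.
    noise = math.sqrt(1.0 - mean ** 2) / math.sqrt(n_samples)
    return mean, spread, noise

for name, table in [("(1,0)", corr_10), ("(0,1)", corr_01)]:
    mean, spread, noise = summarize(table)
    print(f"{name}: mean = {mean:.4f}, spread = {spread:.4f}, noise ~ {noise:.4f}")
```

In both directions the spread between lattice sites is far larger than the sampling noise, so the asymmetry is a property of the trained model rather than a statistical fluctuation.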

## Mixture Model

### Theory

To restore the translation symmetry, one idea is to symmetrize the samples by random translations. This amounts to modeling the target probability $p(x)$ by a mixture model
$$\bar{q}_\theta(x) = \frac{1}{N}\sum_a q_\theta(x+a),$$
where $a$ is summed over the translation group of the lattice and $N$ is the order of the translation group.
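As an illustration of the mixture $\frac{1}{N}\sum_a q_\theta(x+a)$, the NumPy sketch below evaluates its log-probability with a log-sum-exp over all translations and draws mixture samples by randomly translating samples of the base model. The base model here is a toy product distribution with site-dependent probabilities, chosen only because it is not translation invariant; `log_q`, `log_q_mixture` and `random_translate` are hypothetical names, not functions of `model.py`.

```python
import numpy as np


def log_q(x, p_up):
    """Toy base model: independent +/-1 spins with site-dependent P(up) = p_up."""
    logp = np.where(x > 0, np.log(p_up), np.log1p(-p_up))
    return logp.sum(axis=(-2, -1))


def log_q_mixture(x, p_up):
    """log[(1/N) sum_a q(x + a)], with a running over all lattice translations."""
    L1, L2 = x.shape[-2:]
    terms = np.stack([log_q(np.roll(x, (a, b), axis=(-2, -1)), p_up)
                      for a in range(L1) for b in range(L2)], axis=0)
    m = terms.max(axis=0)  # subtract the maximum for numerical stability
    return m + np.log(np.exp(terms - m).sum(axis=0)) - np.log(L1 * L2)


def random_translate(x, rng):
    """Apply an independent, uniformly random translation to each sample."""
    L1, L2 = x.shape[-2:]
    a = rng.integers(L1, size=len(x))
    b = rng.integers(L2, size=len(x))
    return np.stack([np.roll(s, (int(i), int(j)), axis=(0, 1))
                     for s, i, j in zip(x, a, b)])


rng = np.random.default_rng(0)
p_up = rng.uniform(0.2, 0.8, size=(4, 4))
x = rng.choice([-1, 1], size=(5, 4, 4))
lq = log_q_mixture(x, p_up)
lq_shifted = log_q_mixture(np.roll(x, (1, 2), axis=(-2, -1)), p_up)
x_sym = random_translate(x, rng)
```

By construction `lq` and `lq_shifted` agree, i.e. the mixture is translation invariant even though the base model is not. The price is $N$ evaluations of the base log-probability per sample.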

The loss function becomes
$$\mathcal{L}=\sum_x \Big(\frac{1}{N}\sum_a q_\theta(x+a)\Big)\bigg(E(x)+\ln \Big(\frac{1}{N}\sum_a q_\theta(x+a)\Big)\bigg).$$
Using the inequality
$$\Big(\frac{1}{N}\sum_a q_\theta(x+a)\Big)\ln \Big(\frac{1}{N}\sum_a q_\theta(x+a)\Big)\leq \frac{1}{N}\sum_a q_\theta(x+a)\ln q_\theta(x+a),$$
the loss function can be bounded by 
$$\mathcal{L}\leq \frac{1}{N}\sum_a\sum_x  q_\theta(x+a)\big(E(x)+\ln q_\theta(x+a)\big)\equiv \bar{\mathcal{L}}.$$
Minimizing the upper bound $\bar{\mathcal{L}}$ therefore also pushes the actual loss $\mathcal{L}$ down.

Since $x$ is summed over all configurations, we can relabel $x+a \to x$, such that
$$\bar{\mathcal{L}}=\frac{1}{N}\sum_a\sum_x  q_\theta(x)\big(E(x-a)+\ln q_\theta(x)\big)=\sum_x  q_\theta(x)\big(E(x)+\ln q_\theta(x)\big).$$
The bound $\bar{\mathcal{L}}$ is therefore the same as the original loss function, because the energy function is symmetric under translation, $E(x-a)=E(x)$.
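The chain of inequalities can be checked numerically on a system small enough to enumerate. The sketch below uses a 4-site Ising ring with an arbitrary coupling and a random, non-symmetric $q$; all names and parameter values are illustrative assumptions. It compares the exact free energy $-\ln Z$, the loss of the symmetrized mixture, and the loss of the unsymmetrized $q$ (which equals the bound $\bar{\mathcal{L}}$).

```python
import itertools
import math
import random

N_SITES = 4
J = 0.5  # arbitrary illustrative coupling

configs = list(itertools.product((1, -1), repeat=N_SITES))


def energy(x):
    # Translation-invariant ring energy E(x) = -J * sum_i x_i x_{i+1}
    return -J * sum(x[i] * x[(i + 1) % N_SITES] for i in range(N_SITES))


def shift(x, a):
    # Cyclic translation of configuration x by a sites
    return x[a:] + x[:a]


def variational_free_energy(p):
    # sum_x p(x) * (E(x) + ln p(x))
    return sum(p[x] * (energy(x) + math.log(p[x])) for x in configs)


# Random normalized distribution with no translation symmetry
random.seed(0)
weights = {x: random.random() for x in configs}
total = sum(weights.values())
q = {x: w / total for x, w in weights.items()}

# Mixture over all translations
q_bar = {x: sum(q[shift(x, a)] for a in range(N_SITES)) / N_SITES
         for x in configs}

exact = -math.log(sum(math.exp(-energy(x)) for x in configs))
loss_mixture = variational_free_energy(q_bar)
loss_bound = variational_free_energy(q)
print(exact, loss_mixture, loss_bound)
```

The three numbers come out in increasing order: symmetrizing can only lower the variational free energy, and the bound that is actually optimized is the ordinary loss of the base model.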

### Experiment

In [21]:
%run "model.py"
H = lambda J: -J*(TwoBody(torch.tensor([1.,-1.]), (1,0)) 
                  + TwoBody(torch.tensor([1.,-1.]), (0,1)))
model = HolographicPixelGNN(
            Model(
                H(0.440686793), # Ising critical point
                SymmetricGroup(2), 
                Lattice(4, 2)), 
            hidden_features = [4, 4])

In [22]:
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [23]:
batch_size = 100
train_loss = 0.
free_energy = 0.
echo = 100  # report running averages every `echo` iterations
for epoch in range(2000):
    x = model.sample(batch_size)
    log_prob = model.log_prob(x)
    energy = model.energy(x)
    # per-sample free energy E(x) + ln q(x), detached so it acts as a fixed reward
    free = energy + log_prob.detach()
    meanfree = free.mean()
    # score-function (REINFORCE) surrogate loss with the batch mean as baseline
    loss = torch.sum(log_prob * (free - meanfree))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    free_energy += meanfree.item()
    if (epoch+1)%echo == 0:
        print('{:5} loss: {:8.4f}, free energy: {:8.4f}'.format(epoch+1, train_loss/echo, free_energy/echo))
        train_loss = 0.
        free_energy = 0.

  100 loss: -128.0252, free energy: -14.3258
  200 loss:  -9.1453, free energy: -15.2334
  300 loss:   3.0321, free energy: -15.2826
  400 loss:   0.0435, free energy: -15.2954
  500 loss:  -1.6938, free energy: -15.3154
  600 loss:   0.3055, free energy: -15.3347
  700 loss:  -3.3122, free energy: -15.3460
  800 loss:  -0.0673, free energy: -15.3751
  900 loss:  -3.4675, free energy: -15.3795
 1000 loss:  -2.2874, free energy: -15.3904
 1100 loss:  -5.7892, free energy: -15.3803
 1200 loss:  -4.3349, free energy: -15.3864
 1300 loss:  -2.3260, free energy: -15.3953
 1400 loss:  -1.9631, free energy: -15.4171
 1500 loss:  -3.2912, free energy: -15.4215
 1600 loss:  -2.9325, free energy: -15.4262
 1700 loss:  -0.6374, free energy: -15.4272
 1800 loss:  -1.5955, free energy: -15.4323
 1900 loss:  -1.7919, free energy: -15.4271
 2000 loss:   1.2615, free energy: -15.4353


In [25]:
with torch.no_grad():
    F = []
    for i in range(echo):
        F.append(model.free_energy(model.sample(batch_size), rot=False).mean())
F = torch.tensor(F)
print(F)
F.mean(), F.std()

tensor([-15.5328, -15.4791, -15.4780, -15.4704, -15.4916, -15.5104, -15.4654,
        -15.4926, -15.4941, -15.5059, -15.5071, -15.5008, -15.4807, -15.5108,
        -15.4884, -15.5086, -15.4986, -15.4825, -15.4900, -15.4779, -15.5024,
        -15.5043, -15.4846, -15.5295, -15.4797, -15.4762, -15.4834, -15.4789,
        -15.5423, -15.5269, -15.4888, -15.4732, -15.5243, -15.5040, -15.4393,
        -15.4942, -15.4827, -15.4778, -15.4804, -15.5104, -15.4958, -15.5068,
        -15.4886, -15.4729, -15.4854, -15.4967, -15.4322, -15.5044, -15.4833,
        -15.4693, -15.5313, -15.4983, -15.4741, -15.5354, -15.5384, -15.4687,
        -15.4930, -15.4858, -15.5021, -15.5568, -15.5000, -15.4812, -15.4534,
        -15.4874, -15.4596, -15.5264, -15.4516, -15.4621, -15.4797, -15.4916,
        -15.5320, -15.5010, -15.5241, -15.4993, -15.4920, -15.4831, -15.4822,
        -15.4846, -15.5249, -15.4885, -15.5109, -15.4655, -15.4739, -15.5028,
        -15.4846, -15.5289, -15.4891, -15.4938, -15.5611, -15.49

(tensor(-15.4940), tensor(0.0230))

The performance is significantly improved. With the mixture-model estimate, the variational free energy reaches -15.494 ± 0.023, which is about 0.2% above -15.52.
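For a lattice this small, a reference value for the free energy can be obtained by brute-force enumeration. The sketch below assumes that the two `TwoBody` terms encode the nearest-neighbour Ising energy $E=-J\sum_{\langle ij\rangle}s_i s_j$ on a periodic $4\times 4$ lattice with Boltzmann weight $e^{-E}$; this is a reading of the setup above, not something confirmed against `model.py`. The function name is introduced here for illustration.

```python
import itertools
import math


def exact_free_energy(J, L=4):
    """Return -ln Z for E = -J * sum over nearest-neighbour bonds on an L x L torus."""
    neg_energies = []
    for spins in itertools.product((1, -1), repeat=L * L):
        bond_sum = 0
        for i in range(L):
            for j in range(L):
                s = spins[i * L + j]
                bond_sum += s * spins[((i + 1) % L) * L + j]   # (1,0) bond
                bond_sum += s * spins[i * L + (j + 1) % L]     # (0,1) bond
        neg_energies.append(J * bond_sum)  # this is -E(x)
    # log-sum-exp over all 2^(L*L) configurations
    top = max(neg_energies)
    log_Z = top + math.log(sum(math.exp(v - top) for v in neg_energies))
    return -log_Z


F_exact = exact_free_energy(0.440686793)
print(F_exact)
```

The enumeration visits all $2^{16}$ configurations, so it takes a second or two in pure Python; the result can be compared directly with the -15.52 quoted above.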

Now the main issue is the training efficiency. Consider
* removing the layer norm?