In [2]:
%run 'model.py'

# Autoregressive Model

An autoregressive model is a generative model that expresses the joint probability distribution as a product of conditional distributions,
$$p(x_1,x_2,x_3,\cdots)=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)\cdots = \prod_{i}p(x_i|x_1,\cdots,x_{i-1}).$$
The parameters of the conditional distributions are computed by neural networks. Because each conditional probability depends only on the preceding variables, the neural network must be masked properly so that it respects the same causal structure.
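As a quick sanity check of the factorization above, the following sketch builds a small joint distribution over three binary variables (the weights are arbitrary and purely illustrative) and confirms that the product of conditionals reproduces the joint probability. It uses plain Python and is independent of `model.py`.

```python
import itertools

# Hypothetical joint distribution over three binary variables (arbitrary positive weights).
weights = {x: 1.0 + x[0] + 2 * x[1] * x[2] for x in itertools.product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

def marginal(prefix):
    """Probability that the first len(prefix) variables equal prefix."""
    return sum(p for x, p in joint.items() if x[:len(prefix)] == prefix)

def chain_rule(x):
    """Product of the conditionals p(x_i | x_1, ..., x_{i-1})."""
    prob = 1.0
    for i in range(len(x)):
        prob *= marginal(x[:i + 1]) / marginal(x[:i])
    return prob

# Largest deviation between the chain-rule product and the joint probability.
max_error = max(abs(chain_rule(x) - p) for x, p in joint.items())
print(max_error)
```

The deviation is at the level of floating-point round-off, since the chain rule is an exact identity for any joint distribution.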

## Autoregressive Linear Layer

A key component is to realize an autoregressive linear layer, which maps $x=(x_1,x_2,\cdots)$ to $y=(y_1,y_2,\cdots)$ via
$$y = W\cdot x + b,$$
respecting the causality that $y_i$ only depends on $x_1,\cdots,x_{i-1}$. This can be achieved by requiring the weight matrix $W$ to take a *strictly lower-triangular* form
$$W = \begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
W_{21} & 0 & 0 & \cdots & 0\\
W_{31} & W_{32} & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
W_{n1} & W_{n2} & W_{n3} & \cdots & 0\\
\end{bmatrix}$$
For the PyTorch realization, we first create a raw weight matrix, which is a full matrix, and then construct the actual weight matrix by truncating the full matrix to its lower-triangular part. This can be implemented with `torch.tril`, which supports gradient backpropagation.
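The causality claim can be verified directly by perturbing one input at a time and checking which outputs move. The sketch below does this in plain Python with nested lists, mimicking the masking that `torch.tril(w, diagonal=-1)` performs; the matrix size and random values are illustrative assumptions.

```python
import random

n = 4
random.seed(0)
# Raw full matrix, then keep only entries strictly below the diagonal (j < i),
# which mirrors torch.tril(w, diagonal=-1).
w_full = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
w_mask = [[w_full[i][j] if j < i else 0.0 for j in range(n)] for i in range(n)]

def apply(w, x):
    """Matrix-vector product y = W x."""
    return [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]

x = [random.gauss(0, 1) for _ in range(n)]
y = apply(w_mask, x)

# Perturb x_k: only outputs y_i with i > k are allowed to change.
causal = True
for k in range(n):
    x_pert = list(x)
    x_pert[k] += 1.0
    y_pert = apply(w_mask, x_pert)
    for i in range(n):
        if i <= k and y_pert[i] != y[i]:
            causal = False
print(causal)
```

Note that the first output is identically zero before the bias is added, because it has no preceding inputs to depend on.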

### Toy Example
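Create a full weight matrix `w_full` and truncate it to the triangular weight matrix `w_tril`. The function `torch.tril` takes an argument `diagonal` that specifies the highest diagonal to keep (inclusive): `diagonal=0` keeps the main diagonal and everything below it, while `diagonal=-1` keeps only the entries strictly below the main diagonal.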

In [2]:
w_full = torch.ones(3, 3, requires_grad = True)
w_tril = torch.tril(w_full, diagonal = -1)
print(w_full)
print(w_tril)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)
tensor([[0., 0., 0.],
        [1., 0., 0.],
        [1., 1., 0.]], grad_fn=<TrilBackward>)


We can then use the triangular weight matrix in the remaining computation to evaluate a loss function. For example, let the loss be the squared Frobenius norm of the triangular weight matrix, i.e. the sum of its squared entries (just to get some scalar score for the matrix).

In [3]:
loss = torch.sum(w_tril**2)
loss

tensor(3., grad_fn=<SumBackward0>)

Now we can backpropagate the gradient and check what gradient the raw weight matrix receives.

In [4]:
loss.backward()
w_full.grad

tensor([[0., 0., 0.],
        [2., 0., 0.],
        [2., 2., 0.]])

We can see that the gradient is automatically masked as well. The diagonal and the upper triangle do not receive any gradient signal. If we put `w_full` into an optimizer to minimize the loss, the strictly lower triangle of the weight matrix will be trained to zero (as favored by the loss function), while the remaining entries stay at their initial values.

In [28]:
optimizer = optim.Adam([w_full], lr = 0.1)
for epoch in range(500):
    w_tril = torch.tril(w_full, diagonal = -1)
    loss = torch.sum(w_tril**2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
w_full

tensor([[ 1.0000e+00,  1.0000e+00,  1.0000e+00],
        [-4.9447e-13,  1.0000e+00,  1.0000e+00],
        [-4.9447e-13, -4.9447e-13,  1.0000e+00]], requires_grad=True)

### Pack to Torch Module

We can pack the above functionality into a neural network module in PyTorch, inheriting from the standard linear layer `nn.Linear`.

In [6]:
class AutoregressiveLinear(nn.Linear):
    """ Applies a linear transformation to the incoming data,
        with the weight matrix masked to the lower-triangle. 
        
        Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``
        diagonal: the diagonal to truncate to"""
    
    def __init__(self, in_features, out_features, bias=True, diagonal=0):
        super(AutoregressiveLinear, self).__init__(in_features, out_features, bias)
        self.diagonal = diagonal
    
    def extra_repr(self):
        return super(AutoregressiveLinear, self).extra_repr() + ', diagonal={}'.format(self.diagonal)
    
    # overwrite forward pass
    def forward(self, input):
        return F.linear(input, torch.tril(self.weight, self.diagonal), self.bias)
    
    def forward_at(self, input, i):
        output = input.matmul(torch.tril(self.weight, self.diagonal).narrow(0, i, 1).t())
        if self.bias is not None:
            output += self.bias.narrow(0, i, 1)
        return output.squeeze()

To test this module, let us create some data. The target $y$ is related to the input $x$ by $y_i=\sum_{j=1}^{i}x_j$, which can be modeled by an autoregressive linear transformation with `diagonal=0` (the default), whose weight is a lower-triangular matrix with all 1's on and below the diagonal, and no bias.

In [7]:
input = torch.randn(10, 5)
target = torch.cumsum(input, axis = 1)
input, target

(tensor([[ 1.2537,  1.2952, -0.7514,  1.6626,  0.2934],
         [-0.8714, -2.4656,  0.7002, -0.4921, -0.5424],
         [ 0.3850, -0.4022, -1.7203, -0.8830,  0.0989],
         [-2.1567,  1.3889, -1.0016,  0.4516,  0.4419],
         [ 0.6539,  1.0505,  0.3454,  1.2798, -0.0401],
         [-0.9391,  0.2523,  0.1364, -1.1801,  1.5261],
         [-0.8578,  0.0364, -0.0223, -0.0495, -1.2166],
         [ 0.4616,  2.1005, -0.1037, -0.9643,  0.7355],
         [ 0.3906, -0.9893,  0.9771, -0.6446, -0.0489],
         [ 1.2490,  1.2904, -1.4693, -1.5764,  0.8918]]),
 tensor([[ 1.2537,  2.5489,  1.7975,  3.4601,  3.7535],
         [-0.8714, -3.3371, -2.6368, -3.1289, -3.6714],
         [ 0.3850, -0.0172, -1.7375, -2.6205, -2.5216],
         [-2.1567, -0.7679, -1.7695, -1.3179, -0.8760],
         [ 0.6539,  1.7044,  2.0497,  3.3296,  3.2895],
         [-0.9391, -0.6868, -0.5505, -1.7306, -0.2045],
         [-0.8578, -0.8213, -0.8436, -0.8931, -2.1097],
         [ 0.4616,  2.5622,  2.4585,  1.4942, 

Supervised learning with mean-square-error (MSE) loss.

In [8]:
model = AutoregressiveLinear(5, 5)
loss_op = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1.)
train_loss = 0.
for epoch in range(500):
    loss = loss_op(model(input), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 0.1874
loss : 0.0000
loss : 0.0000
loss : 0.0000
loss : 0.0000


As training converges, we inspect the model parameters.

In [9]:
list(model.parameters())

[Parameter containing:
 tensor([[ 1.0000,  0.4060, -0.0309, -0.4228, -0.0073],
         [ 1.0000,  1.0000, -0.3395, -0.3059,  0.4422],
         [ 1.0000,  1.0000,  1.0000,  0.3253, -0.0558],
         [ 1.0000,  1.0000,  1.0000,  1.0000,  0.1358],
         [ 1.0000,  1.0000,  1.0000,  1.0000,  1.0000]], requires_grad=True),
 Parameter containing:
 tensor([-1.4105e-08,  2.4496e-10, -2.0742e-07, -1.4842e-07, -5.9154e-08],
        requires_grad=True)]

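The effective weight matrix (the raw weight truncated by `torch.tril`) indeed becomes a lower-triangular matrix of 1's and the bias indeed vanishes. The entries above the diagonal of the raw weight are masked out of the forward pass, so they receive no gradient and keep their random initial values.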

## Generative Model

We can use autoregressive linear layers to build the autoregressive model. As a generative model, the autoregressive model must provide two functionalities:
- `log_prob(input)` evaluating the log probability of a batch of samples as `input`,
- `sample(batch_size)` generating a batch of samples given the `batch_size`, according to the model probability distribution.

We realize these functionalities in the neural network module `AutoregressiveModel`.

In [10]:
class AutoregressiveModel(nn.Module):
    """ Represent a generative model that can generate samples and provide log probability evaluations.
        
        Args:
        features: size of each sample
        depth: depth of the neural network (in number of linear layers) (default=1)
        nonlinearity: activation function to use (default='ReLU') """
    
    def __init__(self, features, depth=1, nonlinearity='ReLU'):
        super(AutoregressiveModel, self).__init__()
        self.features = features # number of features
        self.layers = nn.ModuleList()
        for i in range(depth):
            if i == 0: # first autoregressive linear layer must have diagonal=-1
                self.layers.append(AutoregressiveLinear(self.features, self.features, diagonal = -1))
            else: # remaining autoregressive linear layers have diagonal=0 (by default)
                self.layers.append(AutoregressiveLinear(self.features, self.features))
            if i == depth-1: # the last layer must be Sigmoid
                self.layers.append(nn.Sigmoid())
            else: # other layers use the specified nonlinearity
                self.layers.append(getattr(nn, nonlinearity)())
    
    def extra_repr(self):
        return '(features): {}'.format(self.features) + super(AutoregressiveModel, self).extra_repr()
    
    def forward(self, input):
        prob = input # prob as a workspace, initialized to input
        for layer in self.layers: # apply layers
            prob = layer(prob)
        return prob # prob holds predicted Bernoulli probability parameters
    
    def log_prob(self, input):
        prob = self(input) # forward pass to get Bernoulli probability parameters
        return torch.sum(dist.Bernoulli(prob).log_prob(input), axis=-1)
    
    def sample(self, batch_size=1):
        with torch.no_grad(): # no gradient for sample generation
            # create a record to host layerwise outputs
            record = torch.zeros(len(self.layers)+1, batch_size, self.features)
            # autoregressive batch sampling
            for i in range(self.features):
                for l, layer in enumerate(self.layers):
                    if isinstance(layer, AutoregressiveLinear): # linear layer
                        record[l+1, :, i] = layer.forward_at(record[l], i)
                    else: # elementwise layer
                        record[l+1, :, i] = layer(record[l, :, i])
                record[0, :, i] = dist.Bernoulli(record[-1, :, i]).sample()
        return record[0]

### Architecture

The conditional probabilities are modeled as Bernoulli distributions, whose probabilities are given by the autoregressive feed-forward neural network. To respect the causal structure, the first autoregressive layer must have `diagonal=-1` such that $y_i$ depends only on $(x_1,\cdots,x_{i-1})$. To ensure that the outputs are probabilities (real numbers between 0 and 1), the nonlinear activation of the last layer must be Sigmoid. The internal nonlinear activations can be specified freely (default: ReLU). The architecture of a depth-3 autoregressive model looks like:

In [11]:
AutoregressiveModel(5, depth=3)

AutoregressiveModel(
  (features): 5
  (layers): ModuleList(
    (0): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=-1)
    (1): ReLU()
    (2): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=0)
    (3): ReLU()
    (4): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=0)
    (5): Sigmoid()
  )
)
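Only the first layer needs `diagonal=-1`, because composing a lower-triangular dependency pattern with a strictly lower-triangular one stays strictly lower-triangular, and elementwise activations do not mix sites. The sketch below tracks this with boolean dependency matrices in plain Python; the helper `compose` is introduced here for illustration and is not part of `model.py`.

```python
n = 5
strict = [[j < i for j in range(n)] for i in range(n)]   # pattern of diagonal=-1
lower = [[j <= i for j in range(n)] for i in range(n)]   # pattern of diagonal=0

def compose(a, b):
    """Dependency pattern of applying b first, then a (boolean matrix product)."""
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Depth-3 stack as printed above: first layer strict, then two layers with diagonal=0.
dependency = compose(lower, compose(lower, strict))

# For comparison: if the first layer also used diagonal=0, output i would see input i.
leaky = compose(lower, compose(lower, lower))

print(dependency == strict, leaky[0][0])
```

The stack with a strict first layer keeps output $i$ independent of inputs $j\geq i$, whereas the all-`diagonal=0` stack lets each output depend on its own input, which would break the autoregressive property.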

### Bernoulli Distribution

The forward pass will calculate the Bernoulli probability parameter: 
$$p_i = f(x_1,\cdots,x_{i-1}).$$ 
The Bernoulli samples are binary (0 or 1), where $x_i=1$ with probability $p_i$ and $x_i=0$ with probability $1-p_i$. The log conditional probability is given by
$$\log p(x_i|x_1,\cdots,x_{i-1})= x_i\log p_i + (1-x_i)\log(1-p_i).$$
The log probability of the autoregressive model is given by the summation
$$\log p(x_1,x_2,\cdots) = \sum_{i=1}^n\log p(x_i|x_1,\cdots,x_{i-1}).$$
The nice thing about PyTorch is that it already has a distribution class `distributions.Bernoulli`, which provides the `log_prob` and `sample` methods to evaluate the log probability and to generate samples. We can just call them to implement the `log_prob` and `sample` methods of our `AutoregressiveModel`.
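To make the formulas concrete, here is a hand evaluation of the summed Bernoulli log probability in plain Python. The probabilities `p` are a hypothetical choice (every conditional equal to 1/2), not the output of any trained network.

```python
import math

def bernoulli_log_prob(x, p):
    """Sum over sites of x_i*log(p_i) + (1 - x_i)*log(1 - p_i)."""
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
               for xi, pi in zip(x, p))

x = [0, 1, 0, 0, 0]
p = [0.5] * 5   # hypothetical: all conditional probabilities equal 1/2
log_prob = bernoulli_log_prob(x, p)
print(log_prob)  # equals 5 * log(1/2)
```

With all $p_i=1/2$ every 5-site configuration gets log probability $5\ln\frac{1}{2}\approx-3.47$, which gives a rough reference scale for the value returned by a randomly initialized model.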

Here is how we evaluate the log probability:

In [12]:
AutoregressiveModel(5, depth=3).log_prob(torch.tensor([0.,1.,0.,0.,0.]))

tensor(-3.6187, grad_fn=<SumBackward1>)

Here is how we sample:

In [13]:
AutoregressiveModel(5, depth=3).sample()

tensor([[0., 0., 1., 1., 0.]])

We can generate a batch of samples by specifying a batch size. All samples will be generated in parallel.

In [26]:
AutoregressiveModel(5, depth=3).sample(4)

tensor([[0., 1., 0., 1., 1.],
        [1., 1., 0., 0., 0.],
        [0., 1., 0., 1., 0.],
        [1., 0., 1., 1., 0.]])

### Toy Model: Ideal Ferromagnets

First generate some fake training data.

In [29]:
data = torch.tensor([
    [1., 1., 1., 1., 1.]
])

Create a model. Initially, the generative model just samples randomly.

In [30]:
model = AutoregressiveModel(5, depth=1)
model.sample(5)

tensor([[0., 0., 1., 1., 1.],
        [1., 1., 0., 0., 1.],
        [0., 0., 0., 1., 0.],
        [0., 1., 1., 0., 0.],
        [0., 0., 0., 0., 0.]])

Train on the dataset using the negative log-likelihood loss.

In [31]:
optimizer = optim.Adam(model.parameters(), lr=0.5)
train_loss = 0.
for epoch in range(500):
    loss = -torch.mean(model.log_prob(data))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 0.0719
loss : 0.0005
loss : 0.0003
loss : 0.0002
loss : 0.0002


The model learns to generate all 1's.

In [32]:
model.sample(5)

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

The trick is to make the bias very positive, so that, regardless of the configuration, the last layer always outputs a probability close to 1 and the sampler is strongly biased towards 1.

In [33]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2173,  0.3992,  0.0467,  0.3007, -0.3735],
         [ 5.3721, -0.2386,  0.3877, -0.1795, -0.1578],
         [ 4.1606,  4.2976,  0.0204,  0.1498, -0.3951],
         [ 3.6092,  3.9985,  3.6749, -0.0718,  0.0917],
         [ 3.7416,  3.1544,  3.2289,  3.3069, -0.3318]], requires_grad=True),
 Parameter containing:
 tensor([8.9328, 5.1821, 4.2918, 3.9373, 3.9188], requires_grad=True)]

Let us try a slightly more challenging dataset, whose samples are either all-1 or all-0, meaning that the spins are correlated. The neural network must learn to capture the correlation in order to model the dataset correctly. The previous bias trick will not work.

In [34]:
data = torch.tensor([
    [1., 1., 1., 1., 1.],
    [0., 0., 0., 0., 0.]
])
model = AutoregressiveModel(5, depth=1)

Train it.

In [35]:
optimizer = optim.Adam(model.parameters(), lr=1.)
train_loss = 0.
for epoch in range(500):
    loss = -torch.mean(model.log_prob(data))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 0.8039
loss : 0.6938
loss : 0.6936
loss : 0.6935
loss : 0.6934


The loss does not go to zero in this case, because it is always lower-bounded by the *entropy* of the dataset, which is $\ln 2\approx 0.6931$ in this case. So the loss has converged to its theoretical minimum. As one can see below, the neural network successfully learns how to generate correlated samples as well.

In [24]:
model.sample(5)

tensor([[1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])
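The entropy bound $\ln 2$ can be reproduced by hand. The sketch below computes the empirical entropy of the two-sample dataset and the mean negative log likelihood of an idealized autoregressive model in which $p_1=1/2$ and every later site copies $x_1$. The function `ideal_conditionals` is a hypothetical construction for this check (with probabilities clipped slightly away from 0 and 1 to keep the logarithms finite), not the trained network.

```python
import math

data = [(1, 1, 1, 1, 1), (0, 0, 0, 0, 0)]

# Empirical entropy: two equally likely configurations.
entropy = -sum((1 / len(data)) * math.log(1 / len(data)) for _ in data)

def ideal_conditionals(x, eps=1e-12):
    """p_1 = 1/2; for i > 1 the probability of x_i = 1 copies x_1 (clipped)."""
    follow = min(max(float(x[0]), eps), 1 - eps)
    return [0.5] + [follow] * (len(x) - 1)

def nll(x):
    """Negative log likelihood of x under the idealized conditionals."""
    p = ideal_conditionals(x)
    return -sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
                for xi, pi in zip(x, p))

mean_nll = sum(nll(x) for x in data) / len(data)
print(entropy, mean_nll)
```

Both numbers agree with $\ln 2\approx 0.6931$, the value the training loss approaches.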

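The trick is to make the weight matrix very positive in the lower triangle to mediate the strong correlation among spins, paired with negative biases on the later sites so that the output probability is close to 0 when the preceding spins are 0. The first bias vanishes to ensure unbiased sampling between all-1 and all-0.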

In [25]:
list(model.parameters())

[Parameter containing:
 tensor([[ 1.8387e-01,  5.0940e-02, -2.4698e-01,  2.2660e-01, -1.6552e-02],
         [ 1.7140e+01,  2.8545e-01,  3.0627e-01, -3.6814e-02,  9.8888e-02],
         [ 9.4619e+00,  9.5402e+00, -4.1717e-01,  3.6980e-01,  3.0317e-02],
         [ 7.5032e+00,  8.0729e+00,  7.5546e+00,  2.7915e-01, -8.6019e-02],
         [ 6.3904e+00,  6.8154e+00,  6.6266e+00,  6.7329e+00,  1.1359e-01]],
        requires_grad=True), Parameter containing:
 tensor([ 4.1628e-07, -8.3664e+00, -9.1003e+00, -1.1455e+01, -1.1585e+01],
        requires_grad=True)]
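Plugging the printed parameters (rounded) into the sigmoid shows how the weights and biases cooperate. For a depth-1 model, $p_i=\mathrm{sigmoid}\big(\sum_{j<i}W_{ij}x_j+b_i\big)$, and the following plain-Python sketch evaluates it for the second and fifth sites.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Second site: weight on x_1 and bias, rounded from the printed parameters.
w21, b2 = 17.140, -8.3664
p2_given_1 = sigmoid(w21 * 1 + b2)   # preceding spin x_1 = 1
p2_given_0 = sigmoid(w21 * 0 + b2)   # preceding spin x_1 = 0

# Fifth site: in this dataset all four preceding spins are equal.
w5 = [6.3904, 6.8154, 6.6266, 6.7329]
b5 = -11.585
p5_given_ones = sigmoid(sum(w5) + b5)
p5_given_zeros = sigmoid(b5)

print(p2_given_1, p2_given_0, p5_given_ones, p5_given_zeros)
```

In both cases the probability of generating a 1 is very close to 1 after a run of 1's and very close to 0 after a run of 0's, so each site effectively copies the spins before it.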

# Ising Model

## Problem and Algorithm
### Transfer Matrix Approach
The 2D Ising model on a square lattice is described by the energy
$E=-\sum_{\langle i j\rangle}S_i S_j$.
The Ising variable $S_i=\pm 1$ can be equivalently expressed as the binary variable $x_i=0,1$, given $S_i=(-1)^{x_i}$. Then the energy functional can be written as
$$E[x]=-\sum_{\langle i j\rangle}(-1)^{x_i+x_j}=\sum_{\langle i j\rangle} \epsilon(x_i,x_j).$$
We have defined the energy function $\epsilon(x_i,x_j)\equiv -(-1)^{x_i+x_j}$ on each bond $\langle ij\rangle$. The partition function can be evaluated by the transfer matrix approach
$$Z=\sum_{[x]}e^{-\beta E[x]}=\mathrm{Tr}\;T^\infty,$$
where the transfer matrix element is given by
$$\begin{split}
&T[x',x]=e^{-\beta (E_h[x]+E_v[x',x])},\\
&E_h[x]=\sum_{i}\epsilon(x_i,x_{i+1}),\quad E_v[x',x]=\sum_i\epsilon(x_i,x'_i).
\end{split}$$
Here $x$ and $x'$ denote the 1D configurations of the binary variables across a layer.

<img src="./image/comb.png" alt="comb" style="width: 300px;"/>

Solving the statistical mechanical problem amounts to finding the leading eigenstate of the transfer matrix, which can be obtained by iteratively applying the transfer matrix to an arbitrary initial state. Because all elements of the transfer matrix are positive, the Perron-Frobenius theorem guarantees that its leading eigenvector can be chosen with all components positive, so it can be normalized and viewed as a probability distribution $p[x]$ of the 1D configuration $x$. We propose to model this distribution by the autoregressive generative model.
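For a very small system the transfer matrix can be written out explicitly, which makes the iterative procedure easy to inspect. The following plain-Python sketch uses $N=3$ sites and $\beta=0.5$ (both arbitrary illustrative values) and assumes periodic boundary conditions within a layer; it repeatedly applies $T$ to a uniform starting distribution and renormalizes.

```python
import itertools
import math

N, beta = 3, 0.5   # hypothetical small system, so the 2^N x 2^N matrix is explicit
configs = list(itertools.product((0, 1), repeat=N))

def epsilon(a, b):
    """Bond energy -(-1)^(a + b)."""
    return -((-1) ** (a + b))

def E_h(x):
    """Intra-layer energy; periodic boundary is an assumption of this sketch."""
    return sum(epsilon(x[i], x[(i + 1) % N]) for i in range(N))

def E_v(xp, x):
    """Inter-layer energy between configurations x and x'."""
    return sum(epsilon(x[i], xp[i]) for i in range(N))

T = {(xp, x): math.exp(-beta * (E_h(x) + E_v(xp, x)))
     for xp in configs for x in configs}

# Power iteration: p <- T p / sum(T p), starting from the uniform distribution.
p = {x: 1.0 / len(configs) for x in configs}
for _ in range(200):
    q = {xp: sum(T[xp, x] * p[x] for x in configs) for xp in configs}
    z = sum(q.values())
    p = {xp: v / z for xp, v in q.items()}

# At the fixed point T p = z p, with z the leading eigenvalue.
residual = max(abs(sum(T[xp, x] * p[x] for x in configs) - z * p[xp])
               for xp in configs)
print(residual, p[(0, 0, 0)], p[(0, 0, 1)])
```

The converged vector is strictly positive and sums to one, and the aligned configurations carry more weight than the misaligned ones. The explicit matrix has $4^N$ entries, which is why a generative model is used in place of the explicit vector for larger systems.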

### Teaching-Learning Algorithm

The iterative approach to finding the leading eigenstate can be interpreted as a recurrent teaching-learning process. Applying the transfer matrix to the state corresponds to reweighting the probability distribution
$$p[x]\to p'[x']\propto\sum_{[x]}T[x',x]p[x].$$
To represent this process, we can introduce two generative models: one being a teacher and the other being a student. The teacher generates samples $x$ according to the teacher distribution $p_\text{tch}[x]$. Each sample $x$ will be deformed to a new sample $x'$ with the reweighting factor $T[x',x]$ (the protocol will be detailed later), such that the new samples follow a data distribution $$p_\text{dat}[x']=Z_\text{dat}^{-1}\sum_{[x]}T[x',x]p_\text{tch}[x],$$ where $Z_\text{dat}=\sum_{[x']}\sum_{[x]}T[x',x]p_\text{tch}[x]$. The student then learns from the deformed set of samples to establish a student distribution $p_\text{std}[x']$ that approximates $p_\text{dat}[x']$. The objective is to minimize the KL divergence by training the student,
$$\begin{split}
\mathcal{L}&=\mathsf{KL}(p_\text{dat}||p_\text{std})\\
&=\sum_{[x']}p_\text{dat}[x']\log \frac{p_\text{dat}[x']}{p_\text{std}[x']}\\
&=-\sum_{[x']}p_\text{dat}[x']\log p_\text{std}[x'] - H(p_\text{dat}).
\end{split}$$
The last term $H(p_\text{dat})=-\sum_{[x']}p_\text{dat}[x']\log p_\text{dat}[x']$ is the entropy associated with the data distribution, which can be dropped from the loss function, as it is independent of the parameters of the student (which we aim to train). As the student learns to model the data distribution, it replaces the teacher to teach the next generation of the student. As this teaching-learning process goes on recurrently, the generative model (both the teacher and the student) is expected to converge to the leading eigenstate of the transfer matrix $T[x',x]$. In practice, we do not really need to train the student to convergence in each step. Since our goal is to equilibrate to the final steady distribution, which is immune to the deformation imposed by the transfer matrix $T[x',x]$, we can generally mix the training and the iteration procedures together.
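The decomposition of the KL divergence into a cross-entropy term minus the data entropy can be checked numerically. The two distributions below are hypothetical three-state examples chosen only for illustration.

```python
import math

p_dat = [0.5, 0.3, 0.2]   # hypothetical data distribution
p_std = [0.4, 0.4, 0.2]   # hypothetical student distribution

kl = sum(p * math.log(p / q) for p, q in zip(p_dat, p_std))
cross_entropy = -sum(p * math.log(q) for p, q in zip(p_dat, p_std))
entropy = -sum(p * math.log(p) for p in p_dat)

print(kl, cross_entropy - entropy)
```

Since only the cross-entropy term depends on the student, minimizing it is equivalent to minimizing the KL divergence, and the cross-entropy can never fall below the entropy of the data distribution.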

However, the summation over all configurations $[x']$ is impossible to evaluate directly for large systems, so we want to replace it by sampling. This is made possible by the fact that all transfer matrix elements $T[x',x]$ are positive, which allows us to normalize $T[x',x]$ into a conditional probability
$$p[x'|x]=\frac{T[x',x]}{T[x]},\quad T[x]=\sum_{[x']}T[x',x].$$
The summation of $[x']$ in $T[x]$ can now be evaluated locally because the components of $x'$ are uncorrelated once $x$ is pinned.
$$\begin{split}
T[x]&=\sum_{[x']}e^{-\beta (E_h[x]+E_v[x',x])}\\
&=e^{-\beta E_h[x]}\prod_{i}\sum_{x'_i}e^{-\beta\epsilon(x_i,x'_i)}\\
&=(2\cosh\beta)^N e^{-\beta E_h[x]}.
\end{split}$$
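The closed form $T[x]=(2\cosh\beta)^N e^{-\beta E_h[x]}$ can be compared against a brute-force sum over all $x'$ for a small system. The sketch below uses plain Python with $N=4$ and $\beta=0.7$ (arbitrary illustrative values) and assumes periodic boundary conditions in $E_h$.

```python
import itertools
import math

N, beta = 4, 0.7   # hypothetical small system
configs = list(itertools.product((0, 1), repeat=N))

def epsilon(a, b):
    """Bond energy -(-1)^(a + b)."""
    return -((-1) ** (a + b))

def E_h(x):
    """Intra-layer energy with periodic boundary (an assumption of this sketch)."""
    return sum(epsilon(x[i], x[(i + 1) % N]) for i in range(N))

def E_v(xp, x):
    """Inter-layer energy between x and x'."""
    return sum(epsilon(x[i], xp[i]) for i in range(N))

max_rel_error = 0.0
for x in configs:
    brute = sum(math.exp(-beta * (E_h(x) + E_v(xp, x))) for xp in configs)
    closed = (2 * math.cosh(beta)) ** N * math.exp(-beta * E_h(x))
    max_rel_error = max(max_rel_error, abs(brute - closed) / closed)

print(max_rel_error)
```

Each site contributes $e^{\beta}+e^{-\beta}=2\cosh\beta$ regardless of the value of $x_i$, which is why the sum over $x'$ does not depend on $x$ beyond the factor $e^{-\beta E_h[x]}$.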
Then the loss function can be estimated by sampling
$$\begin{split}
\mathcal{L}&=-\sum_{[x']}p_\text{dat}[x']\log p_\text{std}[x']\\
&=-Z_\text{dat}^{-1}\sum_{[x']}\sum_{[x]}T[x',x]p_\text{tch}[x]\log p_\text{std}[x']\\
&=-Z_\text{dat}^{-1}\sum_{[x']}\sum_{[x]}p[x'|x]p_\text{tch}[x]T[x]\log p_\text{std}[x']\\
&=-\sum_{x\sim p_\text{tch}[x]}\sum_{x'\sim p[x'|x]}\bar{T}[x]\log p_\text{std}[x'],
\end{split}$$
where we have absorbed $Z_\text{dat}^{-1}$ into $T[x]$ to define the normalized reweighting factor
$$\begin{split}
\bar{T}[x]&=Z_\text{dat}^{-1}T[x]=\frac{T[x]}{\sum_{x\sim p_\text{tch}[x]}T[x]}\\
&=\mathsf{softmax}\big(-\beta E_h[x]\big).
\end{split}$$
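The softmax form follows because the constant prefactor $(2\cosh\beta)^N$ cancels between numerator and denominator when normalizing over a batch. The sketch below checks this on a hypothetical batch of three teacher samples in plain Python, again assuming periodic boundary conditions in $E_h$.

```python
import math

N, beta = 4, 0.7
batch = [(0, 0, 0, 0), (0, 1, 0, 1), (1, 1, 0, 0)]   # hypothetical teacher samples

def E_h(x):
    """Intra-layer energy with periodic boundary (an assumption of this sketch)."""
    return sum(-((-1) ** (x[i] + x[(i + 1) % N])) for i in range(N))

# Direct normalization of T[x] over the batch.
T = [(2 * math.cosh(beta)) ** N * math.exp(-beta * E_h(x)) for x in batch]
ratio = [t / sum(T) for t in T]

# Softmax of -beta * E_h over the batch (maximum subtracted for numerical stability).
logits = [-beta * E_h(x) for x in batch]
m = max(logits)
exps = [math.exp(l - m) for l in logits]
softmax = [e / sum(exps) for e in exps]

print(ratio, softmax)
```

The two weight vectors coincide, and the fully aligned sample receives the largest weight.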
The sampling $x\sim p_\text{tch}[x]$ is provided by the teacher machine. Given $x$, the conditional sampling $x'\sim p[x'|x]$ can be done independently on each site, because $p[x'|x]$ factorizes as
$$p[x'|x]=\prod_{i}p(x'_i|x_i)=\prod_i \mathsf{Bernoulli}\Big(x'_i\Big|p_i=\frac{e^{-\beta(-1)^{x_i}}}{2\cosh\beta}\Big).$$
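The factorized form can be compared with the direct definition $p[x'|x]=T[x',x]/T[x]$ on a small system. This is a plain-Python check with $N=3$ and $\beta=0.7$ as arbitrary illustrative values; the helper names are introduced here only for the check.

```python
import itertools
import math

N, beta = 3, 0.7   # hypothetical small system
configs = list(itertools.product((0, 1), repeat=N))

def p_one(xi):
    """Probability that x'_i = 1 given x_i."""
    return math.exp(-beta * (-1) ** xi) / (2 * math.cosh(beta))

def conditional(xp, x):
    """Product of independent Bernoulli factors, one per site."""
    prob = 1.0
    for xpi, xi in zip(xp, x):
        prob *= p_one(xi) if xpi == 1 else 1 - p_one(xi)
    return prob

def E_v(xp, x):
    """Inter-layer energy between x and x'."""
    return sum(-((-1) ** (a + b)) for a, b in zip(x, xp))

max_error = 0.0
for x in configs:
    # E_h[x] cancels in the ratio T[x',x] / T[x], so only E_v is needed.
    norm = sum(math.exp(-beta * E_v(xp, x)) for xp in configs)
    for xp in configs:
        exact = math.exp(-beta * E_v(xp, x)) / norm
        max_error = max(max_error, abs(conditional(xp, x) - exact))

print(max_error, p_one(1), p_one(0))
```

Because $p_i>1/2$ when $x_i=1$ and $p_i<1/2$ when $x_i=0$, each $x'_i$ tends to copy $x_i$, as expected for a ferromagnetic coupling between layers.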

**Summary of algorithm**: repeat the following
- call the autoregressive model to sample $x$ from $p[x]$,
- deform $x$ to $x'$ according to $p[x'|x]$,
- evaluate $E_h[x]$ and apply a softmax over the batch to obtain the normalized reweighting factor $\bar{T}[x]$,
- evaluate $\log p[x']$ by the autoregressive model,
- construct the loss as the weighted average of the negative log likelihood,
- gradient descent to update the model parameters.

### Implementation

In [120]:
system_size = 8
batch_size = 50
beta = torch.tensor(1.)
model = AutoregressiveModel(system_size, depth=1)
optimizer = optim.Adam(model.parameters(), lr=1.)

In [127]:
train_loss = 0.
for epoch in range(500):
    x = model.sample(batch_size)
    xp = dist.Bernoulli(torch.exp(beta * (2*x - 1))/(2*torch.cosh(beta))).sample()
    weight = F.softmax(beta * torch.sum((2*x - 1) * (2*torch.roll(x,1,-1) - 1), axis=-1), dim=-1)
    loss = - torch.dot(weight, model.log_prob(xp))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 3.0632
loss : 3.1292
loss : 3.1386
loss : 3.2341
loss : 3.1658


In [129]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2839, -0.2881, -0.0459,  0.1264,  0.2709,  0.0937, -0.3106, -0.0176],
         [ 0.4038, -0.1238,  0.0674,  0.0220, -0.0936,  0.2206, -0.0978,  0.1528],
         [-0.1443, -0.2464, -0.3222, -0.2524, -0.3215,  0.2623,  0.0137,  0.1410],
         [-0.4081, -0.2061, -0.2841, -0.1977, -0.0361, -0.3329,  0.3470, -0.1668],
         [-0.1064,  0.3212, -0.3019,  0.1854,  0.0039,  0.2917, -0.2759,  0.0544],
         [ 0.6532, -0.3877, -0.5176,  0.2307, -1.2573,  0.0459,  0.1892,  0.1136],
         [-0.7275, -0.0952,  0.4583,  0.2729, -0.6448,  0.9197, -0.1680, -0.3020],
         [ 0.9403,  0.0281,  0.3529, -0.2006, -0.0541,  0.1504, -0.8032,  0.0806]],
        requires_grad=True), Parameter containing:
 tensor([-2.4687, -1.8263, -2.5109, -1.3041, -1.6607, -1.8299, -2.5613, -1.5085],
        requires_grad=True)]

The sampling approach suffers from serious mode collapse. Possible reasons:
- $x'$ is sampled based on $x$ from $p[x'|x]$, such that the statistical bias in $p_\text{tch}[x]$ is inherited and strengthened by the student,
- $H(p_\text{dat})$ is dropped from the loss function, such that the teacher is not penalized for its biased views,
- no gradient signal is passed to the teacher, so the teacher and the student are not trained in an adversarial manner.

#### Attempt 1. Stop sampling $x'$ from $x$

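Here $x'$ is drawn uniformly at random instead of from $p[x'|x]$, and the full factor $T[x',x]$ (including the inter-layer term) is used as the reweighting factor. The bias is reduced, but the result does not seem to improve much.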

In [137]:
system_size = 8
batch_size = 50
beta = torch.tensor(5.)
model = AutoregressiveModel(system_size, depth=1)
optimizer = optim.Adam(model.parameters(), lr=1.)

In [140]:
train_loss = 0.
for epoch in range(500):
    x = model.sample(batch_size)
    xp = dist.Bernoulli(0.5*torch.ones_like(x)).sample()
    Eh = torch.sum((2*x - 1) * (2*torch.roll(x,1,-1) - 1), axis=-1)
    Ev = torch.sum((2*x - 1) * (2*xp - 1), axis=-1)
    weight = F.softmax(beta * (Eh + Ev), dim=-1)
    loss = - torch.dot(weight, model.log_prob(xp))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 8.8025
loss : 7.5321
loss : 9.3219
loss : 7.7680
loss : 8.1940


In [142]:
model.sample(5)

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 0., 0., 1., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [0., 1., 0., 0., 0., 0., 1., 1.],
        [1., 0., 0., 0., 0., 0., 1., 1.]])

In [141]:
list(model.parameters())

[Parameter containing:
 tensor([[ 3.1932e-01,  1.9394e-01,  2.3532e-01, -3.1796e-01,  3.3664e-01,
           2.8224e-01,  9.1710e-02,  1.5669e-01],
         [-1.3078e+00, -1.8937e-01, -2.7857e-01,  1.3384e-01, -1.7499e-01,
           1.8454e-01,  1.1227e-01,  6.7450e-02],
         [-3.7059e+00, -2.8085e+00,  1.0216e-02,  2.2641e-01,  1.4878e-01,
           3.2933e-01,  2.3652e-02, -1.1739e-02],
         [ 1.0527e+00, -2.0240e+00, -3.1296e+00,  6.2482e-02, -2.6387e-01,
           6.5273e-02,  1.8213e-01,  3.1659e-01],
         [-8.2095e-01,  1.7295e+00, -5.2178e-01, -1.6847e+00,  8.6046e-02,
           2.4860e-01,  2.9677e-01, -3.3340e-01],
         [-2.2075e+00, -1.0281e+00,  1.1365e+00, -1.1479e+00, -5.0758e-01,
           1.2924e-01,  1.8656e-01,  2.1082e-01],
         [ 6.5598e+00,  7.6392e+00, -4.5960e-01, -5.4670e+00, -6.0558e+00,
          -5.5042e+00, -2.0659e-01,  5.9761e-02],
         [ 1.6007e+01, -3.0684e+00,  6.2235e-01, -8.1058e+00,  1.3110e+01,
           4.7536e+00,  4.6

## One-Hot Embedding and Gumbel Sampling

In [20]:
%run 'model.py'
model = AutoregressiveModel(5, depth=3)
model.rsample(3)

tensor([[0.3128, 0.6872],
        [0.2740, 0.7260],
        [0.1106, 0.8894]], grad_fn=<SoftmaxBackward>)
tensor([[0.0054, 0.9946],
        [0.1542, 0.8458],
        [0.1772, 0.8228]], grad_fn=<SoftmaxBackward>)
tensor([[0.3051, 0.6949],
        [0.4567, 0.5433],
        [0.9316, 0.0684]], grad_fn=<SoftmaxBackward>)
tensor([[0.6841, 0.3159],
        [0.3236, 0.6764],
        [0.3363, 0.6637]], grad_fn=<SoftmaxBackward>)
tensor([[0.5702, 0.4298],
        [0.3915, 0.6085],
        [0.0899, 0.9101]], grad_fn=<SoftmaxBackward>)


tensor([[1., 0., 0., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 1., 1., 0., 1.]], grad_fn=<SelectBackward>)