In [89]:
%run 'model.py'

# Reverse KL Model

## Theoretical Framework

### Problem Statement

The goal of this project is to develop a machine learning algorithm to simulate statistical mechanical models. The statistical mechanical problem is given by its **transfer matrix** $T(x,x')$, which describes the Boltzmann weight associated with the configurations $x$ and $x'$ across a single layer. Given $T(x,x')$, the task is to find the **stationary distribution** $p(x)$, s.t.
$$p'(x)\equiv \frac{\sum_{x'}T(x,x')p(x')}{\sum_{x'}T(x')p(x')} \to p(x),$$
where $T(x')=\sum_{x}T(x,x')$ is marginalized over $x$ configurations.
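For a system small enough to enumerate, the fixed point can be found directly by power iteration, which gives a reference to compare a learned model against. The following is a minimal sketch under stated assumptions: a random positive $4\times 4$ matrix stands in for $T$, and the helper name `stationary_distribution` is illustrative only.

```python
import torch

def stationary_distribution(T, iters=200):
    """Iterate p' = T p / sum(T p), starting from the uniform distribution."""
    p = torch.full((T.shape[1],), 1.0 / T.shape[1], dtype=T.dtype)
    for _ in range(iters):
        Tp = T @ p          # numerator: sum_{x'} T(x, x') p(x')
        p = Tp / Tp.sum()   # denominator: sum_{x'} T(x') p(x'), with T(x') = sum_x T(x, x')
    return p

torch.manual_seed(0)
T = torch.rand(4, 4, dtype=torch.float64) + 0.1   # hypothetical positive transfer matrix
p = stationary_distribution(T)
Tp = T @ p
residual = (Tp / Tp.sum() - p).abs().max()        # vanishes at the fixed point
print(p, residual)
```

For a strictly positive matrix the iteration converges to the leading eigenvector, normalized to sum to one. The state space grows exponentially with system size, which is why a generative model is used instead.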

### Proposed Approach

Introduce an **autoregressive generative model** to represent the probability distribution $p(x)$. Finding the stationary distribution amounts to minimizing the *KL divergence* between $p'$ and $p$,
$$\begin{split}
\mathcal{L}&=\mathsf{KL}(p||p')=\sum_{x}p(x)\ln\frac{p(x)}{p'(x)}\\
&=\sum_{x}p(x)\Big(\ln p(x)-\ln\sum_{x'}T(x,x')p(x')+\ln\sum_{x'}T(x')p(x')\Big)\\
&=\sum_{x\sim p}\Big(\ln p(x)-\ln\sum_{x'\sim p}T(x,x')+\ln\sum_{x'\sim p}T(x')\Big).
\end{split}$$
The loss function is calculated by **dual sampling** from both sides of the transfer matrix. The sampling must be *reparameterized* to enable gradient backpropagation. Since the sampling is discrete, the **Gumbel sampling** technique should be applied.
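To make the reparameterization concrete, here is a small sketch showing that a gradient reaches the logits through a discrete sample drawn with `F.gumbel_softmax`. The logits, the weighting vector and the temperature are arbitrary illustrative choices, not values from this project.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.zeros(3, 2, requires_grad=True)   # 3 sites, 2 states each (illustrative)

# hard=True returns one-hot samples in the forward pass but uses the
# soft relaxation in the backward pass (straight-through estimator)
x = F.gumbel_softmax(logits, tau=0.5, hard=True)

# any differentiable function of the sample; here an arbitrary per-state weighting
weights = torch.tensor([1.0, -1.0])
value = (x * weights).sum()
value.backward()

print(x)             # one-hot rows
print(logits.grad)   # gradient reached the logits through the sample
```

Because the softmax is invariant under a shift of the logits at each site, the gradient entries at each site sum to zero.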

### One-Hot Encoding

To work with Gumbel sampling, the configurations must be **one-hot** encoded. For $x=(x_1,x_2,\cdots,x_N)$, each $x_i$ is now encoded as a vector
$$\uparrow=(1,0),\quad \downarrow=(0,1).$$
This allows us to extend our model to more general statistical mechanics problems where the on-site degree of freedom has more than two states. The challenge, however, is to implement an autoregressive linear transformation that works with multiple internal features.
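As an illustration of the encoding, the sketch below converts integer state labels to the flattened (batch, units × features) one-hot layout and back. The convention 0 = up, 1 = down follows the vectors above; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# two configurations of N = 4 spins, stored as state indices (0 = up, 1 = down)
states = torch.tensor([[0, 1, 1, 0],
                       [1, 1, 0, 0]])
q = 2  # number of on-site states; q > 2 would cover models with more states per site

# one-hot encode each site, then flatten to (batch, N * q)
x = F.one_hot(states, num_classes=q).float().reshape(states.shape[0], -1)
print(x)

# decoding: view as (batch, N, q) and take the argmax over the state axis
decoded = x.view(-1, states.shape[1], q).argmax(-1)
print(decoded)
```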

## Multi-Feature Autoregressive Linear Layer

### Masked Weight

With multiple features, we can no longer simply call `torch.tril` to construct the lower-triangular matrix; the mask must be created explicitly. The strategy is to first build a four-way tensor as the tensor (outer) product of the lower-triangular matrix with an all-ones block matrix. The tensor is then transposed and reshaped into matrix form.

In [26]:
(units,in_features,out_features)=(4,2,3)
mask = torch.tensordot(torch.tril(torch.ones(units, units), diagonal=-1), torch.ones(out_features, in_features), dims=0).transpose(1,2).reshape(units*out_features, units*in_features)
mask

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.]])
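The same block structure is what a Kronecker product produces, so the construction can be cross-checked against `torch.kron`. This is only a sanity check, assuming a PyTorch version that provides `torch.kron`; the names `lower`, `block` and `mask_kron` are illustrative.

```python
import torch

units, in_features, out_features = 4, 2, 3
lower = torch.tril(torch.ones(units, units), diagonal=-1)
block = torch.ones(out_features, in_features)

# construction from the text: outer product, reorder axes, flatten to a matrix
mask = torch.tensordot(lower, block, dims=0).transpose(1, 2).reshape(
    units * out_features, units * in_features)

# Kronecker product: every entry of `lower` is expanded into an
# (out_features x in_features) block
mask_kron = torch.kron(lower, block)

print(torch.equal(mask, mask_kron))
```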

Check that mask storage is contiguous.

In [27]:
mask.is_contiguous()

True

Create bare weight matrix.

In [39]:
weight = torch.ones(units*out_features, units*in_features, requires_grad=True)
weight

tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]], requires_grad=True)

Mask the weight matrix with the mask.

In [40]:
masked_weight = mask * weight
masked_weight

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.]], grad_fn=<MulBackward0>)

The masked weight matrix can be used in the loss function. The gradient signal is also masked automatically.

In [37]:
loss = (masked_weight).norm()**2
loss.backward()
weight.grad

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [2., 2., 0., 0., 0., 0., 0., 0.],
        [2., 2., 0., 0., 0., 0., 0., 0.],
        [2., 2., 0., 0., 0., 0., 0., 0.],
        [2., 2., 2., 2., 0., 0., 0., 0.],
        [2., 2., 2., 2., 0., 0., 0., 0.],
        [2., 2., 2., 2., 0., 0., 0., 0.],
        [2., 2., 2., 2., 2., 2., 0., 0.],
        [2., 2., 2., 2., 2., 2., 0., 0.],
        [2., 2., 2., 2., 2., 2., 0., 0.]])

### Implementation

In [972]:
class AutoregressiveLinear(nn.Linear):
    """ Applies a linear transformation to the incoming data, 
        with the weight matrix masked to the lower-triangle. 
        
        Args:
        units: number of autoregressive units
        in_features: number of input features per unit
        out_features: number of output features per unit
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``
        diagonal: the diagonal to truncate to"""
    
    def __init__(self, units, in_features, out_features, bias=True, diagonal=0):
        super(AutoregressiveLinear, self).__init__(units*in_features, units*out_features, bias)
        self.units = units
        self.in_features = in_features
        self.out_features = out_features
        self.diagonal = diagonal
        self.mask = torch.tensordot(torch.tril(torch.ones(units, units), diagonal), torch.ones(out_features, in_features), dims=0).transpose(1,2).reshape(units*out_features, units*in_features)

    
    def extra_repr(self):
        return 'units={}, in_features={}, out_features={}, bias={}, diagonal={}'.format(self.units, self.in_features, self.out_features, self.bias is not None, self.diagonal)
    
    # overwrite forward pass
    def forward(self, input):
        return F.linear(input, self.mask * self.weight, self.bias)
    
    def forward_at(self, input, i):
        # pick out the weight block that is active
        active_weight = self.weight.narrow(0, i*self.out_features, self.out_features) # narrow out the rows
        active_weight = active_weight.narrow(1, 0, (i + 1 + self.diagonal)*self.in_features) # narrow out the columns
        # pick out the input block that is active
        active_input = input.narrow(-1, 0, (i + 1 + self.diagonal)*self.in_features)
        # transform active input by active weight
        output = active_input.matmul(active_weight.t())
        if self.bias is not None: # if bias exists, add it
            output += self.bias.narrow(0, i*self.out_features, self.out_features)
        return output

Create an autoregressive linear layer.

In [409]:
al = AutoregressiveLinear(4,2,3, diagonal=-1)

Forward at a particular unit.

In [410]:
data = torch.rand([5,8])
al.forward_at(data, 1)

tensor([[ 0.0791, -0.0960, -0.5842],
        [ 0.1090, -0.1339, -0.6587],
        [-0.1859, -0.0595, -0.6251],
        [ 0.0702, -0.1277, -0.6627],
        [-0.0785, -0.0822, -0.6271]], grad_fn=<AddBackward0>)
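A useful consistency check is that the unit-targeted pass reproduces the corresponding slice of the full masked transformation. The sketch below repeats the narrowing logic of `forward_at` in functional form, without bias and with random weights, so it runs on its own; it is a check of the idea rather than of the class itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
units, in_features, out_features, diagonal = 4, 2, 3, -1
weight = torch.randn(units * out_features, units * in_features)
mask = torch.tensordot(torch.tril(torch.ones(units, units), diagonal),
                       torch.ones(out_features, in_features),
                       dims=0).transpose(1, 2).reshape(weight.shape)
data = torch.rand(5, units * in_features)

# full masked transformation (bias omitted for brevity)
full = F.linear(data, mask * weight)

i = 2                                          # unit to evaluate
n_active = (i + 1 + diagonal) * in_features    # input columns visible to unit i
rows = weight.narrow(0, i * out_features, out_features)
active_weight = rows.narrow(1, 0, n_active)
partial = data.narrow(-1, 0, n_active).matmul(active_weight.t())

expected = full[:, i * out_features:(i + 1) * out_features]
print(torch.allclose(expected, partial, atol=1e-6))
```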

Test gradient signals for unit-targeted forward pass. Gradient signals are masked automatically.

In [414]:
al.zero_grad()
loss = al.forward_at(data, 2).norm()**2
loss.backward()
[p.grad for p in al.parameters()]

[tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.1247, -0.4670, -0.4633, -0.3792,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 2.5142,  1.7494,  2.9878,  1.3875,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 1.0165,  1.0129,  1.4145,  0.9168,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000]]),
 t

## Autoregressive Model

### Implementation

- The implementation works with one-hot encodings of spin configurations. An input sample has size (batch size, unit size x feature size), with entries $x_{i\alpha}^{(n)}$: $n$ - sample index, $i$ - unit index, $\alpha$ - feature index.
- On the evaluation side, log probability is evaluated as 
$$\log p(x)=\sum_{i,\alpha}x_{i\alpha}\log p_{i\alpha},$$
which generalizes to the case when $x$ is soft.
- On the sampling side, $\log p_{i\alpha}$ will be calculated autoregressively and $x_{i\alpha}$ is sampled by
$$x_{i\alpha}\sim \frac{\exp((\log p_{i\alpha}+g_{i\alpha})/\tau)}{\sum_{\beta}\exp((\log p_{i\beta}+g_{i\beta})/\tau)},$$
where $g_{i\alpha}$ is independently sampled from Gumbel distribution and $\tau$ is the temperature parameter to control the softness.
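The two formulas above can be written out by hand as follows. This sketch assumes fixed, uniform per-site log-probabilities so that it stays self-contained; in the actual model $\log p_{i\alpha}$ depends on the previously sampled units.

```python
import math
import torch

torch.manual_seed(0)
units, states, tau = 4, 2, 0.5
# hypothetical per-site log-probabilities log p_{i,alpha}; uniform here
log_p = torch.full((units, states), -math.log(states))

# Gumbel noise g = -log(-log(u)) with u ~ Uniform(0, 1); clamp avoids log(0)
u = torch.rand(units, states).clamp_min(1e-10)
g = -torch.log(-torch.log(u))

# soft sample: softmax over the state index of (log p + g) / tau
x = torch.softmax((log_p + g) / tau, dim=-1)

# log-probability: sum_{i,alpha} x_{i,alpha} log p_{i,alpha}, also defined for soft x
log_prob = (x * log_p).sum()
print(x)
print(log_prob)
```

Lowering `tau` pushes each row of `x` toward a one-hot vector, while larger values give softer samples.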

In [1000]:
class AutoregressiveModel(nn.Module):
    """ Represent a generative model that can generate samples and provide log probability evaluations.
        
        Args:
        units: number of units in the model
        features: a list of feature dimensions from the input layer to the output layer
        nonlinearity: activation function to use """
    
    def __init__(self, units, features, nonlinearity='ReLU'):
        super(AutoregressiveModel, self).__init__()
        self.units = units
        self.features = features
        if features[0] != features[-1]:
            raise ValueError('In features {}, the first and last feature dimensions must be equal.'.format(features))
        self.layers = nn.ModuleList()
        for l in range(len(features)-1):
            if l == 0: # first autoregressive linear layer must have diagonal=-1
                layer = AutoregressiveLinear(units, features[0], features[1], bias=False, diagonal=-1)
                #layer.bias.requires_grad = False
                #layer.bias.data.fill_(-1./features[0])
                #layer.weight.data.fill_(0.)
                self.layers.append(layer)
            else: # remaining autoregressive linear layers have diagonal=0 (by default)
                self.layers.append(getattr(nn, nonlinearity)())
                layer = AutoregressiveLinear(units, features[l], features[l+1], bias=False)
                #layer.weight.data.fill_(0.)
                self.layers.append(layer)
    
    def extra_repr(self):
        return '(units): {}\n(features): {}'.format(self.units, self.features) + super(AutoregressiveModel, self).extra_repr()
    
    def forward(self, input):
        logits = input # logits as a workspace, initialized to input
        for layer in self.layers: # apply layers
            logits = layer(logits)
        return logits # logits output
    
    def log_prob(self, input):
        logits = self(input).view(-1, self.units, self.features[-1]) # forward pass to get logits
        input = input.view(-1, self.units, self.features[0])
        return torch.sum(F.log_softmax(logits, dim=-1) * input, (-2,-1))
        
    def _xsample(self, batch_size, tau, hard):
        # create a list to host layer-wise outputs
        record = [torch.empty(batch_size, 0) for _ in self.features]
        # autoregressive batch sampling
        for i in range(self.units):
            for l in range(len(self.features)-1):
                if l==0: # first linear layer
                    output = self.layers[0].forward_at(record[0], i)
                else: # remaining layers
                    output = self.layers[2*l-1](output) # element-wise layer
                    record[l] = torch.cat([record[l], output], dim=-1) # concatenate output to record
                    output = self.layers[2*l].forward_at(record[l], i)
            # record[-1] = torch.cat([record[-1],  output], dim=-1) # for debug purpose
            # the last output hosts logits, sample by Gumbel softmax 
            sample = F.gumbel_softmax(output, tau, hard)
            record[0] = torch.cat([record[0], sample], dim=-1) # concatenate sample to record
        return record
    
    def rsample(self, batch_size=1, tau=None, hard=False):
        if tau is None: # if temperature not given
            tau = 1/(self.features[-1]-1) # set by the out feature dimension
        return self._xsample(batch_size, tau, hard)[0]
    
    def sample(self, batch_size=1, tau=None, hard=False):
        with torch.no_grad(): # no gradient for sample generation
            return self.rsample(batch_size, tau, hard)

Evaluate log probability.

In [448]:
model = AutoregressiveModel(4, [2, 3, 2])
data = torch.bernoulli(0.5*torch.ones([5,4]))
data = torch.stack([data,1-data]).permute(1,2,0).reshape(5,8)
print(data)
model.log_prob(data)

tensor([[1., 0., 0., 1., 0., 1., 1., 0.],
        [1., 0., 1., 0., 0., 1., 1., 0.],
        [1., 0., 1., 0., 0., 1., 0., 1.],
        [1., 0., 1., 0., 1., 0., 0., 1.],
        [1., 0., 0., 1., 0., 1., 0., 1.]])


tensor([-2.4279, -2.3005, -2.8079, -3.3546, -3.0945], grad_fn=<SumBackward1>)

Reparametrized sampling by Gumbel softmax. Hard samples and soft samples.

In [449]:
model.rsample(5, hard=True)

tensor([[1., 0., 0., 1., 1., 0., 1., 0.],
        [0., 1., 0., 1., 1., 0., 1., 0.],
        [0., 1., 0., 1., 1., 0., 0., 1.],
        [0., 1., 1., 0., 1., 0., 0., 1.],
        [1., 0., 0., 1., 1., 0., 1., 0.]], grad_fn=<CatBackward>)

In [450]:
model.rsample(5, hard=False)

tensor([[0.0236, 0.9764, 0.7538, 0.2462, 0.2681, 0.7319, 0.7210, 0.2790],
        [0.8046, 0.1954, 0.7568, 0.2432, 0.4118, 0.5882, 0.3657, 0.6343],
        [0.5505, 0.4495, 0.3652, 0.6348, 0.9307, 0.0693, 0.8682, 0.1318],
        [0.9299, 0.0701, 0.5247, 0.4753, 0.2937, 0.7063, 0.8528, 0.1472],
        [0.4403, 0.5597, 0.6384, 0.3616, 0.3969, 0.6031, 0.9476, 0.0524]],
       grad_fn=<CatBackward>)

**Issues to Address**: 
- How should the temperature parameter $\tau$ be selected? Will it alter the universality class?
- How do we pin the critical point? Can we enforce criticality by imposing the duality as a symmetry?
- How does the computational complexity scale with system size and state space dimension?

## Statistical Mechanics System

### Framework

- **Bond Weight**. The statistical mechanics model is defined by the statistical weight between two spins. As the spin state is one-hot encoded, the statistical weight can be represented as a matrix. For the Ising model,
$$W(\beta)=\begin{bmatrix}e^{\beta} & e^{-\beta}\\ e^{-\beta} & e^{\beta}\end{bmatrix}.$$

- **Transfer Weight**. Transfer matrix element $T(x,x')$, evaluated as
$$T(x,x')=\prod_{i}(x'_i)^\intercal W x'_{i+1}\prod_{i}x_i^\intercal W x'_i.$$

- **Marginalized Transfer Weight**. Marginalize over $x$, $T(x')=\sum_x T(x,x')$, evaluated as
$$T(x')=\prod_{i}(x'_i)^\intercal W x'_{i+1}\prod_{i} w^\intercal x'_i,$$
where $w_\beta=\sum_{\alpha}W_{\alpha\beta}$.
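The closed form for the marginalized weight can be verified by brute force on a tiny system. The sketch below assumes periodic boundary conditions along the chain and uses hypothetical helper names (`one_hot_configs`, `transfer_weight`, `marginal_weight`); it evaluates single configurations rather than batches.

```python
import itertools
import torch
import torch.nn.functional as F

def one_hot_configs(units, states):
    """All states**units configurations, one-hot encoded: (n_configs, units, states)."""
    idx = torch.tensor(list(itertools.product(range(states), repeat=units)))
    return F.one_hot(idx, states).double()

def transfer_weight(x, xp, W):
    """T(x, x') for single configurations x, xp of shape (units, states)."""
    horizontal = ((xp @ W) * xp.roll(-1, 0)).sum(-1).prod()   # x'_i^T W x'_{i+1}, periodic
    vertical = ((x @ W) * xp).sum(-1).prod()                  # x_i^T W x'_i
    return horizontal * vertical

def marginal_weight(xp, W):
    """T(x') from the closed form, with w the column sums of W."""
    horizontal = ((xp @ W) * xp.roll(-1, 0)).sum(-1).prod()
    return horizontal * (xp @ W.sum(0)).prod()

beta = 1.0
W = torch.exp(beta * torch.tensor([[1., -1.], [-1., 1.]], dtype=torch.float64))
configs = one_hot_configs(units=3, states=2)

xp = configs[5]                                   # an arbitrary x'
brute = sum(transfer_weight(x, xp, W) for x in configs)
closed = marginal_weight(xp, W)
print(brute.item(), closed.item())
```

The sum over $x$ factorizes site by site, which is why a single vector $w$ suffices.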

The class `StatMechSystem` evaluates $T(x,x')$ and $T(x')$ given the bond weight matrix $W$.

In [973]:
class StatMechSystem(nn.Module):
    ''' Provide evaluation for the transfer weight and its marginalization.
        
        Args:
        units: number of units in the model (system size)
        bond_weight: bond weight matrix'''
    
    def __init__(self, units, bond_weight):
        super(StatMechSystem, self).__init__()
        self.units = units
        self.W = bond_weight
        self.w = bond_weight.sum(0)
        self.states = len(bond_weight)
        
    def forward(self, *xs):
        # receive configurations and view in tensor form
        x = None
        if len(xs) == 1:
            xp = xs[0].view(1, -1, self.units, self.states)
        elif len(xs) == 2:
            x = xs[0].view(-1, 1, self.units, self.states)
            xp = xs[1].view(1, -1, self.units, self.states)
        else:
            raise ValueError('Expected 1 or 2 arguments, got {}.'.format(len(xs)))
        # compute the horizontal product
        Th = torch.prod(torch.sum(xp.matmul(self.W) * xp.roll(1, -2), -1), -1)
        # compute the vertical product
        if x is None:
            Tv = torch.prod(xp.matmul(self.w), -1)
        else:
            Tv = torch.prod(torch.sum(x.matmul(self.W) * xp, -1), -1)
        return Th * Tv

Define a function `Ising` to compute the bond weight matrix for the Ising model at inverse temperature $\beta$.

In [337]:
def Ising(beta):
    return torch.exp(beta * torch.tensor([[1.,-1.],[-1.,1.]]))
Ising(1.)

tensor([[2.7183, 0.3679],
        [0.3679, 2.7183]])
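Since the one-hot encoding allows more than two on-site states, other bond weights can be defined in the same way. As a sketch only, the hypothetical function `Potts` below (not part of the original notebook) assigns weight $e^{\beta}$ to equal neighbors and $1$ otherwise; for two states it reduces to the Ising weight up to a constant factor.

```python
import math
import torch

def Ising(beta):
    return torch.exp(beta * torch.tensor([[1., -1.], [-1., 1.]]))

def Potts(q, beta):
    """Hypothetical q-state Potts bond weight: e^beta for equal neighbors, 1 otherwise."""
    return torch.exp(beta * torch.eye(q))

beta = 1.0
# for q = 2 the Potts weight at coupling 2*beta equals the Ising weight times
# the constant e^beta, which cancels in normalized probabilities
same = torch.allclose(Potts(2, 2 * beta), Ising(beta) * math.exp(beta))
print(same)
print(Potts(3, 0.5))
```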

### Construct Loss Function

Now we have all the ingredients ready to construct the loss function
$$\mathcal{L}=\sum_{x\sim p}\bigg(\ln p(x)-\ln\frac{\sum_{x'\sim p}T(x,x')}{\sum_{x'\sim p}T(x')}\bigg).$$

- Create an **autoregressive model**.

In [974]:
system_size = 4
model = AutoregressiveModel(system_size, [2, 3, 2])

- Set up the **statistical mechanical system** of the same system size for the Ising model at a given inverse temperature.

In [975]:
T = StatMechSystem(system_size, Ising(1.))

- **Dual sampling**. Draw samples for $x$ and $x'$ independently from the autoregressive model. Each row is a sample of size `system_size` x 2, which can be further partitioned into a list of two-component state vectors.

In [969]:
batch_size = 5
x = model.rsample(batch_size, hard=True)
xp = model.rsample(batch_size, hard=True)
print(x)
print(xp)

tensor([[1., 0., 1., 0., 1., 0., 1., 0.],
        [1., 0., 0., 1., 1., 0., 0., 1.],
        [0., 1., 1., 0., 1., 0., 0., 1.],
        [0., 1., 0., 1., 0., 1., 0., 1.],
        [1., 0., 0., 1., 0., 1., 0., 1.]], grad_fn=<CatBackward>)
tensor([[1., 0., 1., 0., 0., 1., 0., 1.],
        [1., 0., 1., 0., 0., 1., 0., 1.],
        [0., 1., 1., 0., 1., 0., 0., 1.],
        [0., 1., 0., 1., 1., 0., 1., 0.],
        [0., 1., 0., 1., 0., 1., 0., 1.]], grad_fn=<CatBackward>)


- **Log probability** is evaluated by the autoregressive model. Result is a vector of batch size. Each element is $\ln p(x)$ for a sample of $x$.

In [964]:
model.log_prob(x)

tensor([-2.7726, -2.7726, -2.7726, -2.7726, -2.7726], grad_fn=<SumBackward1>)

- **Transfer Weight and its Marginalization** are evaluated by the statistical mechanical system. The transfer weight $T(x,x')$ is stored as a matrix for each pair of $(x,x')$. The marginalized transfer weight $T(x')$ is stored as a vector for each sample of $x'$.

In [965]:
print(T(x, xp))
print(T(xp))

tensor([[1.0000e+00, 5.4598e+01, 1.0000e+00, 1.3534e-01, 5.4598e+01],
        [1.0000e+00, 5.4598e+01, 1.0000e+00, 7.3891e+00, 1.8316e-02],
        [1.3534e-01, 4.0343e+02, 7.3891e+00, 1.0000e+00, 7.3891e+00],
        [7.3891e+00, 7.3891e+00, 1.3534e-01, 1.8316e-02, 7.3891e+00],
        [1.0000e+00, 5.4598e+01, 1.0000e+00, 1.3534e-01, 5.4598e+01]],
       grad_fn=<MulBackward0>)
tensor([[  90.7140, 4952.8169,   90.7140,   90.7140,   90.7140]],
       grad_fn=<MulBackward0>)


Assemble these terms to construct the **loss function**, then use PyTorch automatic differentiation to compute its gradient with respect to the model parameters. The parameters indeed receive the gradient signal as expected.

In [976]:
x = model.rsample(batch_size)
xp = model.rsample(batch_size)
loss = model.log_prob(x).sum() - torch.log(T(x, xp).sum(1)/T(xp).sum(1)).sum()
print('loss: ', loss)
loss.backward()
[para.grad for para in model.parameters()]

loss:  tensor(-0.4376, grad_fn=<SubBackward0>)


[tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0565, -0.1226,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0068, -0.0653,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0916, -0.0632,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0467, -0.0311, -0.1466,  0.0687,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0280, -0.0982, -0.0227, -0.1034,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0825, -0.0589, -0.2415,  0.1002,  0.0000,  0.0000,  0.0000,  0.0000],
         [-0.0220,  0.0199, -0.0911,  0.0890, -0.0802,  0.0781,  0.0000,  0.0000],
         [-0.0053,  0.0037, -0.0221,  0.0204, -0.0198,  0.0181,  0.0000,  0.0000],
         [-0.0274,  0.0226, -0.1176,  0.1129, -0.1043,  0.0996,  0.0000,  0.0000]]),
 t

### Test

**Observations**:
- Model still suffers from mode collapse.
- Gradient signal has a large variance. 

**Possible Problems**: the $\ln\sum_{x'\sim p}T(x,x')$ and $\ln\sum_{x'\sim p}T(x')$ terms cannot be reliably estimated because of the large fluctuations of $T$. The design of the loss function needs to be improved: the ensemble average should be moved out of the logarithm.
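The effect can be illustrated without the model at all. The sketch below uses log-normal random numbers as a hypothetical stand-in for strongly fluctuating weights (it does not use the actual $T$) and compares the logarithm of small-batch averages with the exact logarithm of the mean.

```python
import torch

torch.manual_seed(0)
sigma, batch_size, trials = 2.0, 5, 20000

# hypothetical stand-in for T: log-normal weights, whose exact mean is exp(sigma^2 / 2)
weights = torch.exp(sigma * torch.randn(trials, batch_size, dtype=torch.float64))

log_of_mean = torch.log(weights.mean(1))   # one small-batch estimate per trial
exact = sigma ** 2 / 2                     # ln E[T]

print('exact ln E[T]:        ', exact)
print('average of estimates: ', log_of_mean.mean().item())
print('spread of estimates:  ', log_of_mean.std().item())
```

By Jensen's inequality the logarithm of a batch average underestimates the logarithm of the true mean on average, and the estimate is noisy when a few large weights dominate the sum.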

In [1001]:
(system_size, batch_size) = (2, 200)
model = AutoregressiveModel(system_size, [2, 10, 10, 2])
T = StatMechSystem(system_size, Ising(1.))
optimizer = optim.Adam(model.parameters(), lr=1.)

In [1007]:
train_loss = 0.
tau = 0.3
for epoch in range(500):
    x = model.rsample(batch_size, tau)
    xp = model.rsample(batch_size, tau)
    loss = model.log_prob(x).sum() - torch.log(T(x, xp).sum(1)/T(xp).sum(1)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss: {:.4f}, correlation: {:.4f}'.format(train_loss / 100, torch.prod(model.sample(10000, tau, True).matmul(torch.tensor([[1.,0.],[-1.,0.],[0.,1.],[0.,-1.]])),-1).mean()))
        train_loss = 0.

loss: 7.1227, correlation: -0.0108
loss: 7.0497, correlation: -0.0054
loss: 7.3200, correlation: -0.0038
loss: 7.9111, correlation: -0.0232
loss: 7.1650, correlation: -0.0198


In [982]:
[p for p in model.parameters()]

[Parameter containing:
 tensor([[-0.3940, -0.3310, -0.3502,  0.0706],
         [-0.3174,  0.1132,  0.0914,  0.3068],
         [ 0.4851, -0.3601, -0.2903, -0.2764],
         [ 0.1664,  0.0720, -0.4915, -0.4049],
         [ 0.1768, -0.3020,  0.1748,  0.2280],
         [ 0.0121, -0.1255,  0.1490, -0.0668],
         [-0.4389,  0.4183, -0.2526,  0.1294],
         [ 0.4986,  0.4828,  0.4501,  0.3138]], requires_grad=True),
 Parameter containing:
 tensor([[ 0.1909,  0.2095,  0.1673,  0.1185,  0.1551, -0.0034,  0.2860, -0.0545],
         [-0.2146, -0.0631, -0.0016,  0.2494, -0.1496, -0.2477, -0.3377,  0.0333],
         [ 0.3149,  0.1239, -0.0614,  0.2978, -0.2061,  0.2704, -0.0429, -0.2942],
         [ 0.3032,  0.0909,  0.3276,  0.1619, -0.1066,  0.0306,  0.2628,  0.1639]],
        requires_grad=True)]

In [1006]:
model.sample(10, tau, True)

tensor([[1., 0., 0., 1.],
        [0., 1., 0., 1.],
        [0., 1., 0., 1.],
        [0., 1., 0., 1.],
        [0., 1., 0., 1.],
        [0., 1., 0., 1.],
        [1., 0., 1., 0.],
        [0., 1., 0., 1.],
        [0., 1., 0., 1.],
        [0., 1., 1., 0.]])

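Given a generative model $p(x)$, the mean energy is

$\bar{E} = \sum_x p(x)E(x) =\sum_{x\sim p} E(x)$

Free energy $F = \sum_x p(x)E(x) - T (-\sum_{x} p(x)\ln p(x))=\sum_{x\sim p}(E(x)+ T \ln p(x))$, where $T$ here denotes the temperature rather than the transfer matrix.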
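For a system small enough to enumerate, the free energy functional can be checked exactly: it equals $-T\ln Z$ at the Boltzmann distribution and is larger for any other $p$. The sketch below assumes a ring of four Ising spins with energy $E=-J\sum_i s_i s_{i+1}$; the parameter values and names are illustrative.

```python
import itertools
import torch

N, J, temperature = 4, 1.0, 1.0   # ring of 4 Ising spins; illustrative values

# enumerate all spin configurations s_i = +/-1 and their energies
spins = torch.tensor(list(itertools.product([1., -1.], repeat=N)), dtype=torch.float64)
energy = -J * (spins * spins.roll(-1, 1)).sum(1)

def free_energy(p):
    """F[p] = sum_x p(x) (E(x) + T ln p(x))."""
    return (p * (energy + temperature * torch.log(p))).sum()

exact = -temperature * torch.logsumexp(-energy / temperature, 0)   # -T ln Z
boltzmann = torch.softmax(-energy / temperature, 0)
uniform = torch.full_like(energy, 1.0 / len(energy))

print(exact.item(), free_energy(boltzmann).item(), free_energy(uniform).item())
```

With samples drawn from a model, the sum over $x$ would be replaced by a sample average of $E(x)+T\ln p(x)$.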