In [1]:
%run 'autoreg.py'

# Autoregressive Model

An autoregressive model is a generative model that expresses the joint probability distribution as a product of conditional distributions
$$p(x_1,x_2,x_3,\cdots)=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)\cdots = \prod_{i}p(x_i|x_1,\cdots,x_{i-1}).$$
The parameters of the conditional distributions are computed by neural networks. Because each conditional probability depends only on the preceding variables, the neural network must be masked so that it respects the same causal structure.
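
As a sanity check on the factorization above, here is a minimal, dependency-free sketch. The joint distribution over three binary variables is made up purely for illustration (it is not part of the notebook); the conditionals are obtained from it by marginalization and multiplied back together.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (weights are made up).
weights = {bits: 1 + i for i, bits in enumerate(product([0, 1], repeat=3))}
total = sum(weights.values())
joint = {bits: w / total for bits, w in weights.items()}

def marginal(prefix):
    """Probability that the first len(prefix) variables equal prefix."""
    return sum(p for bits, p in joint.items() if bits[:len(prefix)] == prefix)

def conditional(x, i):
    """p(x_i | x_1, ..., x_{i-1}) obtained from the joint by marginalization."""
    return marginal(x[:i + 1]) / marginal(x[:i])

def chain_rule(x):
    """Product of the conditionals, which should reproduce the joint."""
    result = 1.0
    for i in range(len(x)):
        result *= conditional(x, i)
    return result

# Largest deviation between the product of conditionals and the joint.
max_error = max(abs(chain_rule(x) - joint[x]) for x in joint)
print(max_error)
```

The deviation should be at the level of floating-point round-off, since the factorization is an identity rather than an approximation.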

## Autoregressive Linear Layer

A key component is to realize an autoregressive linear layer, which maps $x=(x_1,x_2,\cdots)$ to $y=(y_1,y_2,\cdots)$ via
$$y = W\cdot x + b,$$
respecting the causality that $y_i$ only depends on $x_1,\cdots,x_{i-1}$. This can be achieved by requiring the weight matrix $W$ to take a *strictly lower-triangular* form
$$W = \begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
W_{21} & 0 & 0 & \cdots & 0\\
W_{31} & W_{32} & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
W_{n1} & W_{n2} & W_{n3} & \cdots & 0\\
\end{bmatrix}$$
For the PyTorch realization, we first create a raw weight matrix, which is a full matrix, and then construct the actual weight matrix by keeping only its lower-triangular part. This can be implemented with `torch.tril`, which is differentiable, so gradients backpropagate through it.
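
To see concretely why a strictly lower-triangular weight enforces the causal structure, the following sketch uses plain Python lists instead of tensors. The helper names and the numerical values are hypothetical. It perturbs one input $x_k$ at a time and lists which outputs stay exactly the same.

```python
def strictly_lower(matrix):
    """Zero the diagonal and everything above it (the pattern of tril with diagonal=-1)."""
    n = len(matrix)
    return [[matrix[i][j] if j < i else 0.0 for j in range(n)] for i in range(n)]

def linear(weight, bias, x):
    """y = W x + b for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

n = 4
# Made-up full weight matrix with all entries nonzero, plus a made-up bias and input.
w_raw = [[1.0 + i + 0.5 * j for j in range(n)] for i in range(n)]
bias = [0.1, -0.2, 0.3, -0.4]
x = [0.5, -1.0, 2.0, 0.25]

w_masked = strictly_lower(w_raw)
y = linear(w_masked, bias, x)

def unchanged_outputs(k):
    """Indices i whose output y_i does not change when x_k is perturbed."""
    x_perturbed = list(x)
    x_perturbed[k] += 1.0
    y_perturbed = linear(w_masked, bias, x_perturbed)
    return [i for i in range(n) if y_perturbed[i] == y[i]]

for k in range(n):
    print(k, unchanged_outputs(k))
```

With zero-based indices, perturbing $x_k$ leaves $y_0,\cdots,y_k$ untouched, so each output only sees strictly earlier inputs.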

### Toy Example
Create a full weight matrix `w_full` and truncate it to the triangular weight matrix `w_tril`. The function `torch.tril` takes an argument `diagonal` that specifies the highest diagonal to keep (inclusive): `diagonal=0` keeps the main diagonal, while `diagonal=-1` keeps only the entries strictly below it.

In [2]:
w_full = torch.ones(3, 3, requires_grad = True)
w_tril = torch.tril(w_full, diagonal = -1)
print(w_full)
print(w_tril)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)
tensor([[0., 0., 0.],
        [1., 0., 0.],
        [1., 1., 0.]], grad_fn=<TrilBackward>)


We can then use the triangular weight matrix in the rest of the computation to evaluate a loss function. For example, take the loss to be the squared Frobenius norm of the triangular weight matrix (just to get some scalar score for the matrix).

In [3]:
loss = torch.sum(w_tril**2)
loss

tensor(3., grad_fn=<SumBackward0>)

Now we can backpropagate the gradient and check what gradient the raw weight matrix receives.

In [4]:
loss.backward()
w_full.grad

tensor([[0., 0., 0.],
        [2., 0., 0.],
        [2., 2., 0.]])

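We can see that the gradient is automatically masked as well. Each kept entry $w$ contributes $w^2$ to the loss and therefore receives the gradient $2w=2$, while the diagonal and the upper triangle do not receive any gradient signal. If we put `w_full` into an optimizer to minimize the loss, the strictly lower triangle of the weight matrix will be trained to zero (as favored by the loss function), and the remaining entries keep their initial values.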

In [5]:
optimizer = optim.Adam([w_full], lr = 0.1)
for epoch in range(500):
    w_tril = torch.tril(w_full, diagonal = -1)
    loss = torch.sum(w_tril**2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
w_full

tensor([[ 1.0000e+00,  1.0000e+00,  1.0000e+00],
        [-4.1566e-12,  1.0000e+00,  1.0000e+00],
        [-4.1566e-12, -4.1566e-12,  1.0000e+00]], requires_grad=True)

### Pack into a Torch Module

We can pack the above functionality into a Torch module that inherits from `nn.Linear`.

In [6]:
class AutoregressiveLinear(nn.Linear):
    """ Applies a linear transformation to the incoming data, 
        with the weight matrix masked to the lower triangle. 
        
        Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``
        diagonal: the highest diagonal to keep (inclusive), as in ``torch.tril``.
            Default: ``0``"""
    
    def __init__(self, in_features, out_features, bias=True, diagonal=0):
        super(AutoregressiveLinear, self).__init__(in_features, out_features, bias)
        self.diagonal = diagonal
    
    def extra_repr(self):
        return super(AutoregressiveLinear, self).extra_repr() + ', diagonal={}'.format(self.diagonal)
    
    # override the forward pass to use the masked weight
    def forward(self, input):
        return F.linear(input, torch.tril(self.weight, self.diagonal), self.bias)
    
    # compute only the i-th output feature (used for sequential sampling)
    def forward_at(self, input, i):
        output = input.matmul(torch.tril(self.weight, self.diagonal).narrow(0, i, 1).t())
        if self.bias is not None:
            output += self.bias.narrow(0, i, 1)
        return output.squeeze(-1) # drop only the singleton feature axis
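
The `forward_at` method is meant to return a single output feature of the full forward pass. The sketch below checks that property on a compact stand-in class. `MaskedLinear` is a hypothetical name used only here, it takes a single `features` argument, and it indexes the masked weight directly rather than using `narrow`, so it is an illustration under those assumptions and not the notebook's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Linear):
    """Hypothetical stand-in for AutoregressiveLinear, used only for this check."""

    def __init__(self, features, diagonal=0):
        super().__init__(features, features)
        self.diagonal = diagonal

    def forward(self, x):
        # Full forward pass with the weight masked to the lower triangle.
        return F.linear(x, torch.tril(self.weight, self.diagonal), self.bias)

    def forward_at(self, x, i):
        # Row i of the masked weight produces output feature i only.
        row = torch.tril(self.weight, self.diagonal)[i]
        return x.matmul(row) + self.bias[i]

torch.manual_seed(0)
layer = MaskedLinear(5, diagonal=-1)
x = torch.randn(4, 5)

full = layer(x)  # shape (4, 5)
columns = torch.stack([layer.forward_at(x, i) for i in range(5)], dim=1)
max_diff = (full - columns).abs().max().item()
print(max_diff)
```

Computing one feature at a time is what allows a sampler to fill in $x_i$ before evaluating the outputs that depend on it.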

To test this module, let us create some data. The target $y$ is related to the input $x$ by $y_i=\sum_{j=1}^{i}x_j$, which can be modeled by an autoregressive linear transformation (with `diagonal=0`) whose weight is a lower-triangular matrix with 1's on and below the main diagonal and whose bias is zero.

In [7]:
input = torch.randn(10, 5)
target = torch.cumsum(input, axis = 1)
input, target

(tensor([[ 1.2537,  1.2952, -0.7514,  1.6626,  0.2934],
         [-0.8714, -2.4656,  0.7002, -0.4921, -0.5424],
         [ 0.3850, -0.4022, -1.7203, -0.8830,  0.0989],
         [-2.1567,  1.3889, -1.0016,  0.4516,  0.4419],
         [ 0.6539,  1.0505,  0.3454,  1.2798, -0.0401],
         [-0.9391,  0.2523,  0.1364, -1.1801,  1.5261],
         [-0.8578,  0.0364, -0.0223, -0.0495, -1.2166],
         [ 0.4616,  2.1005, -0.1037, -0.9643,  0.7355],
         [ 0.3906, -0.9893,  0.9771, -0.6446, -0.0489],
         [ 1.2490,  1.2904, -1.4693, -1.5764,  0.8918]]),
 tensor([[ 1.2537,  2.5489,  1.7975,  3.4601,  3.7535],
         [-0.8714, -3.3371, -2.6368, -3.1289, -3.6714],
         [ 0.3850, -0.0172, -1.7375, -2.6205, -2.5216],
         [-2.1567, -0.7679, -1.7695, -1.3179, -0.8760],
         [ 0.6539,  1.7044,  2.0497,  3.3296,  3.2895],
         [-0.9391, -0.6868, -0.5505, -1.7306, -0.2045],
         [-0.8578, -0.8213, -0.8436, -0.8931, -2.1097],
         [ 0.4616,  2.5622,  2.4585,  1.4942, 

We fit the model by supervised learning with a mean-squared-error (MSE) loss.

In [8]:
model = AutoregressiveLinear(5, 5)
loss_op = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1.)
train_loss = 0.
for epoch in range(500):
    loss = loss_op(model(input), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 0.1874
loss : 0.0000
loss : 0.0000
loss : 0.0000
loss : 0.0000


As training converges, we inspect the model parameters.

In [9]:
list(model.parameters())

[Parameter containing:
 tensor([[ 1.0000,  0.4060, -0.0309, -0.4228, -0.0073],
         [ 1.0000,  1.0000, -0.3395, -0.3059,  0.4422],
         [ 1.0000,  1.0000,  1.0000,  0.3253, -0.0558],
         [ 1.0000,  1.0000,  1.0000,  1.0000,  0.1358],
         [ 1.0000,  1.0000,  1.0000,  1.0000,  1.0000]], requires_grad=True),
 Parameter containing:
 tensor([-1.4105e-08,  2.4496e-10, -2.0742e-07, -1.4842e-07, -5.9154e-08],
        requires_grad=True)]

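On and below the diagonal, the weight matrix indeed converges to 1's, and the bias indeed vanishes. The entries above the diagonal are masked out by `torch.tril`, so they receive no gradient and simply keep their random initial values.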

## Generative Model

We can use autoregressive linear layers to build the autoregressive model. As a generative model, the autoregressive model must provide two functionalities:
- `log_prob(input)` evaluates the log probability of each sample in the batch `input`,
- `sample(batch_size)` generates a batch of `batch_size` samples according to the model probability distribution.

We realize these functionalities in the neural network module `AutoregressiveModel`.

In [10]:
class AutoregressiveModel(nn.Module):
    """ Represent a generative model that can generate samples and provide log probability evaluations.
        
        Args:
        features: size of each sample
        depth: depth of the neural network (in number of linear layers) (default=1)
        nonlinearity: activation function to use (default='ReLU') """
    
    def __init__(self, features, depth=1, nonlinearity='ReLU'):
        super(AutoregressiveModel, self).__init__()
        self.features = features # number of features
        self.layers = nn.ModuleList()
        for i in range(depth):
            if i == 0: # first autoregressive linear layer must have diagonal=-1
                self.layers.append(AutoregressiveLinear(self.features, self.features, diagonal = -1))
            else: # remaining autoregressive linear layers have diagonal=0 (by default)
                self.layers.append(AutoregressiveLinear(self.features, self.features))
            if i == depth-1: # the last layer must be Sigmoid
                self.layers.append(nn.Sigmoid())
            else: # other layers use the specified nonlinearity
                self.layers.append(getattr(nn, nonlinearity)())
    
    def extra_repr(self):
        return '(features): {}'.format(self.features) + super(AutoregressiveModel, self).extra_repr()
    
    def forward(self, input):
        prob = input # prob as a workspace, initialized to input
        for layer in self.layers: # apply layers
            prob = layer(prob)
        return prob # prob holds predicted Bernoulli probability parameters
    
    def log_prob(self, input):
        prob = self(input) # forward pass to get Bernoulli probability parameters
        return torch.sum(dist.Bernoulli(prob).log_prob(input), axis=-1)
    
    def sample(self, batch_size=1):
        with torch.no_grad(): # no gradient for sample generation
            # create a record to host layerwise outputs
            record = torch.zeros(len(self.layers)+1, batch_size, self.features)
            # autoregressive batch sampling
            for i in range(self.features):
                for l, layer in enumerate(self.layers):
                    if isinstance(layer, AutoregressiveLinear): # linear layer
                        record[l+1, :, i] = layer.forward_at(record[l], i)
                    else: # elementwise layer
                        record[l+1, :, i] = layer(record[l, :, i])
                record[0, :, i] = dist.Bernoulli(record[-1, :, i]).sample()
        return record[0]

### Architecture

The conditional probabilities are modeled as Bernoulli distributions, whose probabilities are given by the autoregressive feedforward neural network. To respect the causal structure, the first autoregressive layer must have `diagonal=-1`, so that $y_i$ depends only on $(x_1,\cdots,x_{i-1})$; the later layers can use `diagonal=0` because their inputs already satisfy this constraint. To ensure that the outputs are probabilities (real numbers between 0 and 1), the nonlinear activation of the last layer must be Sigmoid. The internal nonlinear activations can be chosen freely (default: ReLU). The architecture of a depth-3 autoregressive model looks like:

In [11]:
AutoregressiveModel(5, depth=3)

AutoregressiveModel(
  (features): 5
  (layers): ModuleList(
    (0): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=-1)
    (1): ReLU()
    (2): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=0)
    (3): ReLU()
    (4): AutoregressiveLinear(in_features=5, out_features=5, bias=True, diagonal=0)
    (5): Sigmoid()
  )
)
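
Why is it enough for only the first layer to use `diagonal=-1`? One way to reason about it is to track which inputs each output is allowed to see, ignoring the actual weight values. The sketch below does this with boolean masks in plain Python. The helper names are hypothetical, and the element-wise activations are left out because they do not mix features.

```python
def tril_mask(n, diagonal):
    """Boolean mask of the entries torch.tril keeps: column j <= row i + diagonal."""
    return [[j <= i + diagonal for j in range(n)] for i in range(n)]

def compose(outer, inner):
    """Pattern of outer(inner(x)): output i can see input j through some hidden unit k."""
    n = len(outer)
    return [[any(outer[i][k] and inner[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def leaks(dependency):
    """True if some output i can see its own input x_i or a later one."""
    n = len(dependency)
    return any(dependency[i][j] for i in range(n) for j in range(i, n))

n = 5
first, later = tril_mask(n, -1), tril_mask(n, 0)

# Depth-3 stack as printed above: diagonal = -1, 0, 0.
good = compose(later, compose(later, first))
# Hypothetical mistake: every layer uses diagonal = 0.
bad = compose(later, compose(later, later))

print(leaks(good), leaks(bad))  # False True
```

Once the first layer has removed the diagonal, composing with further lower-triangular layers cannot bring it back, whereas a stack without any strictly lower-triangular layer lets $p_i$ see $x_i$.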

### Bernoulli Distribution

The forward pass will calculate the Bernoulli probability parameter: 
$$p_i = f(x_1,\cdots,x_{i-1}).$$ 
The Bernoulli samples are binary (0 or 1), where $x_i=1$ with probability $p_i$ and $x_i=0$ with probability $1-p_i$. The log conditional probability is given by
$$\log p(x_i|x_1,\cdots,x_{i-1})= x_i\log p_i + (1-x_i)\log(1-p_i).$$
The log probability of the autoregressive model is given by the summation
$$\log p(x_1,x_2,\cdots) = \sum_{i=1}^n\log p(x_i|x_1,\cdots,x_{i-1}).$$
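
The two formulas above can be checked by hand in a few lines. In this sketch `toy_conditional` is a hypothetical stand-in for the neural network (any function of the preceding variables that returns a number strictly between 0 and 1 would do). Summing the resulting probabilities over all $2^n$ configurations should give 1, which is the property that makes the autoregressive model a normalized distribution.

```python
import math
from itertools import product

def toy_conditional(prefix):
    """Hypothetical stand-in for the network: p_i as a function of x_1..x_{i-1}."""
    return 1.0 / (1.0 + math.exp(-(0.3 + sum(prefix) - 0.5 * len(prefix))))

def log_prob(x):
    """Sum over i of x_i log p_i + (1 - x_i) log(1 - p_i)."""
    total = 0.0
    for i, x_i in enumerate(x):
        p_i = toy_conditional(x[:i])
        total += x_i * math.log(p_i) + (1 - x_i) * math.log(1 - p_i)
    return total

n = 5
# Total probability over all binary configurations of length n.
normalization = sum(math.exp(log_prob(x)) for x in product([0, 1], repeat=n))
print(log_prob((0, 1, 0, 0, 0)), normalization)
```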

In [12]:
AutoregressiveModel(5, depth=3).log_prob(torch.tensor([0.,1.,0.,0.,0.]))

tensor(-3.6187, grad_fn=<SumBackward1>)

In [13]:
AutoregressiveModel(5, depth=3).sample()

tensor([[0., 0., 1., 1., 0.]])
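
Sampling proceeds one variable at a time: draw $x_1$, then $x_2$ given $x_1$, and so on. The following is a minimal sketch of that loop in plain Python, with a hypothetical `toy_conditional` in place of the network. The conditional is deliberately extreme (a fair coin for the first variable, then copy the previous value) so that the effect of conditioning is easy to see.

```python
import random

def toy_conditional(prefix):
    """Hypothetical conditional: fair coin for x_1, then copy the previous value."""
    if not prefix:
        return 0.5
    return 1.0 if prefix[-1] == 1 else 0.0

def sample(n, rng):
    """Draw x_1, then x_2 given x_1, and so on."""
    x = []
    for _ in range(n):
        p_i = toy_conditional(x)  # needs only the values drawn so far
        x.append(1 if rng.random() < p_i else 0)
    return x

rng = random.Random(0)
samples = [sample(5, rng) for _ in range(8)]
for s in samples:
    print(s)
```

Every sample drawn this way is either all 0 or all 1, because each variable after the first is fully determined by its predecessor.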

### Toy Model: Ferromagnets

First, generate some fake training data.

In [14]:
data = torch.tensor([
    [1., 1., 1., 1., 1.]
])

Create a model. Initially, the generative model samples at random.

In [15]:
model = AutoregressiveModel(5, depth=1)
model.sample(5)

tensor([[0., 1., 1., 1., 1.],
        [1., 0., 1., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1.]])

Train on the dataset using the negative log-likelihood loss.

In [17]:
optimizer = optim.Adam(model.parameters(), lr=0.5)
train_loss = 0.
for epoch in range(500):
    loss = -torch.sum(model.log_prob(data))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 0.0002
loss : 0.0000
loss : 0.0000
loss : 0.0000
loss : 0.0000


The model learns to generate all 1's.

In [18]:
model.sample(5)

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

The trick is to make the bias very positive, so that regardless of the configuration, the last layer always outputs a probability close to 1; hence the sampler is strongly biased towards 1.

In [19]:
list(model.parameters())

[Parameter containing:
 tensor([[-0.3399, -0.1737,  0.1230, -0.2522, -0.1934],
         [ 7.7969, -0.1262,  0.2446, -0.0281, -0.0340],
         [ 6.2119,  5.8219,  0.3396, -0.2230,  0.0357],
         [ 5.4637,  4.9182,  5.5170, -0.1437,  0.1338],
         [ 4.9764,  4.4428,  4.5466,  4.7888, -0.1460]], requires_grad=True),
 Parameter containing:
 tensor([13.7705,  8.0811,  6.0817,  4.8811,  4.7246], requires_grad=True)]
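
To make the bias trick quantitative, the printed parameters can be plugged into the sigmoid by hand. The sketch below copies the (rounded) strictly lower triangle of the weight and the bias from the output above and evaluates the conditional probabilities for two extreme inputs. It assumes the depth-1 model, i.e. $p_i = \sigma(b_i + \sum_{j<i} W_{ij}x_j)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Rounded values copied from the printed parameters; only the strictly lower
# triangle of the weight matters because the layer uses diagonal=-1.
weight_rows = [
    [],
    [7.7969],
    [6.2119, 5.8219],
    [5.4637, 4.9182, 5.5170],
    [4.9764, 4.4428, 4.5466, 4.7888],
]
bias = [13.7705, 8.0811, 6.0817, 4.8811, 4.7246]

def conditionals(x):
    """p_i = sigmoid(b_i + sum_{j<i} W_ij x_j) for a single-layer model."""
    return [sigmoid(b + sum(w * v for w, v in zip(row, x)))
            for row, b in zip(weight_rows, bias)]

p_all_zero = conditionals([0, 0, 0, 0, 0])  # bias alone
p_all_one = conditionals([1, 1, 1, 1, 1])   # bias plus positive weights
print(min(p_all_zero), min(p_all_one))
```

Even when all preceding spins are 0, every conditional probability stays above 0.99, and the positive weights push it further towards 1 once earlier spins are 1.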

Let us try a slightly more challenging dataset, whose samples are either all-1 or all-0, meaning that the spins are perfectly correlated. The neural network must learn this correlation to model the dataset correctly, so the previous bias trick no longer works.

In [20]:
data = torch.tensor([
    [1., 1., 1., 1., 1.],
    [0., 0., 0., 0., 0.]
])
model = AutoregressiveModel(5, depth=1)

Train it.

In [21]:
optimizer = optim.Adam(model.parameters(), lr=1.)
train_loss = 0.
for epoch in range(500):
    loss = -torch.sum(model.log_prob(data))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    if (epoch+1)%100 == 0:
        print('loss : {:.4f}'.format(train_loss / 100))
        train_loss = 0.

loss : 1.5514
loss : 1.3880
loss : 1.3875
loss : 1.3871
loss : 1.3869
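
The loss does not go to zero this time, and the plateau near 1.387 is consistent with a simple lower bound. The loss is the summed negative log-likelihood of the two training samples, and the best a normalized model can do is to assign probability $1/2$ to each, giving $2\log 2$. A short check (the function name `nll` is just for illustration):

```python
import math

def nll(p_all_one, p_all_zero):
    """Summed negative log-likelihood of the two training samples."""
    return -(math.log(p_all_one) + math.log(p_all_zero))

# Best case: all probability mass split evenly between the two samples.
best = nll(0.5, 0.5)
print(round(best, 4))  # 1.3863

# Any uneven split (still summing to 1) gives a larger loss.
print(nll(0.6, 0.4) > best, nll(0.9, 0.1) > best)
```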


The neural network successfully learns how to generate correlated samples as well.

In [24]:
model.sample(5)

tensor([[1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

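The trick is to make the weight matrix very positive in the lower triangle, which mediates the strong correlation among spins, and to make the remaining biases strongly negative, so that each spin defaults to 0 unless the preceding spins are 1. The first bias vanishes to ensure unbiased sampling between all-1 and all-0.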

In [25]:
list(model.parameters())

[Parameter containing:
 tensor([[ 1.8387e-01,  5.0940e-02, -2.4698e-01,  2.2660e-01, -1.6552e-02],
         [ 1.7140e+01,  2.8545e-01,  3.0627e-01, -3.6814e-02,  9.8888e-02],
         [ 9.4619e+00,  9.5402e+00, -4.1717e-01,  3.6980e-01,  3.0317e-02],
         [ 7.5032e+00,  8.0729e+00,  7.5546e+00,  2.7915e-01, -8.6019e-02],
         [ 6.3904e+00,  6.8154e+00,  6.6266e+00,  6.7329e+00,  1.1359e-01]],
        requires_grad=True), Parameter containing:
 tensor([ 4.1628e-07, -8.3664e+00, -9.1003e+00, -1.1455e+01, -1.1585e+01],
        requires_grad=True)]
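
As a cross-check of this reading, the printed parameters can be used to evaluate the probability of the two training configurations by hand. The sketch below copies the (rounded) strictly lower triangle of the weight and the bias from the output above and again assumes the depth-1 form $p_i = \sigma(b_i + \sum_{j<i} W_{ij}x_j)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Strictly lower triangle of the printed weight and the printed bias (rounded).
weight_rows = [
    [],
    [17.140],
    [9.4619, 9.5402],
    [7.5032, 8.0729, 7.5546],
    [6.3904, 6.8154, 6.6266, 6.7329],
]
bias = [4.1628e-07, -8.3664, -9.1003, -11.455, -11.585]

def joint_prob(x):
    """Probability of configuration x under the single-layer Bernoulli model."""
    prob = 1.0
    for i, (row, b) in enumerate(zip(weight_rows, bias)):
        p_i = sigmoid(b + sum(w * v for w, v in zip(row, x)))
        prob *= p_i if x[i] == 1 else 1.0 - p_i
    return prob

p_ones = joint_prob([1, 1, 1, 1, 1])
p_zeros = joint_prob([0, 0, 0, 0, 0])
# Probability mass left over for all the mixed configurations.
print(p_ones, p_zeros, 1.0 - p_ones - p_zeros)
```

Both configurations come out close to probability $1/2$, with only a small remainder spread over the mixed configurations.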