In [None]:
%matplotlib inline


Language Modeling with nn.Transformer and TorchText
===============================================================

This is a tutorial on training a sequence-to-sequence model that uses the
`nn.Transformer <https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html>`__ module.

The PyTorch 1.2 release includes a standard transformer module based on the
paper `Attention is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`__.
Compared to Recurrent Neural Networks (RNNs), the transformer model has proven
to be superior in quality for many sequence-to-sequence tasks while being more
parallelizable. The ``nn.Transformer`` module relies entirely on an attention
mechanism (implemented as
`nn.MultiheadAttention <https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html>`__)
to draw global dependencies between input and output. The ``nn.Transformer``
module is highly modularized such that a single component (e.g.,
`nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__)
can be easily adapted/composed.

![](https://github.com/pytorch/tutorials/blob/gh-pages/_downloads/_static/img/transformer_architecture.jpg?raw=1)





Define the model
----------------




In this tutorial, we train a ``nn.TransformerEncoder`` model on a
language modeling task. The language modeling task is to assign a
probability to a given word (or a sequence of words) following a
sequence of words. A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words (described after the model code below). The
``nn.TransformerEncoder`` consists of multiple layers of
`nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
Along with the input sequence, a square attention mask is required because the
self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
to earlier positions in the sequence. For the language modeling task, any
tokens at future positions must be masked. To produce a score for each
output word, the output of the ``nn.TransformerEncoder`` model is passed
through a linear layer. The resulting logits are turned into a probability
distribution by the log-softmax that ``CrossEntropyLoss`` applies internally
during training.
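
As a minimal sketch of the masking described above, the snippet below builds the square mask for a sequence of length 3. The helper name ``causal_mask`` is hypothetical and only mirrors the ``generate_square_subsequent_mask`` function defined in the model code.

```python
import torch

def causal_mask(sz: int) -> torch.Tensor:
    # -inf strictly above the diagonal blocks attention to future positions;
    # zeros on and below the diagonal leave earlier positions visible
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

mask = causal_mask(3)
print(mask)
# Row i is added to the attention scores of position i, so position 0
# sees only itself while position 2 sees positions 0, 1 and 2.
```

The mask is added to the attention scores before the softmax, so an entry of ``-inf`` results in an attention weight of zero.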




In [1]:
import math
from typing import Tuple

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor) -> Tensor:
        """
        Args:
            src: Tensor, shape [seq_len, batch_size]
            src_mask: Tensor, shape [seq_len, seq_len]

        Returns:
            output Tensor of shape [seq_len, batch_size, ntoken]
        """
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output


def generate_square_subsequent_mask(sz: int) -> Tensor:
    """Generates an upper-triangular matrix of -inf, with zeros on diag."""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)

The ``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.
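
The following sketch computes the sinusoidal table for a deliberately tiny configuration, so the alternating sine and cosine columns can be inspected by hand. The sizes ``d_model = 4`` and ``max_len = 3`` are assumptions chosen for illustration only.

```python
import math
import torch

d_model, max_len = 4, 3  # illustrative sizes, not the tutorial's settings
position = torch.arange(max_len).unsqueeze(1)  # shape [max_len, 1]
# one frequency per pair of embedding dimensions
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine

print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Because the table has the same last dimension as the embeddings, it can be added to them directly.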




In [2]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim]
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

Load and batch data
-------------------




This tutorial uses ``torchtext`` to load the Wikitext-2 dataset. The
vocab object is built from the training set and is used to convert
tokens into integer indices stored in tensors. Wikitext-2 represents rare tokens as ``<unk>``.

Given a 1-D vector of sequential data, ``batchify()`` arranges the data
into ``batch_size`` columns. If the data does not divide evenly into
``batch_size`` columns, then the data is trimmed to fit. For instance, with
the alphabet as the data (total length of 26) and ``batch_size=4``, we would
divide the alphabet into 4 sequences of length 6, dropping the last two
letters (``Y`` and ``Z``):

\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}

Batching enables more parallelizable processing. However, batching means that
the model treats each column independently; for example, the dependence of
``G`` and ``F`` cannot be learned in the example above.
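
The alphabet example can be reproduced with a small sketch. The function ``batchify_demo`` is a hypothetical CPU-only copy of the batching logic, and the letters are encoded as the integers 0 to 25 for illustration.

```python
import torch

def batchify_demo(data: torch.Tensor, bsz: int) -> torch.Tensor:
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]  # trim what does not fit (here Y and Z)
    return data.view(bsz, seq_len).t().contiguous()

alphabet = torch.arange(26)  # 0 -> A, 1 -> B, ..., 25 -> Z
batches = batchify_demo(alphabet, 4)

# decode each column back to letters to compare with the matrix above
columns = ["".join(chr(ord("A") + int(v)) for v in col) for col in batches.t()]
print(batches.shape)  # torch.Size([6, 4])
print(columns)        # ['ABCDEF', 'GHIJKL', 'MNOPQR', 'STUVWX']
```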




In [3]:
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>']) 

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# train_iter was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into bsz separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Args:
        data: Tensor, shape [N]
        bsz: int, batch size

    Returns:
        Tensor of shape [N // bsz, bsz]
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape [seq_len, batch_size]
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

100%|██████████| 4.48M/4.48M [00:00<00:00, 6.84MB/s]


Functions to generate input and target sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




``get_batch()`` generates a pair of input-target sequences for
the transformer model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the target is the input
shifted by one position, so that each token is paired with the token that
follows it. For example, with a ``bptt`` value of 2,
we would get the following two tensors for ``i`` = 0:

![](https://github.com/pytorch/tutorials/blob/gh-pages/_downloads/_static/img/transformer_input_target.png?raw=1)


Note that the chunks are taken along dimension 0, consistent
with the sequence dimension ``S`` in the Transformer model. The batch dimension
``N`` is along dimension 1.
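
A small sketch of the input-target pairing, assuming a toy source tensor of shape ``[6, 4]`` and a chunk length of 2. The names ``get_batch_demo`` and ``bptt_demo`` are hypothetical stand-ins for the tutorial's ``get_batch`` and ``bptt``.

```python
import torch

bptt_demo = 2
# toy data: 4 columns (batch dimension) of 6 consecutive integers each
source = torch.arange(24).view(4, 6).t().contiguous()  # shape [6, 4]

def get_batch_demo(source: torch.Tensor, i: int):
    seq_len = min(bptt_demo, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # targets are the same rows shifted down by one, flattened for the loss
    target = source[i + 1:i + 1 + seq_len].reshape(-1)
    return data, target

data, target = get_batch_demo(source, 0)
print(data)    # rows 0 and 1 of source
print(target)  # rows 1 and 2 of source, flattened
```

Each entry of ``target`` is the token that follows the corresponding entry of ``data`` in the same column.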




In [4]:
bptt = 35
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    """
    Args:
        source: Tensor, shape [full_seq_len, batch_size]
        i: int

    Returns:
        tuple (data, target), where data has shape [seq_len, batch_size] and
        target has shape [seq_len * batch_size]
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target

Initiate an instance
--------------------




The model hyperparameters are defined below. The vocab size is
equal to the length of the vocab object.




In [5]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

Run the model
-------------




We use `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__
with the `SGD <https://pytorch.org/docs/stable/generated/torch.optim.SGD.html>`__
(stochastic gradient descent) optimizer. The learning rate is initially set to
5.0 and follows a `StepLR <https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html>`__
schedule that multiplies it by 0.95 after every epoch. During training, we use `nn.utils.clip_grad_norm\_ <https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html>`__
to clip the gradient norm to 0.5 and prevent gradients from exploding.
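
To see what this schedule does to the learning rate, the sketch below steps a ``StepLR`` scheduler three times on a dummy parameter. The parameter ``param`` and the names ``opt`` and ``sched`` are assumptions for illustration and are not part of the tutorial's model.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter
opt = torch.optim.SGD([param], lr=5.0)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.95)

lrs = []
for epoch in range(3):
    lrs.append(sched.get_last_lr()[0])
    opt.step()    # the optimizer should step before the scheduler
    sched.step()  # multiplies the learning rate by gamma

print(lrs)  # approximately 5.0, 4.75, 4.5125
```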




In [6]:
import copy
import time

criterion = nn.CrossEntropyLoss()
lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    num_batches = len(train_data) // bptt
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        seq_len = data.size(0)  # dimension 0 is the sequence length
        if seq_len != bptt:  # only on last batch
            src_mask = src_mask[:seq_len, :seq_len]
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    src_mask = generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(0)  # dimension 0 is the sequence length
            if seq_len != bptt:
                src_mask = src_mask[:seq_len, :seq_len]
            output = model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += seq_len * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. Adjust the learning rate after each epoch.



In [7]:
best_val_loss = float('inf')
epochs = 3
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train(model)
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()

| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 38.67 | loss  8.15 | ppl  3474.92
| epoch   1 |   400/ 2928 batches | lr 5.00 | ms/batch 37.57 | loss  6.88 | ppl   968.98
| epoch   1 |   600/ 2928 batches | lr 5.00 | ms/batch 37.39 | loss  6.44 | ppl   627.44
| epoch   1 |   800/ 2928 batches | lr 5.00 | ms/batch 37.37 | loss  6.31 | ppl   549.67
| epoch   1 |  1000/ 2928 batches | lr 5.00 | ms/batch 37.35 | loss  6.19 | ppl   486.77
| epoch   1 |  1200/ 2928 batches | lr 5.00 | ms/batch 37.31 | loss  6.16 | ppl   472.42
| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch 37.30 | loss  6.12 | ppl   452.90
| epoch   1 |  1600/ 2928 batches | lr 5.00 | ms/batch 37.16 | loss  6.11 | ppl   450.33
| epoch   1 |  1800/ 2928 batches | lr 5.00 | ms/batch 37.14 | loss  6.02 | ppl   411.46
| epoch   1 |  2000/ 2928 batches | lr 5.00 | ms/batch 37.06 | loss  6.02 | ppl   410.97
| epoch   1 |  2200/ 2928 batches | lr 5.00 | ms/batch 37.18 | loss  5.90 | ppl   363.84
| epoch   1 |  2400/ 

Evaluate the best model on the test dataset
-------------------------------------------




In [8]:
test_loss = evaluate(best_model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

| End of training | test loss  5.49 | test ppl   242.02


# Train with SPSA

In [64]:

class SPSA:
    """
    An optimizer class that implements Simultaneous Perturbation Stochastic
    Approximation (SPSA), which estimates the gradient from two loss
    evaluations instead of backpropagation.
    """
    def __init__(self, a, c, A, alpha, gamma):
        # Gain parameters and decay exponents:
        #   a_t = a / (t + 1 + A)**alpha   (step size)
        #   c_t = c / (t + 1)**gamma       (perturbation size)
        self.a = a
        self.c = c
        self.A = A
        self.alpha = alpha
        self.gamma = gamma
        # iteration counter
        self.t = 0

    def step(self, model, inputs, label, src_mask, opt_targets=None):
        """
        Performs one SPSA update on the model in place.

        :param model: model whose parameters are updated
        :param inputs: input tokens, shape [seq_len, batch_size]
        :param label: flattened targets, shape [seq_len * batch_size]
        :param src_mask: attention mask, shape [seq_len, seq_len]
        :param opt_targets: state_dict keys to update; defaults to the
            embedding and output-layer parameters
        """
        if opt_targets is None:
            opt_targets = ["encoder.weight", "decoder.weight", "decoder.bias"]
        # current values of the gain sequences
        a_t = self.a / (self.t + 1 + self.A)**self.alpha
        c_t = self.c / (self.t + 1)**self.gamma

        # Draw the perturbation from a symmetric Bernoulli (+1/-1) distribution.
        # The distribution must be symmetric around zero and keep its mass away
        # from zero, because the gradient estimate divides by each entry.
        # Normal and uniform distributions put mass near zero and are unsuitable.
        current_param = copy.deepcopy(model.state_dict())
        current_param_plus = current_param.copy()
        current_param_minus = current_param.copy()
        delta_dict = {}

        for key, param in current_param.items():
            if key not in opt_targets:
                continue  # untouched entries keep their current values
            # create the perturbation on the same device and dtype as the parameter
            delta = torch.randint(0, 2, param.size(), device=param.device).to(param.dtype) * 2 - 1
            current_param_plus[key] = param + delta * c_t
            current_param_minus[key] = param - delta * c_t
            delta_dict[key] = delta

        # Measure the loss at both perturbed points (uses the global criterion
        # and ntokens). No autograd graph is needed. Note that in train mode
        # dropout adds extra noise to the difference of the two losses.
        with torch.no_grad():
            model.load_state_dict(current_param_plus)
            predictions = model(inputs, src_mask)
            loss_plus = criterion(predictions.view(-1, ntokens), label)

            model.load_state_dict(current_param_minus)
            predictions = model(inputs, src_mask)
            loss_minus = criterion(predictions.view(-1, ntokens), label)

        for key, delta in delta_dict.items():
            # estimate of the gradient for this parameter tensor
            g_t = (loss_plus - loss_minus) / (2.0 * delta * c_t)
            # update the estimate of the parameter
            current_param[key] = current_param[key] - a_t * g_t

        # increment the counter
        self.t += 1

        model.load_state_dict(current_param)
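
To illustrate the update rule in isolation, here is a minimal sketch of SPSA on a three-dimensional quadratic. The objective ``loss_fn``, the starting point and the gain values are assumptions chosen for illustration; they are not tuned for the transformer above.

```python
import torch

torch.manual_seed(0)
target = torch.tensor([1.0, -2.0, 3.0])

def loss_fn(theta: torch.Tensor) -> torch.Tensor:
    # convex toy objective with its minimum at `target`
    return ((theta - target) ** 2).sum()

theta = torch.zeros(3)
a, c, A, alpha, gamma = 0.2, 0.1, 10.0, 0.602, 0.101  # illustrative gains
initial_loss = loss_fn(theta).item()

for t in range(500):
    a_t = a / (t + 1 + A) ** alpha  # step size
    c_t = c / (t + 1) ** gamma      # perturbation size
    # symmetric Bernoulli perturbation with entries +1 or -1
    delta = torch.randint(0, 2, theta.shape).to(theta.dtype) * 2 - 1
    # two loss evaluations give an estimate of the whole gradient vector
    diff = loss_fn(theta + c_t * delta) - loss_fn(theta - c_t * delta)
    g_hat = diff / (2.0 * c_t * delta)
    theta = theta - a_t * g_hat

final_loss = loss_fn(theta).item()
print(initial_loss, final_loss)
```

Only two loss evaluations are needed per step regardless of the number of parameters, but the estimate becomes noisier as the dimension grows.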
   

In [59]:
list(model.state_dict().keys())

['pos_encoder.pe',
 'transformer_encoder.layers.0.self_attn.in_proj_weight',
 'transformer_encoder.layers.0.self_attn.in_proj_bias',
 'transformer_encoder.layers.0.self_attn.out_proj.weight',
 'transformer_encoder.layers.0.self_attn.out_proj.bias',
 'transformer_encoder.layers.0.linear1.weight',
 'transformer_encoder.layers.0.linear1.bias',
 'transformer_encoder.layers.0.linear2.weight',
 'transformer_encoder.layers.0.linear2.bias',
 'transformer_encoder.layers.0.norm1.weight',
 'transformer_encoder.layers.0.norm1.bias',
 'transformer_encoder.layers.0.norm2.weight',
 'transformer_encoder.layers.0.norm2.bias',
 'transformer_encoder.layers.1.self_attn.in_proj_weight',
 'transformer_encoder.layers.1.self_attn.in_proj_bias',
 'transformer_encoder.layers.1.self_attn.out_proj.weight',
 'transformer_encoder.layers.1.self_attn.out_proj.bias',
 'transformer_encoder.layers.1.linear1.weight',
 'transformer_encoder.layers.1.linear1.bias',
 'transformer_encoder.layers.1.linear2.weight',
 'transform

In [65]:
max_iter = 10000
optimizer_spsa = SPSA(a=9e-1, c=1.0, A=max_iter/10, alpha=0.6, gamma=0.1)

def train_spsa(model: nn.Module, opt_targets=None) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    num_batches = len(train_data) // bptt
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        seq_len = data.size(0)  # dimension 0 is the sequence length
        if seq_len != bptt:  # only on last batch
            src_mask = src_mask[:seq_len, :seq_len]
        # SPSA is gradient-free: this forward pass only provides the loss for
        # logging, so no backward pass or gradient clipping is needed
        with torch.no_grad():
            output = model(data, src_mask)
            loss = criterion(output.view(-1, ntokens), targets)

        optimizer_spsa.step(model, data, targets, src_mask, opt_targets)
        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            # logged for reference only: this is the SGD scheduler's learning
            # rate, SPSA uses its own gain sequence a_t
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

In [66]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

In [58]:
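# Workaround: make tensors created without an explicit device default to the GPU
if device.type == 'cuda':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')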


best_val_loss = float('inf')
epochs = 3
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_spsa(model)
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()

| epoch   1 |   200/ 2928 batches | lr 4.29 | ms/batch 75.41 | loss 55.62 | ppl 1426428548985125720817664.00
| epoch   1 |   400/ 2928 batches | lr 4.29 | ms/batch 74.53 | loss 54.90 | ppl 693116983013783243849728.00
| epoch   1 |   600/ 2928 batches | lr 4.29 | ms/batch 74.40 | loss 54.54 | ppl 485031183358012902342656.00
| epoch   1 |   800/ 2928 batches | lr 4.29 | ms/batch 74.41 | loss 54.09 | ppl 310948258195586903179264.00
| epoch   1 |  1000/ 2928 batches | lr 4.29 | ms/batch 74.27 | loss 53.45 | ppl 163842425800033216495616.00
| epoch   1 |  1200/ 2928 batches | lr 4.29 | ms/batch 74.31 | loss 52.93 | ppl 96673308580555717083136.00
| epoch   1 |  1400/ 2928 batches | lr 4.29 | ms/batch 74.23 | loss 52.44 | ppl 59240248544478567071744.00
| epoch   1 |  1600/ 2928 batches | lr 4.29 | ms/batch 74.52 | loss 52.14 | ppl 44014839079273209266176.00
| epoch   1 |  1800/ 2928 batches | lr 4.29 | ms/batch 74.53 | loss 51.66 | ppl 27232216837862413303808.00
| epoch   1 |  2000/ 2928 batch

In [63]:
test_loss = evaluate(best_model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

| End of training | test loss 41.48 | test ppl 1030298606960375936.00


In [71]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

In [72]:

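# Update only the embedding and output-layer parameters with SPSA

# Workaround: make tensors created without an explicit device default to the GPU
if device.type == 'cuda':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')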


best_val_loss = float('inf')
epochs = 3
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_spsa(model, opt_targets=["encoder.weight","decoder.weight","decoder.bias"])
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()

| epoch   1 |   200/ 2928 batches | lr 3.15 | ms/batch 75.66 | loss 10.62 | ppl 41127.01
| epoch   1 |   400/ 2928 batches | lr 3.15 | ms/batch 74.81 | loss 10.46 | ppl 35017.76
| epoch   1 |   600/ 2928 batches | lr 3.15 | ms/batch 74.57 | loss 10.35 | ppl 31397.23
| epoch   1 |   800/ 2928 batches | lr 3.15 | ms/batch 74.62 | loss 10.25 | ppl 28211.79
| epoch   1 |  1000/ 2928 batches | lr 3.15 | ms/batch 74.57 | loss 10.16 | ppl 25953.29
| epoch   1 |  1200/ 2928 batches | lr 3.15 | ms/batch 74.63 | loss 10.08 | ppl 23933.00
| epoch   1 |  1400/ 2928 batches | lr 3.15 | ms/batch 74.57 | loss 10.00 | ppl 22076.41
| epoch   1 |  1600/ 2928 batches | lr 3.15 | ms/batch 74.58 | loss  9.93 | ppl 20579.23
| epoch   1 |  1800/ 2928 batches | lr 3.15 | ms/batch 74.49 | loss  9.87 | ppl 19259.88
| epoch   1 |  2000/ 2928 batches | lr 3.15 | ms/batch 74.56 | loss  9.81 | ppl 18305.83
| epoch   1 |  2200/ 2928 batches | lr 3.15 | ms/batch 74.64 | loss  9.76 | ppl 17307.55
| epoch   1 |  2400/ 

In [73]:

!nvidia-smi 

Sun Jan  9 12:02:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    76W / 149W |   2929MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [68]:
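# Update all parameters with SPSA

# Workaround: make tensors created without an explicit device default to the GPU
if device.type == 'cuda':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')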


best_val_loss = float('inf')
epochs = 3
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_spsa(model, opt_targets=model.state_dict().keys())
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()

| epoch   1 |   200/ 2928 batches | lr 3.68 | ms/batch 76.42 | loss 24.15 | ppl 30739406971.32
| epoch   1 |   400/ 2928 batches | lr 3.68 | ms/batch 75.74 | loss 40.60 | ppl 429427161454926400.00
| epoch   1 |   600/ 2928 batches | lr 3.68 | ms/batch 75.56 | loss 62.54 | ppl 1453516692039705632591314944.00
| epoch   1 |   800/ 2928 batches | lr 3.68 | ms/batch 75.34 | loss 93.71 | ppl 49750002074957878760870046144511246073856.00
| epoch   1 |  1000/ 2928 batches | lr 3.68 | ms/batch 74.62 | loss 152.81 | ppl 2308629779451285252035598957565373098143028052492586007224955961344.00
| epoch   1 |  1200/ 2928 batches | lr 3.68 | ms/batch 74.73 | loss 237.81 | ppl 19037977400566979990645582154775349097392989533864816480179853861807259855279749924182652722933842051072.00
| epoch   1 |  1400/ 2928 batches | lr 3.68 | ms/batch 74.76 | loss 373.69 | ppl 195875723985782405232730344646608793701597798143866421832054301325027322439916395337510995405130623704791427251071706919962127504208612073406175

OverflowError: ignored
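
The ``OverflowError`` above is most likely raised by ``math.exp`` in the logging code: once the diverging loss exceeds about 709, its exponential no longer fits into a Python float. A minimal sketch of a guard, assuming a hypothetical helper named ``safe_perplexity`` that is not part of the notebook:

```python
import math

def safe_perplexity(loss: float) -> float:
    """Returns exp(loss), or infinity when the result would overflow."""
    try:
        return math.exp(loss)
    except OverflowError:
        return float('inf')

print(safe_perplexity(5.49))    # a perplexity in the usual range
print(safe_perplexity(1000.0))  # inf instead of an exception
```

Such a guard only keeps the log readable. The steadily growing loss itself indicates that the SPSA run is diverging.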

# Debug

In [None]:

class SPSA:
    """
    Debug variant of the optimizer class that implements Simultaneous
    Perturbation Stochastic Approximation (SPSA).
    """
    def __init__(self, a, c, A, alpha, gamma, loss_function):
        # Initialize gain parameters and decay factors
        self.a = a
        self.c = c
        self.A = A
        self.alpha = alpha
        self.gamma = gamma
        # note: stored but not used below; the global criterion is applied instead
        self.loss_function = loss_function

        # iteration counter
        self.t = 0

    def step(self, model, inputs, label, src_mask, opt_targets):
        """
        Performs one SPSA update on the model in place.

        :param model: model whose parameters are updated
        :param inputs: input tokens, shape [seq_len, batch_size]
        :param label: flattened targets, shape [seq_len * batch_size]
        :param src_mask: attention mask, shape [seq_len, seq_len]
        :param opt_targets: ignored in this debug variant (overridden below)
        """
        # note: the opt_targets argument is overridden here for debugging
        opt_targets = ["encoder.weight", "decoder.weight", "decoder.bias"]
        # get the current values for gain sequences
        a_t = self.a / (self.t + 1 + self.A)**self.alpha
        c_t = self.c / (self.t + 1)**self.gamma

        # Draw the perturbation from a symmetric Bernoulli (+1/-1) distribution.
        # The distribution must be symmetric around zero and keep its mass away
        # from zero, because the gradient estimate divides by each entry.
        # Normal and uniform distributions put mass near zero and are unsuitable.
        current_param = copy.deepcopy(model.state_dict())
        current_param_plus = current_param.copy()
        current_param_minus = current_param.copy()
        delta_dict = {}

        for key, param in current_param.items():
            if key not in opt_targets:
                continue  # untouched entries keep their current values
            # create the perturbation on the same device and dtype as the parameter
            delta = torch.randint(0, 2, param.size(), device=param.device).to(param.dtype) * 2 - 1
            current_param_plus[key] = param + delta * c_t
            current_param_minus[key] = param - delta * c_t
            delta_dict[key] = delta

        # measure the loss function at both perturbations; no autograd graph is needed
        with torch.no_grad():
            # calculate loss_plus
            model.load_state_dict(current_param_plus)
            predictions = model(inputs, src_mask)
            loss_plus = criterion(predictions.view(-1, ntokens), label)

            # calculate loss_minus
            model.load_state_dict(current_param_minus)
            predictions = model(inputs, src_mask)
            loss_minus = criterion(predictions.view(-1, ntokens), label)

        for key, delta in delta_dict.items():
            # compute the estimate of the gradient
            g_t = (loss_plus - loss_minus) / (2.0 * delta * c_t)
            # update the estimate of the parameter
            current_param[key] = current_param[key] - a_t * g_t

        # increment the counter
        self.t += 1

        model.load_state_dict(current_param)