In [14]:
import torch
import torch.nn as nn

### A (very small) introduction to pytorch

Pytorch Tensors are very similar to Numpy arrays, with the added benefit of being usable on GPU. For a short tutorial on various methods to create tensors of particular types, see [this link](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py).
The important things to note are that Tensors can be created empty, from lists, and it is very easy to convert a numpy array into a pytorch tensor, and inversely.

In [15]:
a = torch.LongTensor(5)
b = torch.LongTensor([5])

print(a)
print(b)

tensor([5.])

In [16]:
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)

tensor([                  0, 7089621706657183540, 7089007972926448695,
        3688558274541610034, 7077183838608451174])
tensor([5])


The main interest in us using Pytorch is the ```autograd``` package. ```torch.Tensor```objects have an attribute ```.requires_grad```; if set as True, it starts to track all operations on it. When you finish your computation, can call ```.backward()``` and all the gradients are computed automatically (and stored in the ```.grad``` attribute).

One way to easily cut a tensor from the computational once it is not needed anymore is to use ```.detach()```.
More info on automatic differentiation in pytorch on [this link](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py).

In [13]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2 
print(w.grad)    # w.grad = 1 
print(b.grad)    # b.grad = 1 

tensor(2.)
tensor(1.)
tensor(1.)


In [28]:
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)
for name, p in linear.named_parameters():
    print(name)
    print(p)

# Build loss function - Mean Square Error
criterion = nn.MSELoss()

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('Initial loss: ', loss.item())

# Backward pass.
loss.backward()

# Print out the gradients.
print ('dL/dw: ', linear.weight.grad) 
print ('dL/db: ', linear.bias.grad)

weight
Parameter containing:
tensor([[-0.1583, -0.1521,  0.5613],
        [ 0.1849, -0.5349, -0.3317]], requires_grad=True)
bias
Parameter containing:
tensor([-0.0956, -0.5487], requires_grad=True)
Initial loss:  1.9882283210754395
dL/dw:  tensor([[ 0.0466, -0.2819,  1.0870],
        [ 0.3121,  0.1475, -0.2172]])
dL/db:  tensor([-0.1086, -0.5911])


In [29]:
# You can perform gradient descent manually, with an in-place update ...
linear.weight.data.sub_(0.01 * linear.weight.grad.data)
linear.bias.data.sub_(0.01 * linear.bias.grad.data)

# Print out the loss after 1-step gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after one update: ', loss.item())

Loss after one update:  1.9704128503799438


In [30]:
# Use the optim package to define an Optimizer that will update the weights of the model.
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

# By default, gradients are accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Before the backward pass, we need to use the optimizer object to zero all of the
# gradients.
optimizer.zero_grad()
loss.backward()

# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()

# Print out the loss after the second step of gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after two updates: ', loss.item())

Loss after two updates:  1.9529603719711304


### Tools for data processing 

In [107]:
import os
import time
import math
from collections import Counter
import pprint
pp = pprint.PrettyPrinter(indent=1)

We create a ```Dictionary``` class, that we are going to use to create a vocabulary for our text data. The goal here is to have a convenient tool, with easy access to any information we could need:
- A python dictionary ```word2idx``` allowing easy transformation of tokenized text into indexes
- A list ```idx2word```, allowing us to find the word corresponding to an index (for interpretation and generation)
- A python dictionary ```counter``` used to build the vocabulary, that can provide us with frequency information if needed. 
- The ```total``` count of words in the dictionary.

Important: The data that we are going to use are already pre-processed so we don't need to create special tokens and control the size of the vocabulary ourselves. However, when the text data is raw, methods to preprocess it conveniently should be added here. 

In [45]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.counter = {}
        self.total = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
            self.counter.setdefault(word, 0)
        self.counter[word] += 1
        self.total += 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

In [46]:
with open('./wikitext-2/train.txt', 'r') as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    print(f.readline())

 

 = Valkyria Chronicles III = 

 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 



In [47]:
# Let's take the four first lines of our training data:
corpus = ''
with open('./wikitext-2/train.txt', 'r') as f:
    for i in range(4):
        corpus += f.readline()
        
# Create an empty Dictionary, separate and add all words. 
dictio = Dictionary()
words = corpus.split()
for word in words:
    dictio.add_word(word)

# Take a look at the objects created:
pp.pprint(dictio.word2idx)
pp.pprint(dictio.idx2word)
pp.pprint(dictio.counter)
pp.pprint(dictio.total)

{'"': 60,
 '(': 9,
 ')': 18,
 ',': 12,
 '.': 14,
 '2011': 44,
 '3': 6,
 ':': 7,
 '<unk>': 8,
 '=': 0,
 '@-@': 29,
 'Battlefield': 17,
 'Chronicles': 2,
 'Europan': 70,
 'Gallia': 67,
 'III': 3,
 'Imperial': 80,
 'January': 43,
 'Japan': 24,
 'Japanese': 10,
 'Media.Vision': 37,
 'Nameless': 61,
 'PlayStation': 39,
 'Portable': 40,
 'Raven': 81,
 'Released': 41,
 'Second': 69,
 'Sega': 35,
 'Senjō': 4,
 'Valkyria': 1,
 'War': 71,
 'a': 26,
 'against': 79,
 'and': 36,
 'are': 77,
 'as': 22,
 'black': 75,
 'by': 34,
 'commonly': 19,
 'developed': 33,
 'during': 68,
 'first': 58,
 'follows': 59,
 'for': 38,
 'fusion': 49,
 'game': 32,
 'gameplay': 52,
 'in': 42,
 'is': 25,
 'it': 45,
 'its': 53,
 'lit': 13,
 'military': 63,
 'nation': 66,
 'no': 5,
 'of': 15,
 'operations': 76,
 'outside': 23,
 'parallel': 57,
 'penal': 62,
 'perform': 73,
 'pitted': 78,
 'playing': 30,
 'predecessors': 54,
 'real': 50,
 'referred': 20,
 'role': 28,
 'runs': 56,
 'same': 48,
 'secret': 74,
 'series': 47,
 

In [56]:
class Corpus(object):
    def __init__(self, path):
        # We create an object Dictionary associated to Corpus
        self.dictionary = Dictionary()
        # We go through all files, adding all words to the dictionary
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))
        
    def tokenize(self, path):
        """Tokenizes a text file, knowing the dictionary, in order to tranform it into a list of indexes"""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)
                tokens += len(words)
        
        # Once done, go through the file a second time and fill a Torch Tensor with the associated indexes 
        with open(path, 'r') as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1
        return ids

In [57]:
###############################################################################
# Load data
###############################################################################

data = './wikitext-2-small/'
corpus = Corpus(data)

In [58]:
print(corpus.dictionary.total)
print(len(corpus.dictionary.idx2word))
print(len(corpus.dictionary.word2idx))

print(corpus.train.shape)
print(corpus.train[0:7])
print([corpus.dictionary.idx2word[corpus.train[i]] for i in range(7)])

print(corpus.valid.shape)
print(corpus.valid[0:7])
print([corpus.dictionary.idx2word[corpus.valid[i]] for i in range(7)])

383196
19482
19482
torch.Size([275485])
tensor([0, 1, 2, 3, 4, 1, 0])
['<eos>', '=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>']
torch.Size([47945])
tensor([    0,     1, 17642, 17643,     1,     0,     0])
['<eos>', '=', 'Homarus', 'gammarus', '=', '<eos>', '<eos>']


In [59]:
# We now have data under a very long list of indexes: the text is as one sequence.
# The idea now is to create batches from this. Note that this is absolutely not the best
# way to proceed with large quantities of data (where we'll try not to store huge tensors
# in memory but read them from file as we go) !
# Here, we are looking for simplicity and efficiency with regards to computation time.
# That is why we will ignore sentence separations and treat the data as one long stream that
# we will cut arbitrarily as we need.
# With the alphabet being our data, we currently have the sequence:
# [a b c d e f g h i j k l m n o p q r s t u v w x y z]
# We want to reorganize it as independant batches that will be processed independantly by the model !
# For instance, with the alphabet as the sequence and batch size 4, we'd get the 4 following sequences:
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘
# with the last two elements being lost.
# Again, these columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient processing.

def batchify(data, batch_size, cuda = False):
    # Cut the elements that are unnecessary
    nbatch = data.size(0) // batch_size
    data = data.narrow(0, 0, nbatch * batch_size)
    # Reorganize the data
    data = data.view(batch_size, -1).t().contiguous()
    # If we can use a GPU, let's tranfer the tensor to it
    if cuda:
        data = data.cuda()
    return data

# get_batch subdivides the source data into chunks of the appropriate length.
# If source is equal to the example output of the batchify function, with
# a sequence length (seq_len) of 3, we'd get the following two variables:
# ┌ a g m s ┐ ┌ b h n t ┐
# | b h n t | | c i o u │
# └ c i o u ┘ └ d j p v ┘
# The first variable contains the letters input to the network, while the second
# contains the one we want the network to predict (b for a, h for g, v for u, etc..)
# Note that despite the name of the function, we are cutting the data in the
# temporal dimension, since we already divided data into batches in the previous
# function. 

def get_batch(source, i, seq_len, evaluation=False):
    # Deal with the possibility that there's not enough data left for a full sequence
    seq_len = min(seq_len, len(source) - 1 - i)
    # Take the input data
    data = source[i:i+seq_len]
    # Shift by one for the target data
    target = source[i+1:i+1+seq_len]
    return data, target

In [110]:
batch_size = 100
eval_batch_size = 4
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

print(train_data.shape)
print(val_data.shape)

torch.Size([2754, 100])
torch.Size([11986, 4])


In [62]:
input_words, target_words = get_batch(val_data, 0, 3)
pp.pprint(input_words)
pp.pprint(target_words)
input_words, target_words = get_batch(val_data, 3, 3)
pp.pprint(input_words)
pp.pprint(target_words)

tensor([[    0,    10,    15,    91],
        [    1,  3018,   735,    13],
        [17642,   187,   766,   496]])
tensor([[    1,  3018,   735,    13],
        [17642,   187,   766,   496],
        [17643,   827,   751,   131]])
tensor([[17643,   827,   751,   131],
        [    1,    19,  4659,  2200],
        [    0,    17,  2466,    22]])
tensor([[   1,   19, 4659, 2200],
        [   0,   17, 2466,   22],
        [   0, 3069,   39, 5521]])


### LSTM Cells in pytorch

In [67]:
# Create a toy example of LSTM: 
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# LSTMs expect inputs having 3 dimensions:
# - The first dimension is the temporal dimension, along which we (in our case) have the different words
# - The second dimension is the batch dimension, along which we stack the independant batches
# - The third dimension is the feature dimension, along which are the features of the vector representing the words

# In our toy case, we have inputs and outputs containing 3 features (third dimension !)
# We created a sequence of 5 different inputs (first dimension !)
# We don't use batch (the second dimension will have one lement)

# We need an initial hidden state, of the right sizes for dimension 2/3, but with only one temporal element:
# Here, it is:
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
# Why do we create a tuple of two tensors ? Because we use LSTMs: remember that they use two sets of weights,
# and two hidden states (Hidden state, and Cell state).
# If you don't remember, read: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
# If we used a classic RNN, we would simply have:
# hidden = torch.randn(1, 1, 3)

# The naive way of applying a lstm to inputs is to apply it one step at a time, and loop through the sequence
for i in inputs:
    # After each step, hidden contains the hidden states (remember, it's a tuple of two states).
    out, hidden = lstm(i.view(1, 1, -1), hidden)
    
# Alternatively, we can do the entire sequence all at once.
# The first value returned by LSTM is all of the Hidden states throughout the sequence.
# The second is just the most recent Hidden state and Cell state (you can compare the values)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence, for each temporal step
# "hidden" will allow you to continue the sequence and backpropagate later, with another sequence
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # Re-initialize
out, hidden = lstm(inputs, hidden)
pp.pprint(out)
pp.pprint(hidden)

tensor([[[-0.0359, -0.0586,  0.2503]],

        [[ 0.0035, -0.1231,  0.0139]],

        [[ 0.1834, -0.0816, -0.1416]],

        [[-0.0704, -0.3577, -0.2444]],

        [[-0.0023, -0.3206, -0.1395]]], grad_fn=<StackBackward>)
(tensor([[[-0.0023, -0.3206, -0.1395]]], grad_fn=<StackBackward>),
 tensor([[[-0.0048, -0.5989, -0.3442]]], grad_fn=<StackBackward>))


### Creating our own LSTM Model

In [76]:
# Models are usually implemented as custom nn.Module subclass
# We need to redefine the __init__ method, which creates the object
# We also need to redefine the forward method, which transform the input into outputs
# We can also add any method that we need: here, in order to initiate weights in the model

class LSTMModel(nn.Module):
    def __init__(self, ntoken, ninp, nhid, nlayers, dropout=0.5):
        super(LSTMModel, self).__init__()
        # Create a dropout object to use on layers for regularization
        self.drop = nn.Dropout(dropout)
        # Create an encoder - which is an embedding layer
        self.encoder = nn.Embedding(ntoken, ninp)
        # Create the LSTM layers - find out how to stack them !
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)
        # Create what we call the decoder: a linear transformation to map the hidden state into scores for all words in the vocabulary
        # (Note that the softmax application function will be applied out of the model)
        self.decoder = nn.Linear(nhid, ntoken)
        
        # Initialize non-reccurent weights 
        self.init_weights()

        self.ninp = ninp
        self.nhid = nhid
        self.nlayers = nlayers
        
    def init_weights(self):
        # Initialize the encoder and decoder weights with the uniform distribution
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)
        
    def init_hidden(self, batch_size):
        # Initialize the hidden state and cell state to zero, with the right sizes
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, batch_size, self.nhid),
                weight.new_zeros(self.nlayers, batch_size, self.nhid))    

    def forward(self, input, hidden, return_h=False):
        # Process the input
        emb = self.drop(self.encoder(input))   
        
        # Apply the LSTMs
        output, hidden = self.rnn(emb, hidden)
        
        # Decode into scores
        output = self.drop(output)      
        decoded = self.decoder(output)
        return decoded, hidden

### Building the Model

In [77]:
# Set the random seed manually for reproducibility.
torch.manual_seed(1)

# If you have Cuda installed and a GPU available
cuda = False
if torch.cuda.is_available():
    if not cuda:
        print("WARNING: You have a CUDA device, so you should probably choose cuda = True")
        
device = torch.device("cuda" if cuda else "cpu")

In [78]:
embedding_size = 200
hidden_size = 200
layers = 2
dropout = 0.5

###############################################################################
# Build the model
###############################################################################

vocab_size = len(corpus.dictionary)
model = LSTMModel(vocab_size, embedding_size, hidden_size, layers, dropout).to(device)
params = list(model.parameters())
criterion = nn.CrossEntropyLoss()

In [79]:
lr = 10.0
optimizer = 'sgd'
wdecay = 1.2e-6
# For gradient clipping
clip = 0.25

if optimizer == 'sgd':
    optim = torch.optim.SGD(params, lr=lr, weight_decay=wdecay)
if optimizer == 'adam':
    optim = torch.optim.Adam(params, lr=lr, weight_decay=wdecay)

In [90]:
# Let's think about gradient propagation:
# We plan to keep the second ouput of the LSTM layer (the hidden/cell states) to initialize
# the next call to LSTM. In this way, we can back-propagate the gradient for as long as we want.
# However, this put a huge strain on the memory used by the model, since it implies retaining
# a always-growing number of tensors of gradients in the cache.
# We decide to not backpropagate through time beyond the current sequence ! 
# We use a specific function to cut the 'hidden/state cell' states from their previous dependencies
# before using them to initialize the next call to the LSTM.
# This is done with the .detach() function.

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

In [91]:
# Other global parameters
epochs = 10
seq_len = 30
log_interval = 10
save = 'model.pt'

In [116]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, seq_len):
            data, targets = get_batch(data_source, i, seq_len)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output.view(-1, vocab_size), targets.view(-1)).item()
    return total_loss / (len(data_source) - 1)

In [117]:
def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    hidden = model.init_hidden(batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, seq_len)):
        data, targets = get_batch(train_data, i, seq_len)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        optim.zero_grad()
        
        output, hidden = model(data, hidden)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(params, clip)
        optim.step()
        
        total_loss += loss.data

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // seq_len, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

In [118]:
# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open(save, 'rb') as f:
    model = torch.load(f)
    # after load the rnn params are not a continuous chunk of memory
    # this makes them a continuous chunk, and will speed up forward pass
    model.rnn.flatten_parameters()

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| epoch   1 |    10/   91 batches | lr 10.00 | ms/batch 891.45 | loss  7.10 | ppl  1213.63
| epoch   1 |    20/   91 batches | lr 10.00 | ms/batch 814.38 | loss  6.47 | ppl   648.23
| epoch   1 |    30/   91 batches | lr 10.00 | ms/batch 826.88 | loss  6.46 | ppl   639.02
| epoch   1 |    40/   91 batches | lr 10.00 | ms/batch 862.41 | loss  6.44 | ppl   625.17
| epoch   1 |    50/   91 batches | lr 10.00 | ms/batch 868.89 | loss  6.39 | ppl   596.61
| epoch   1 |    60/   91 batches | lr 10.00 | ms/batch 869.08 | loss  6.43 | ppl   622.77
| epoch   1 |    70/   91 batches | lr 10.00 | ms/batch 889.35 | loss  6.39 | ppl   596.87
| epoch   1 |    80/   91 batches | lr 10.00 | ms/batch 863.10 | loss  6.40 | ppl   602.38
| epoch   1 |    90/   91 batches | lr 10.00 | ms/batch 863.47 | loss  6.41 | ppl   605.64
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 84.94s | valid loss  6.25 | valid ppl   515.44
-----------------

  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


| epoch   2 |    10/   91 batches | lr 10.00 | ms/batch 953.08 | loss  7.00 | ppl  1093.45
| epoch   2 |    20/   91 batches | lr 10.00 | ms/batch 872.14 | loss  6.38 | ppl   592.45
| epoch   2 |    30/   91 batches | lr 10.00 | ms/batch 868.43 | loss  6.35 | ppl   573.16
| epoch   2 |    40/   91 batches | lr 10.00 | ms/batch 860.24 | loss  6.35 | ppl   574.18
| epoch   2 |    50/   91 batches | lr 10.00 | ms/batch 891.05 | loss  6.32 | ppl   556.25
| epoch   2 |    60/   91 batches | lr 10.00 | ms/batch 854.66 | loss  6.36 | ppl   577.52
| epoch   2 |    70/   91 batches | lr 10.00 | ms/batch 887.27 | loss  6.30 | ppl   541.90
| epoch   2 |    80/   91 batches | lr 10.00 | ms/batch 892.64 | loss  6.31 | ppl   550.57
| epoch   2 |    90/   91 batches | lr 10.00 | ms/batch 873.40 | loss  6.33 | ppl   563.02
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 87.07s | valid loss  6.19 | valid ppl   487.63
-----------------

-----------------------------------------------------------------------------------------
| end of epoch   9 | time: 89.38s | valid loss  5.90 | valid ppl   363.45
-----------------------------------------------------------------------------------------
| epoch  10 |    10/   91 batches | lr 2.50 | ms/batch 953.72 | loss  6.35 | ppl   569.92
| epoch  10 |    20/   91 batches | lr 2.50 | ms/batch 889.94 | loss  5.75 | ppl   315.50
| epoch  10 |    30/   91 batches | lr 2.50 | ms/batch 896.64 | loss  5.76 | ppl   317.48
| epoch  10 |    40/   91 batches | lr 2.50 | ms/batch 876.19 | loss  5.77 | ppl   321.24
| epoch  10 |    50/   91 batches | lr 2.50 | ms/batch 865.93 | loss  5.73 | ppl   309.40
| epoch  10 |    60/   91 batches | lr 2.50 | ms/batch 884.51 | loss  5.75 | ppl   313.00
| epoch  10 |    70/   91 batches | lr 2.50 | ms/batch 963.31 | loss  5.70 | ppl   299.91
| epoch  10 |    80/   91 batches | lr 2.50 | ms/batch 912.85 | loss  5.72 | ppl   303.71
| epoch  10 |    90/   91 