# Character-Level LSTM in PyTorch

In this notebook, I'll construct a character-level LSTM with PyTorch. The network will train character by character on some text, then generate new text character by character. As an example, I will train on Anna Karenina. **This model will be able to generate new text based on the text from the book!**

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

First let's load in our required resources for data loading and model creation.

In [1]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

## Load in Data

Then, we'll load the Anna Karenina text file and convert it into integers for our network to use. 

In [2]:
# open text file and read in data as `text`
with open('data/anna.txt', 'r') as f:
    text = f.read()

Let's check out the first 100 characters, make sure everything is peachy. According to the [American Book Review](http://americanbookreview.org/100bestlines.asp), this is the 6th best first line of a book ever.

In [3]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

### Tokenization

In the cell below, I'm creating a pair of **dictionaries** to convert the characters to and from integers. Encoding the characters as integers makes them easier to use as input to the network.

In [4]:
# encode the text and map each character to an integer and vice versa

# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in text])

And we can see those same characters from above, encoded as integers.

In [5]:
encoded[:100]

array([47, 55, 39, 74, 63,  2, 75, 28, 58, 16, 16, 16, 82, 39, 74, 74, 24,
       28, 78, 39, 79, 36,  3, 36,  2, 77, 28, 39, 75,  2, 28, 39,  3,  3,
       28, 39,  3, 36,  5,  2,  1, 28,  2, 69,  2, 75, 24, 28, 42,  4, 55,
       39, 74, 74, 24, 28, 78, 39, 79, 36,  3, 24, 28, 36, 77, 28, 42,  4,
       55, 39, 74, 74, 24, 28, 36,  4, 28, 36, 63, 77, 28, 19, 14,  4, 16,
       14, 39, 24, 60, 16, 16, 20, 69,  2, 75, 24, 63, 55, 36,  4])
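
As a quick sanity check on the two dictionaries, the sketch below encodes a short sample string and decodes it back again. The `demo_*` names and the sample string are illustrative only. One assumption differs from the cell above: the vocabulary is sorted, because the iteration order of a `set` of strings can change between Python sessions, which is why the integers printed here may not match yours.

```python
# A minimal round-trip sketch; demo_* names are hypothetical, not part of the notebook.
demo_text = "Happy families are all alike"

# Sorting the unique characters makes the mapping reproducible across runs
# (an assumption for this sketch; the notebook uses an unsorted set).
demo_chars = tuple(sorted(set(demo_text)))
demo_int2char = dict(enumerate(demo_chars))
demo_char2int = {ch: ii for ii, ch in demo_int2char.items()}

# encode the characters as integers, then decode them back into a string
demo_encoded = [demo_char2int[ch] for ch in demo_text]
demo_decoded = "".join(demo_int2char[ii] for ii in demo_encoded)

print(demo_decoded == demo_text)
```

If the round trip returns the original string, the two dictionaries are inverses of each other.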

## Pre-processing the data

As you can see in our char-RNN image above, our LSTM expects an input that is **one-hot encoded**, meaning that each character is converted into an integer (via our created dictionary) and *then* converted into a vector in which only its corresponding integer index has the value 1 and the rest of the vector is filled with 0s. Since we're one-hot encoding the data, let's make a function to do that!


In [6]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    #print(one_hot.shape)
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    #print(one_hot.shape)
    return one_hot

In [7]:
# check that the function works as expected
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
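
For comparison, PyTorch has a built-in `torch.nn.functional.one_hot` that produces the same encoding directly from an integer tensor. The sketch below is an alternative to the NumPy helper above, not what the rest of the notebook uses. It assumes the input is an integer (`int64`) tensor, and it casts the result to float because `one_hot` returns integers while the LSTM expects floating-point input.

```python
import torch
import torch.nn.functional as F

# same toy sequence as the check above, as an integer tensor
demo_seq = torch.tensor([[3, 5, 1]])

# one_hot returns an int64 tensor of shape (*demo_seq.shape, num_classes);
# cast to float so it could be fed to an LSTM
demo_one_hot = F.one_hot(demo_seq, num_classes=8).float()

print(demo_one_hot.shape)
print(demo_one_hot)
```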


## Making training mini-batches


To train on this data, we also want to create mini-batches. Remember that we want each batch to hold multiple sequences of some desired number of sequence steps. Considering a simple example, our batches would look like this:

<img src="assets/sequence_batching@1x.png" width=500px>


<br>

In this example, we'll take the encoded characters (passed in as the `arr` parameter) and split them into multiple sequences, given by `batch_size`. Each of our sequences will be `seq_length` long.

### Creating Batches

**1. The first thing we need to do is discard some of the text so we only have completely full mini-batches.**

Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences in a batch) and $M$ is `seq_length`, the number of time steps in a sequence. Then, to get the total number of batches, $K$, that we can make from the array `arr`, you divide the length of `arr` by the number of characters per batch and round down. Once you know the number of batches, you can get the total number of characters to keep from `arr`, $N * M * K$.

**2. After that, we need to split `arr` into $N$ sequences.**

You can do this using `arr.reshape(size)` where `size` is a tuple containing the dimension sizes of the reshaped array. We know we want $N$ sequences in a batch, so let's make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size; NumPy will infer the appropriate length for you. After this, you should have an array that is $N \times (M * K)$.

**3. Now that we have this array, we can iterate through it to get our mini-batches.**

The idea is that each batch is an $N \times M$ window on the $N \times (M * K)$ array. For each subsequent batch, the window moves over by `seq_length`. We also want to create both the input and target arrays. Remember that the targets are just the inputs shifted over by one character. The way I like to do this window is to use `range` to take steps of size `seq_length` from $0$ to `arr.shape[1]`, the total number of tokens in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `seq_length` wide.
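
To make the shapes in the three steps concrete, here is a small worked sketch on a toy array of 20 integers with $N = 2$ and $M = 3$. The `demo_*` names and the toy sizes are assumptions chosen for illustration. Only the input windows are shown, so the target construction is still left for the exercise.

```python
import numpy as np

demo_arr = np.arange(20)       # stand-in for the encoded text
demo_N, demo_M = 2, 3          # batch size and sequence length

# Step 1: keep only enough characters for full batches
chars_per_batch = demo_N * demo_M              # N * M = 6
demo_K = len(demo_arr) // chars_per_batch      # K = 3 full batches
demo_kept = demo_arr[:demo_K * chars_per_batch]  # first N * M * K = 18 values

# Step 2: reshape into N rows; -1 lets NumPy infer the row length M * K
demo_rows = demo_kept.reshape((demo_N, -1))

# Step 3: slide an N x M window across the columns in steps of M
demo_windows = [demo_rows[:, n:n + demo_M]
                for n in range(0, demo_rows.shape[1], demo_M)]

print(demo_rows.shape)     # (2, 9)
print(demo_windows[0])     # [[ 0  1  2] [ 9 10 11]]
```

The last two values of the toy array (18 and 19) are discarded in step 1, since they do not fill a complete batch.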

> **TODO:** Write the code for creating batches in the function below. The exercises in this notebook _will not be easy_. I've provided a notebook with solutions alongside this notebook. If you get stuck, check out the solutions. The most important thing is that you don't copy and paste the code into here; **type out the solution code yourself.**

In [8]:
def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    
    ## TODO: Get the number of batches we can make
    n_batches = int(arr.shape[0]/(batch_size*seq_length))
    
    ## TODO: Keep only enough characters to make full batches
    arr = arr[:n_batches*batch_size*seq_length]
    
    ## TODO: Reshape into batch_size rows
    arr = arr.reshape((batch_size,-1))
    
    ## TODO: Iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:,range(n,n+seq_length)]
        # The targets, shifted by one
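        # Note: arr was trimmed to hold an exact number of batches, so for the last window
        #       the shifted range reaches one column past the end of arr and raises an IndexError.
        #       In that case, wrap around and use the first column of arr as the final target.
        #       I noticed this when I checked my answer against the solution notebook.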
        try:
            y = arr[:,range(n+1,n+seq_length+1)]
        except IndexError:
            y = np.zeros_like(x)
            y[:,:-1], y[:,-1] = x[:,1:], arr[:,0]
        yield x, y

### Test Your Implementation

Now I'll make some data sets and we can check out what's going on as we batch data. Here, as an example, I'm going to use a batch size of 8 and 50 sequence steps.

In [9]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [10]:
# printing out the first 10 items in a sequence
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[47 55 39 74 63  2 75 28 58 16]
 [77 19  4 28 63 55 39 63 28 39]
 [ 2  4 66 28 19 75 28 39 28 78]
 [77 28 63 55  2 28 59 55 36  2]
 [28 77 39 14 28 55  2 75 28 63]
 [59 42 77 77 36 19  4 28 39  4]
 [28  8  4  4 39 28 55 39 66 28]
 [26 33  3 19  4 77  5 24 60 28]]

y
 [[55 39 74 63  2 75 28 58 16 16]
 [19  4 28 63 55 39 63 28 39 63]
 [ 4 66 28 19 75 28 39 28 78 19]
 [28 63 55  2 28 59 55 36  2 78]
 [77 39 14 28 55  2 75 28 63  2]
 [42 77 77 36 19  4 28 39  4 66]
 [ 8  4  4 39 28 55 39 66 28 77]
 [33  3 19  4 77  5 24 60 28 54]]


If you implemented `get_batches` correctly, the above output should look something like 
```
x
 [[25  8 60 11 45 27 28 73  1  2]
 [17  7 20 73 45  8 60 45 73 60]
 [27 20 80 73  7 28 73 60 73 65]
 [17 73 45  8 27 73 66  8 46 27]
 [73 17 60 12 73  8 27 28 73 45]
 [66 64 17 17 46  7 20 73 60 20]
 [73 76 20 20 60 73  8 60 80 73]
 [47 35 43  7 20 17 24 50 37 73]]

y
 [[ 8 60 11 45 27 28 73  1  2  2]
 [ 7 20 73 45  8 60 45 73 60 45]
 [20 80 73  7 28 73 60 73 65  7]
 [73 45  8 27 73 66  8 46 27 65]
 [17 60 12 73  8 27 28 73 45 27]
 [64 17 17 46  7 20 73 60 20 80]
 [76 20 20 60 73  8 60 80 73 17]
 [35 43  7 20 17 24 50 37 73 36]]
 ```
 although the exact numbers may be different. Check to make sure the data is shifted over one step for `y`.
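
The shift can also be checked programmatically instead of by eye. The helper below is a hypothetical utility, not part of the original notebook. It compares every column of `x` except the first with every column of `y` except the last, and is shown on two small hand-made arrays.

```python
import numpy as np

def is_shifted_by_one(x, y):
    """Return True if y[:, t] equals x[:, t + 1] for every column t except the last."""
    return np.array_equal(x[:, 1:], y[:, :-1])

# toy inputs and targets, made up for this sketch
demo_x = np.array([[1, 2, 3], [7, 8, 9]])
demo_y = np.array([[2, 3, 4], [8, 9, 10]])

print(is_shifted_by_one(demo_x, demo_y))   # True
print(is_shifted_by_one(demo_x, demo_x))   # False
```

Calling `is_shifted_by_one(x, y)` on a batch from `get_batches` should return `True` if the targets were built correctly.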

---
## Defining the network with PyTorch

Below is where you'll define the network.

<img src="assets/charRNN.png" width=500px>

Next, you'll use PyTorch to define the architecture of the network. We start by defining the layers and operations we want. Then, define a method for the forward pass. You've also been given a method for predicting characters.

### Model Structure

In `__init__` the suggested structure is as follows:
* Create and store the necessary dictionaries (this has been done for you)
* Define an LSTM layer that takes as params: an input size (the number of characters), a hidden layer size `n_hidden`, a number of layers `n_layers`, a dropout probability `drop_prob`, and a `batch_first` boolean (True, since the first dimension of our input is the batch)
* Define a dropout layer with `drop_prob`
* Define a fully-connected layer with params: input size `n_hidden` and output size (the number of characters)
* Finally, initialize the weights (again, this has been given)

Note that some parameters have been named and given in the `__init__` function, and we use them and store them by doing something like `self.drop_prob = drop_prob`.

---
### LSTM Inputs/Outputs

You can create a basic [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) as follows

```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
```

where `input_size` is the number of features in the input at each time step (here, the length of a one-hot encoded character, i.e. the number of unique characters), and `n_hidden` is the number of units in the hidden layers in the cell. The sequence length is not a constructor argument; it is inferred from the shape of the input passed to the layer. We can add dropout by adding a `dropout` parameter with a specified probability; this applies dropout to the outputs of each LSTM layer except the last. Finally, in the `forward` function, we can use `.view` to flatten the LSTM outputs from shape `(batch_size, seq_length, n_hidden)` to `(batch_size * seq_length, n_hidden)` so that they can be passed to the fully-connected layer.

We also need to create an initial hidden state of all zeros. This is done with the `init_hidden` method defined below, which takes the batch size:

```python
h = net.init_hidden(batch_size)
```
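
To see the tensor shapes involved, here is a minimal standalone sketch. The toy sizes and `demo_*` names are assumptions for illustration. It builds a small `nn.LSTM`, feeds it an all-zero batch together with an all-zero initial state, and prints the shapes that come back.

```python
import torch
from torch import nn

# toy sizes, chosen only for illustration
demo_n_chars, demo_n_hidden, demo_n_layers = 8, 16, 2
demo_batch, demo_seq = 4, 5

demo_lstm = nn.LSTM(demo_n_chars, demo_n_hidden, demo_n_layers,
                    dropout=0.5, batch_first=True)

# with batch_first=True the input is (batch, seq, features)
demo_x = torch.zeros(demo_batch, demo_seq, demo_n_chars)

# hidden and cell state are (n_layers, batch, n_hidden), even with batch_first=True
demo_h0 = torch.zeros(demo_n_layers, demo_batch, demo_n_hidden)
demo_c0 = torch.zeros(demo_n_layers, demo_batch, demo_n_hidden)

demo_out, (demo_hn, demo_cn) = demo_lstm(demo_x, (demo_h0, demo_c0))

# flatten the per-step outputs so each row can go through a Linear layer
demo_flat = demo_out.contiguous().view(-1, demo_n_hidden)

print(demo_out.shape)    # torch.Size([4, 5, 16])
print(demo_hn.shape)     # torch.Size([2, 4, 16])
print(demo_flat.shape)   # torch.Size([20, 16])
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden and cell states keep the layer dimension first.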

In [11]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


In [31]:
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        ## TODO: define the layers of the model
        
        # Key point: the input_size for LSTM is the length of the one-hot encoded character.
        #            In other words, input_size is the length of the input vector at a single time step.
        #            Don't confuse this with the number of unrolled cells.
        #            The number of unrolled cells (i.e. the input sequence length) is inferred from the
        #            shape of the inputs when the model is called to get outputs,
        #            e.g. output, h = net(inputs, h) in the train function in the following notebook cell.
        input_size = len(self.chars)
        self.lstm = nn.LSTM(input_size, n_hidden, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        output_size = len(self.chars)
        self.fc1 = nn.Linear(n_hidden, output_size)
    
    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''
                
        ## TODO: Get the outputs and the new hidden state from the lstm
        
        # My question:
        #   In the Simple_RNN notebook, we had to do hidden = hidden.data. Why don't we have to do that here?
        # Answer:
        #   In both cases this is done in the training loop, not in forward.
        #   Here, see h = tuple([each.data for each in h]) in train().
        output, hidden = self.lstm(x, hidden)
        x = self.dropout(output)
        
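        # Taken from the solutions; I didn't find this explained in the videos.
        # Comment from the solution:
        #   "Stack up LSTM outputs using view
        #    you may need to use contiguous to reshape the output"
        # i.e. flatten (batch_size, seq_length, n_hidden) to (batch_size*seq_length, n_hidden)
        # so that every time step is passed through the fully-connected layer.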
        x = x.contiguous().view(-1, self.n_hidden)
        
        out = self.fc1(x)
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM.
        # weight.new(...) creates a tensor with the same data type as the model's parameters.
        # Note that next(self.parameters()).data is a plain torch.Tensor,
        # not a torch.nn.parameter.Parameter.
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
        

## Time to train

The train function gives us the ability to set the number of epochs, the learning rate, and other parameters.

Below we're using an Adam optimizer and cross entropy loss since we are looking at character class scores as output. We calculate the loss and perform backpropagation, as usual!
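
The sketch below illustrates the shapes that `nn.CrossEntropyLoss` expects in this setup: one row of class scores per character, of shape `(batch_size * seq_length, n_chars)`, and a flat vector of integer targets. The sizes 83, 32 and 12 are taken from the model printout and hyperparameters later in this notebook; the all-zero scores and random targets are made up. With all-zero scores the prediction is uniform over the characters, so the loss comes out at about $\ln(83) \approx 4.42$, which serves as a rough baseline for an untrained model.

```python
import math
import torch
from torch import nn

demo_n_chars, demo_batch, demo_seq = 83, 32, 12
demo_criterion = nn.CrossEntropyLoss()

# all-zero scores: every character is predicted as equally likely
demo_logits = torch.zeros(demo_batch * demo_seq, demo_n_chars)

# random integer targets of shape (batch, seq), flattened to match the scores
demo_targets = torch.randint(0, demo_n_chars, (demo_batch, demo_seq))
demo_loss = demo_criterion(demo_logits, demo_targets.reshape(demo_batch * demo_seq))

print(demo_loss.item(), math.log(demo_n_chars))
```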

A couple of details about training: 
> * Within the batch loop, we detach the hidden state from its history; this time we set it equal to a new *tuple* variable, because an LSTM has a hidden state that is a tuple of the hidden and cell states.
> * We use [`clip_grad_norm_`](https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html) to help prevent exploding gradients.
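
Both details can be seen in isolation in the following sketch. It is a toy loop under stated assumptions, not the training code: it uses a tiny single-layer LSTM, random inputs, and an artificially large loss, and all `demo_*` names are hypothetical. It also uses `detach()` where the notebook uses `.data`; both cut the hidden state off from its history.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
demo_clip = 5

# initial (hidden state, cell state), each of shape (n_layers, batch_size, n_hidden)
demo_h = (torch.zeros(1, 2, 3), torch.zeros(1, 2, 3))

for step in range(2):
    # detach both tensors in the tuple so backprop stops at this batch
    demo_h = tuple(each.detach() for each in demo_h)

    demo_x = torch.randn(2, 5, 4)
    demo_out, demo_h = demo_lstm(demo_x, demo_h)

    # an artificially large loss so that clipping is likely to kick in
    demo_loss = 1000 * demo_out.pow(2).sum()

    demo_lstm.zero_grad()
    demo_loss.backward()

    # rescales the gradients in place so their combined norm is at most demo_clip;
    # the return value is the norm measured before clipping
    norm_before = nn.utils.clip_grad_norm_(demo_lstm.parameters(), demo_clip)
    norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in demo_lstm.parameters()))
    print(step, float(norm_before), float(norm_after))
```

If the `detach()` line is removed, the second call to `backward()` tries to backpropagate through the graph of the first batch, which PyTorch has already freed, so it raises an error.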

In [32]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    if(train_on_gpu):
        net.cuda()
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backprop
            # got an error using Udacity's code, which used view; the error said to try reshape
            # (view needs a compatible memory layout, while reshape copies the data when necessary)
            #loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss = criterion(output, targets.reshape(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y
                    if(train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    # got an error using Udacity's code, which used view; the error said to try reshape
                    #val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                    val_loss = criterion(output, targets.reshape(batch_size*seq_length).long())
                
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

## Instantiating the model

Now we can actually train the network. First we'll create the network itself, with some given hyperparameters. Then, define the mini-batch sizes, and start training!

In [33]:
## TODO: set your model hyperparameters
# define and print the net
n_hidden = 64
n_layers = 2

net = CharRNN(chars, n_hidden, n_layers)
print(net)

CharRNN(
  (lstm): LSTM(83, 64, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc1): Linear(in_features=64, out_features=83, bias=True)
)
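
To get a feel for the size of this model, the sketch below rebuilds layers with the same dimensions as the printout above (83 characters, 64 hidden units, 2 layers) and counts their trainable parameters. It is a standalone illustration with hypothetical `demo_*` names and does not touch the `net` object.

```python
import torch
from torch import nn

demo_n_chars, demo_n_hidden, demo_n_layers = 83, 64, 2

demo_lstm = nn.LSTM(demo_n_chars, demo_n_hidden, demo_n_layers,
                    dropout=0.5, batch_first=True)
demo_fc = nn.Linear(demo_n_hidden, demo_n_chars)

def count_params(module):
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# each LSTM layer has 4 gates, each with input weights, hidden weights and two bias vectors:
# 4 * n_hidden * (layer_input_size + n_hidden + 2)
demo_lstm_params = count_params(demo_lstm)
demo_fc_params = count_params(demo_fc)

print(demo_lstm_params, demo_fc_params, demo_lstm_params + demo_fc_params)
# 71424 5395 76819
```

The dropout layer has no parameters, so these two layers account for all of the model's weights.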


### Set your training hyperparameters!

In [34]:
batch_size = 32
seq_length = 12
n_epochs = 8  # (also tried 20) start small if you are just testing initial behavior

# train the model
#learning_rate = 0.001
learning_rate = 0.01
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=learning_rate, print_every=10)

Epoch: 1/8... Step: 10... Loss: 3.2626... Val Loss: 3.1938
Epoch: 1/8... Step: 20... Loss: 3.3284... Val Loss: 3.1510
Epoch: 1/8... Step: 30... Loss: 3.0906... Val Loss: 3.1414
Epoch: 1/8... Step: 40... Loss: 3.1469... Val Loss: 3.1290
Epoch: 1/8... Step: 50... Loss: 3.2085... Val Loss: 3.1266
Epoch: 1/8... Step: 60... Loss: 3.1018... Val Loss: 3.1096
Epoch: 1/8... Step: 70... Loss: 2.9735... Val Loss: 3.0983
Epoch: 1/8... Step: 80... Loss: 3.0171... Val Loss: 3.0415
Epoch: 1/8... Step: 90... Loss: 2.9646... Val Loss: 2.9979
Epoch: 1/8... Step: 100... Loss: 2.9873... Val Loss: 2.9327
Epoch: 1/8... Step: 110... Loss: 2.8970... Val Loss: 2.8796
Epoch: 1/8... Step: 120... Loss: 2.8807... Val Loss: 2.8233
Epoch: 1/8... Step: 130... Loss: 2.7849... Val Loss: 2.7547
Epoch: 1/8... Step: 140... Loss: 2.8048... Val Loss: 2.7031
Epoch: 1/8... Step: 150... Loss: 2.7937... Val Loss: 2.6788
Epoch: 1/8... Step: 160... Loss: 2.7108... Val Loss: 2.6303
Epoch: 1/8... Step: 170... Loss: 2.5843... Val Lo

Epoch: 1/8... Step: 1380... Loss: 2.1741... Val Loss: 2.0163
Epoch: 1/8... Step: 1390... Loss: 1.9944... Val Loss: 2.0183
Epoch: 1/8... Step: 1400... Loss: 2.1777... Val Loss: 2.0177
Epoch: 1/8... Step: 1410... Loss: 2.1744... Val Loss: 2.0178
Epoch: 1/8... Step: 1420... Loss: 2.2433... Val Loss: 2.0158
Epoch: 1/8... Step: 1430... Loss: 2.1557... Val Loss: 2.0171
Epoch: 1/8... Step: 1440... Loss: 2.1137... Val Loss: 2.0087
Epoch: 1/8... Step: 1450... Loss: 1.9355... Val Loss: 1.9995
Epoch: 1/8... Step: 1460... Loss: 2.1855... Val Loss: 2.0063
Epoch: 1/8... Step: 1470... Loss: 2.1676... Val Loss: 2.0064
Epoch: 1/8... Step: 1480... Loss: 1.9632... Val Loss: 2.0022
Epoch: 1/8... Step: 1490... Loss: 2.0845... Val Loss: 1.9973
Epoch: 1/8... Step: 1500... Loss: 1.9398... Val Loss: 1.9995
Epoch: 1/8... Step: 1510... Loss: 2.2189... Val Loss: 2.0018
Epoch: 1/8... Step: 1520... Loss: 2.3405... Val Loss: 1.9981
Epoch: 1/8... Step: 1530... Loss: 2.1218... Val Loss: 1.9977
Epoch: 1/8... Step: 1540

Epoch: 1/8... Step: 2730... Loss: 2.0312... Val Loss: 1.9123
Epoch: 1/8... Step: 2740... Loss: 1.9484... Val Loss: 1.9131
Epoch: 1/8... Step: 2750... Loss: 2.0670... Val Loss: 1.9098
Epoch: 1/8... Step: 2760... Loss: 1.8818... Val Loss: 1.9093
Epoch: 1/8... Step: 2770... Loss: 2.0939... Val Loss: 1.9101
Epoch: 1/8... Step: 2780... Loss: 2.0496... Val Loss: 1.9069
Epoch: 1/8... Step: 2790... Loss: 2.0071... Val Loss: 1.9091
Epoch: 1/8... Step: 2800... Loss: 2.0300... Val Loss: 1.9083
Epoch: 1/8... Step: 2810... Loss: 1.9474... Val Loss: 1.9097
Epoch: 1/8... Step: 2820... Loss: 1.9941... Val Loss: 1.9042
Epoch: 1/8... Step: 2830... Loss: 2.0672... Val Loss: 1.9018
Epoch: 1/8... Step: 2840... Loss: 1.8289... Val Loss: 1.9079
Epoch: 1/8... Step: 2850... Loss: 2.0050... Val Loss: 1.9021
Epoch: 1/8... Step: 2860... Loss: 1.8946... Val Loss: 1.9004
Epoch: 1/8... Step: 2870... Loss: 2.0198... Val Loss: 1.9024
Epoch: 1/8... Step: 2880... Loss: 1.9676... Val Loss: 1.9013
Epoch: 1/8... Step: 2890

Epoch: 1/8... Step: 4080... Loss: 1.7457... Val Loss: 1.8683
Epoch: 1/8... Step: 4090... Loss: 2.0687... Val Loss: 1.8685
Epoch: 1/8... Step: 4100... Loss: 2.0592... Val Loss: 1.8676
Epoch: 1/8... Step: 4110... Loss: 1.9953... Val Loss: 1.8685
Epoch: 1/8... Step: 4120... Loss: 2.0862... Val Loss: 1.8711
Epoch: 1/8... Step: 4130... Loss: 1.9354... Val Loss: 1.8632
Epoch: 1/8... Step: 4140... Loss: 1.9445... Val Loss: 1.8626
Epoch: 1/8... Step: 4150... Loss: 1.9673... Val Loss: 1.8638
Epoch: 1/8... Step: 4160... Loss: 1.9704... Val Loss: 1.8625
Epoch: 1/8... Step: 4170... Loss: 1.9115... Val Loss: 1.8656
Epoch: 1/8... Step: 4180... Loss: 1.9877... Val Loss: 1.8703
Epoch: 1/8... Step: 4190... Loss: 1.9388... Val Loss: 1.8656
Epoch: 1/8... Step: 4200... Loss: 2.0393... Val Loss: 1.8591
Epoch: 1/8... Step: 4210... Loss: 1.7618... Val Loss: 1.8577
Epoch: 1/8... Step: 4220... Loss: 1.9341... Val Loss: 1.8576
Epoch: 1/8... Step: 4230... Loss: 1.8668... Val Loss: 1.8614
Epoch: 1/8... Step: 4240

Epoch: 2/8... Step: 5430... Loss: 2.0396... Val Loss: 1.8353
Epoch: 2/8... Step: 5440... Loss: 2.0224... Val Loss: 1.8348
Epoch: 2/8... Step: 5450... Loss: 2.0173... Val Loss: 1.8298
Epoch: 2/8... Step: 5460... Loss: 1.8342... Val Loss: 1.8283
Epoch: 2/8... Step: 5470... Loss: 1.7577... Val Loss: 1.8310
Epoch: 2/8... Step: 5480... Loss: 1.9547... Val Loss: 1.8373
Epoch: 2/8... Step: 5490... Loss: 1.8286... Val Loss: 1.8367
Epoch: 2/8... Step: 5500... Loss: 1.8838... Val Loss: 1.8345
Epoch: 2/8... Step: 5510... Loss: 1.9091... Val Loss: 1.8302
Epoch: 2/8... Step: 5520... Loss: 1.8122... Val Loss: 1.8319
Epoch: 2/8... Step: 5530... Loss: 1.9852... Val Loss: 1.8317
Epoch: 2/8... Step: 5540... Loss: 1.9233... Val Loss: 1.8386
Epoch: 2/8... Step: 5550... Loss: 1.8595... Val Loss: 1.8355
Epoch: 2/8... Step: 5560... Loss: 1.8573... Val Loss: 1.8304
Epoch: 2/8... Step: 5570... Loss: 1.8713... Val Loss: 1.8286
Epoch: 2/8... Step: 5580... Loss: 1.7459... Val Loss: 1.8300
Epoch: 2/8... Step: 5590

Epoch: 2/8... Step: 6780... Loss: 1.8229... Val Loss: 1.8170
Epoch: 2/8... Step: 6790... Loss: 2.1030... Val Loss: 1.8144
Epoch: 2/8... Step: 6800... Loss: 1.9057... Val Loss: 1.8162
Epoch: 2/8... Step: 6810... Loss: 1.9502... Val Loss: 1.8152
Epoch: 2/8... Step: 6820... Loss: 2.1271... Val Loss: 1.8152
Epoch: 2/8... Step: 6830... Loss: 1.8580... Val Loss: 1.8153
Epoch: 2/8... Step: 6840... Loss: 1.9429... Val Loss: 1.8173
Epoch: 2/8... Step: 6850... Loss: 1.7028... Val Loss: 1.8148
Epoch: 2/8... Step: 6860... Loss: 1.8296... Val Loss: 1.8114
Epoch: 2/8... Step: 6870... Loss: 1.8515... Val Loss: 1.8101
Epoch: 2/8... Step: 6880... Loss: 1.8439... Val Loss: 1.8105
Epoch: 2/8... Step: 6890... Loss: 1.9484... Val Loss: 1.8136
Epoch: 2/8... Step: 6900... Loss: 1.8936... Val Loss: 1.8091
Epoch: 2/8... Step: 6910... Loss: 2.0403... Val Loss: 1.8104
Epoch: 2/8... Step: 6920... Loss: 1.9957... Val Loss: 1.8157
Epoch: 2/8... Step: 6930... Loss: 2.0728... Val Loss: 1.8172
Epoch: 2/8... Step: 6940

Epoch: 2/8... Step: 8130... Loss: 1.9042... Val Loss: 1.8054
Epoch: 2/8... Step: 8140... Loss: 1.8897... Val Loss: 1.8047
Epoch: 2/8... Step: 8150... Loss: 1.9381... Val Loss: 1.8028
Epoch: 2/8... Step: 8160... Loss: 1.9529... Val Loss: 1.8028
Epoch: 2/8... Step: 8170... Loss: 1.9756... Val Loss: 1.8015
Epoch: 2/8... Step: 8180... Loss: 1.9101... Val Loss: 1.8031
Epoch: 2/8... Step: 8190... Loss: 1.9156... Val Loss: 1.8020
Epoch: 2/8... Step: 8200... Loss: 1.9839... Val Loss: 1.8028
Epoch: 2/8... Step: 8210... Loss: 1.8671... Val Loss: 1.7993
Epoch: 2/8... Step: 8220... Loss: 1.7940... Val Loss: 1.7970
Epoch: 2/8... Step: 8230... Loss: 1.9165... Val Loss: 1.7969
Epoch: 2/8... Step: 8240... Loss: 1.8304... Val Loss: 1.8013
Epoch: 2/8... Step: 8250... Loss: 1.8828... Val Loss: 1.8069
Epoch: 2/8... Step: 8260... Loss: 2.0859... Val Loss: 1.8031
Epoch: 2/8... Step: 8270... Loss: 1.9410... Val Loss: 1.8014
Epoch: 2/8... Step: 8280... Loss: 1.8167... Val Loss: 1.7987
Epoch: 2/8... Step: 8290

Epoch: 3/8... Step: 9480... Loss: 1.8104... Val Loss: 1.7966
Epoch: 3/8... Step: 9490... Loss: 2.0925... Val Loss: 1.7957
Epoch: 3/8... Step: 9500... Loss: 1.7716... Val Loss: 1.7936
Epoch: 3/8... Step: 9510... Loss: 1.8781... Val Loss: 1.7934
Epoch: 3/8... Step: 9520... Loss: 1.7725... Val Loss: 1.7945
Epoch: 3/8... Step: 9530... Loss: 2.0763... Val Loss: 1.7985
Epoch: 3/8... Step: 9540... Loss: 1.9224... Val Loss: 1.8033
Epoch: 3/8... Step: 9550... Loss: 1.8162... Val Loss: 1.7991
Epoch: 3/8... Step: 9560... Loss: 1.8426... Val Loss: 1.7947
Epoch: 3/8... Step: 9570... Loss: 1.8766... Val Loss: 1.7949
Epoch: 3/8... Step: 9580... Loss: 1.8063... Val Loss: 1.7947
Epoch: 3/8... Step: 9590... Loss: 1.8782... Val Loss: 1.7959
Epoch: 3/8... Step: 9600... Loss: 1.8036... Val Loss: 1.7956
Epoch: 3/8... Step: 9610... Loss: 1.7936... Val Loss: 1.7959
Epoch: 3/8... Step: 9620... Loss: 1.8983... Val Loss: 1.7952
Epoch: 3/8... Step: 9630... Loss: 1.9012... Val Loss: 1.7958
Epoch: 3/8... Step: 9640

Epoch: 3/8... Step: 10810... Loss: 1.8133... Val Loss: 1.7800
Epoch: 3/8... Step: 10820... Loss: 1.6933... Val Loss: 1.7819
Epoch: 3/8... Step: 10830... Loss: 1.9407... Val Loss: 1.7832
Epoch: 3/8... Step: 10840... Loss: 2.0846... Val Loss: 1.7862
Epoch: 3/8... Step: 10850... Loss: 1.8855... Val Loss: 1.7840
Epoch: 3/8... Step: 10860... Loss: 1.7149... Val Loss: 1.7811
Epoch: 3/8... Step: 10870... Loss: 1.8627... Val Loss: 1.7794
Epoch: 3/8... Step: 10880... Loss: 1.7771... Val Loss: 1.7798
Epoch: 3/8... Step: 10890... Loss: 1.8370... Val Loss: 1.7839
Epoch: 3/8... Step: 10900... Loss: 1.8665... Val Loss: 1.7847
Epoch: 3/8... Step: 10910... Loss: 1.8650... Val Loss: 1.7808
Epoch: 3/8... Step: 10920... Loss: 1.9426... Val Loss: 1.7822
Epoch: 3/8... Step: 10930... Loss: 1.9928... Val Loss: 1.7802
Epoch: 3/8... Step: 10940... Loss: 1.9734... Val Loss: 1.7861
Epoch: 3/8... Step: 10950... Loss: 1.8590... Val Loss: 1.7879
Epoch: 3/8... Step: 10960... Loss: 1.7374... Val Loss: 1.7852
Epoch: 3

Epoch: 3/8... Step: 12140... Loss: 1.8832... Val Loss: 1.7798
Epoch: 3/8... Step: 12150... Loss: 1.9231... Val Loss: 1.7813
Epoch: 3/8... Step: 12160... Loss: 1.8806... Val Loss: 1.7806
Epoch: 3/8... Step: 12170... Loss: 1.8153... Val Loss: 1.7799
Epoch: 3/8... Step: 12180... Loss: 2.0500... Val Loss: 1.7801
Epoch: 3/8... Step: 12190... Loss: 1.8560... Val Loss: 1.7810
Epoch: 3/8... Step: 12200... Loss: 1.8649... Val Loss: 1.7845
Epoch: 3/8... Step: 12210... Loss: 1.8997... Val Loss: 1.7801
Epoch: 3/8... Step: 12220... Loss: 1.8896... Val Loss: 1.7757
Epoch: 3/8... Step: 12230... Loss: 1.7992... Val Loss: 1.7803
Epoch: 3/8... Step: 12240... Loss: 1.8040... Val Loss: 1.7792
Epoch: 3/8... Step: 12250... Loss: 1.9124... Val Loss: 1.7796
Epoch: 3/8... Step: 12260... Loss: 1.8233... Val Loss: 1.7767
Epoch: 3/8... Step: 12270... Loss: 1.8425... Val Loss: 1.7768
Epoch: 3/8... Step: 12280... Loss: 1.9404... Val Loss: 1.7786
Epoch: 3/8... Step: 12290... Loss: 1.8327... Val Loss: 1.7771
Epoch: 3

Epoch: 3/8... Step: 13470... Loss: 1.8434... Val Loss: 1.7815
Epoch: 3/8... Step: 13480... Loss: 1.8451... Val Loss: 1.7827
Epoch: 3/8... Step: 13490... Loss: 1.8571... Val Loss: 1.7841
Epoch: 3/8... Step: 13500... Loss: 1.8600... Val Loss: 1.7796
Epoch: 3/8... Step: 13510... Loss: 1.9218... Val Loss: 1.7784
Epoch: 3/8... Step: 13520... Loss: 1.8373... Val Loss: 1.7785
Epoch: 3/8... Step: 13530... Loss: 1.8009... Val Loss: 1.7789
Epoch: 3/8... Step: 13540... Loss: 1.9108... Val Loss: 1.7802
Epoch: 3/8... Step: 13550... Loss: 1.9645... Val Loss: 1.7824
Epoch: 3/8... Step: 13560... Loss: 1.8711... Val Loss: 1.7801
Epoch: 3/8... Step: 13570... Loss: 1.8749... Val Loss: 1.7784
Epoch: 3/8... Step: 13580... Loss: 1.7978... Val Loss: 1.7750
Epoch: 3/8... Step: 13590... Loss: 1.8473... Val Loss: 1.7764
Epoch: 3/8... Step: 13600... Loss: 1.8831... Val Loss: 1.7780
Epoch: 3/8... Step: 13610... Loss: 2.0803... Val Loss: 1.7778
Epoch: 3/8... Step: 13620... Loss: 1.8346... Val Loss: 1.7757
Epoch: 3

Epoch: 4/8... Step: 14800... Loss: 1.9806... Val Loss: 1.7706
Epoch: 4/8... Step: 14810... Loss: 1.7696... Val Loss: 1.7712
Epoch: 4/8... Step: 14820... Loss: 1.7933... Val Loss: 1.7708
Epoch: 4/8... Step: 14830... Loss: 1.8649... Val Loss: 1.7709
Epoch: 4/8... Step: 14840... Loss: 1.8757... Val Loss: 1.7714
Epoch: 4/8... Step: 14850... Loss: 1.8259... Val Loss: 1.7723
Epoch: 4/8... Step: 14860... Loss: 1.8213... Val Loss: 1.7666
Epoch: 4/8... Step: 14870... Loss: 1.8520... Val Loss: 1.7629
Epoch: 4/8... Step: 14880... Loss: 1.8537... Val Loss: 1.7588
Epoch: 4/8... Step: 14890... Loss: 1.7469... Val Loss: 1.7593
Epoch: 4/8... Step: 14900... Loss: 1.9377... Val Loss: 1.7624
Epoch: 4/8... Step: 14910... Loss: 1.9840... Val Loss: 1.7644
Epoch: 4/8... Step: 14920... Loss: 1.9214... Val Loss: 1.7638
Epoch: 4/8... Step: 14930... Loss: 1.7901... Val Loss: 1.7702
Epoch: 4/8... Step: 14940... Loss: 1.7972... Val Loss: 1.7702
Epoch: 4/8... Step: 14950... Loss: 1.8280... Val Loss: 1.7689
Epoch: 4

Epoch: 4/8... Step: 16130... Loss: 1.9217... Val Loss: 1.7604
Epoch: 4/8... Step: 16140... Loss: 1.8210... Val Loss: 1.7632
Epoch: 4/8... Step: 16150... Loss: 1.8918... Val Loss: 1.7654
Epoch: 4/8... Step: 16160... Loss: 1.9027... Val Loss: 1.7642
Epoch: 4/8... Step: 16170... Loss: 1.8620... Val Loss: 1.7694
Epoch: 4/8... Step: 16180... Loss: 1.9470... Val Loss: 1.7701
Epoch: 4/8... Step: 16190... Loss: 1.8876... Val Loss: 1.7680
Epoch: 4/8... Step: 16200... Loss: 1.8380... Val Loss: 1.7660
Epoch: 4/8... Step: 16210... Loss: 2.0523... Val Loss: 1.7701
Epoch: 4/8... Step: 16220... Loss: 1.8473... Val Loss: 1.7711
Epoch: 4/8... Step: 16230... Loss: 1.8216... Val Loss: 1.7689
Epoch: 4/8... Step: 16240... Loss: 1.8317... Val Loss: 1.7670
Epoch: 4/8... Step: 16250... Loss: 1.7662... Val Loss: 1.7721
Epoch: 4/8... Step: 16260... Loss: 1.7984... Val Loss: 1.7697
Epoch: 4/8... Step: 16270... Loss: 1.8211... Val Loss: 1.7693
Epoch: 4/8... Step: 16280... Loss: 1.9329... Val Loss: 1.7667
Epoch: 4

Epoch: 4/8... Step: 17460... Loss: 2.0085... Val Loss: 1.7658
Epoch: 4/8... Step: 17470... Loss: 1.8136... Val Loss: 1.7672
Epoch: 4/8... Step: 17480... Loss: 1.9304... Val Loss: 1.7669
Epoch: 4/8... Step: 17490... Loss: 1.8573... Val Loss: 1.7690
Epoch: 4/8... Step: 17500... Loss: 1.8637... Val Loss: 1.7686
Epoch: 4/8... Step: 17510... Loss: 1.9740... Val Loss: 1.7673
Epoch: 4/8... Step: 17520... Loss: 1.7731... Val Loss: 1.7639
Epoch: 4/8... Step: 17530... Loss: 1.7706... Val Loss: 1.7639
Epoch: 4/8... Step: 17540... Loss: 1.7924... Val Loss: 1.7616
Epoch: 4/8... Step: 17550... Loss: 1.6739... Val Loss: 1.7642
Epoch: 4/8... Step: 17560... Loss: 1.9768... Val Loss: 1.7645
Epoch: 4/8... Step: 17570... Loss: 1.9024... Val Loss: 1.7628
Epoch: 4/8... Step: 17580... Loss: 1.7937... Val Loss: 1.7648
Epoch: 4/8... Step: 17590... Loss: 1.8570... Val Loss: 1.7645
Epoch: 4/8... Step: 17600... Loss: 1.9101... Val Loss: 1.7642
Epoch: 4/8... Step: 17610... Loss: 1.8788... Val Loss: 1.7638
Epoch: 4

Epoch: 5/8... Step: 18790... Loss: 1.7287... Val Loss: 1.7615
Epoch: 5/8... Step: 18800... Loss: 1.8863... Val Loss: 1.7614
Epoch: 5/8... Step: 18810... Loss: 1.7147... Val Loss: 1.7615
Epoch: 5/8... Step: 18820... Loss: 1.7144... Val Loss: 1.7613
Epoch: 5/8... Step: 18830... Loss: 1.7745... Val Loss: 1.7622
Epoch: 5/8... Step: 18840... Loss: 1.9450... Val Loss: 1.7666
Epoch: 5/8... Step: 18850... Loss: 1.7512... Val Loss: 1.7682
Epoch: 5/8... Step: 18860... Loss: 1.8283... Val Loss: 1.7651
Epoch: 5/8... Step: 18870... Loss: 1.8738... Val Loss: 1.7619
Epoch: 5/8... Step: 18880... Loss: 1.9820... Val Loss: 1.7612
Epoch: 5/8... Step: 18890... Loss: 1.7061... Val Loss: 1.7639
Epoch: 5/8... Step: 18900... Loss: 1.8766... Val Loss: 1.7628
Epoch: 5/8... Step: 18910... Loss: 1.8630... Val Loss: 1.7595
Epoch: 5/8... Step: 18920... Loss: 1.7803... Val Loss: 1.7578
Epoch: 5/8... Step: 18930... Loss: 1.8283... Val Loss: 1.7599
Epoch: 5/8... Step: 18940... Loss: 1.9841... Val Loss: 1.7591
Epoch: 5

Epoch: 5/8... Step: 20120... Loss: 1.6891... Val Loss: 1.7554
Epoch: 5/8... Step: 20130... Loss: 1.8146... Val Loss: 1.7546
Epoch: 5/8... Step: 20140... Loss: 1.6746... Val Loss: 1.7529
Epoch: 5/8... Step: 20150... Loss: 1.7502... Val Loss: 1.7559
Epoch: 5/8... Step: 20160... Loss: 1.8057... Val Loss: 1.7576
Epoch: 5/8... Step: 20170... Loss: 1.9280... Val Loss: 1.7568
Epoch: 5/8... Step: 20180... Loss: 1.7939... Val Loss: 1.7538
Epoch: 5/8... Step: 20190... Loss: 1.7758... Val Loss: 1.7542
Epoch: 5/8... Step: 20200... Loss: 1.9243... Val Loss: 1.7532
Epoch: 5/8... Step: 20210... Loss: 1.8709... Val Loss: 1.7502
Epoch: 5/8... Step: 20220... Loss: 1.8270... Val Loss: 1.7489
Epoch: 5/8... Step: 20230... Loss: 1.9351... Val Loss: 1.7513
Epoch: 5/8... Step: 20240... Loss: 1.9457... Val Loss: 1.7575
Epoch: 5/8... Step: 20250... Loss: 1.8478... Val Loss: 1.7601
Epoch: 5/8... Step: 20260... Loss: 1.7951... Val Loss: 1.7572
Epoch: 5/8... Step: 20270... Loss: 1.8499... Val Loss: 1.7548
Epoch: 5

Epoch: 5/8... Step: 21450... Loss: 1.8401... Val Loss: 1.7556
Epoch: 5/8... Step: 21460... Loss: 1.8450... Val Loss: 1.7571
Epoch: 5/8... Step: 21470... Loss: 1.9133... Val Loss: 1.7580
Epoch: 5/8... Step: 21480... Loss: 1.9002... Val Loss: 1.7564
Epoch: 5/8... Step: 21490... Loss: 2.1487... Val Loss: 1.7541
Epoch: 5/8... Step: 21500... Loss: 1.7971... Val Loss: 1.7511
Epoch: 5/8... Step: 21510... Loss: 1.7650... Val Loss: 1.7496
Epoch: 5/8... Step: 21520... Loss: 1.8089... Val Loss: 1.7487
Epoch: 5/8... Step: 21530... Loss: 1.9391... Val Loss: 1.7523
Epoch: 5/8... Step: 21540... Loss: 1.8076... Val Loss: 1.7506
Epoch: 5/8... Step: 21550... Loss: 1.7579... Val Loss: 1.7499
Epoch: 5/8... Step: 21560... Loss: 1.8204... Val Loss: 1.7498
Epoch: 5/8... Step: 21570... Loss: 1.9403... Val Loss: 1.7538
Epoch: 5/8... Step: 21580... Loss: 1.8688... Val Loss: 1.7504
Epoch: 5/8... Step: 21590... Loss: 1.8243... Val Loss: 1.7491
Epoch: 5/8... Step: 21600... Loss: 1.8649... Val Loss: 1.7510
Epoch: 5

Epoch: 5/8... Step: 22780... Loss: 1.9506... Val Loss: 1.7548
Epoch: 5/8... Step: 22790... Loss: 1.7896... Val Loss: 1.7539
Epoch: 5/8... Step: 22800... Loss: 1.7915... Val Loss: 1.7524
Epoch: 5/8... Step: 22810... Loss: 2.0043... Val Loss: 1.7516
Epoch: 5/8... Step: 22820... Loss: 1.8999... Val Loss: 1.7509
Epoch: 5/8... Step: 22830... Loss: 1.9198... Val Loss: 1.7514
Epoch: 5/8... Step: 22840... Loss: 1.7786... Val Loss: 1.7576
Epoch: 5/8... Step: 22850... Loss: 1.8071... Val Loss: 1.7583
Epoch: 5/8... Step: 22860... Loss: 1.7735... Val Loss: 1.7556
Epoch: 5/8... Step: 22870... Loss: 1.8718... Val Loss: 1.7546
Epoch: 5/8... Step: 22880... Loss: 1.8730... Val Loss: 1.7515
Epoch: 5/8... Step: 22890... Loss: 1.8033... Val Loss: 1.7515
Epoch: 5/8... Step: 22900... Loss: 1.8995... Val Loss: 1.7525
Epoch: 5/8... Step: 22910... Loss: 1.7806... Val Loss: 1.7543
Epoch: 5/8... Step: 22920... Loss: 1.7822... Val Loss: 1.7524
Epoch: 5/8... Step: 22930... Loss: 1.9152... Val Loss: 1.7516
Epoch: 5

Epoch: 6/8... Step: 24110... Loss: 1.9191... Val Loss: 1.7462
Epoch: 6/8... Step: 24120... Loss: 1.8172... Val Loss: 1.7439
Epoch: 6/8... Step: 24130... Loss: 1.7844... Val Loss: 1.7452
Epoch: 6/8... Step: 24140... Loss: 1.8349... Val Loss: 1.7458
Epoch: 6/8... Step: 24150... Loss: 2.0377... Val Loss: 1.7465
Epoch: 6/8... Step: 24160... Loss: 1.8404... Val Loss: 1.7476
Epoch: 6/8... Step: 24170... Loss: 1.7461... Val Loss: 1.7450
Epoch: 6/8... Step: 24180... Loss: 1.7003... Val Loss: 1.7459
Epoch: 6/8... Step: 24190... Loss: 1.8575... Val Loss: 1.7471
Epoch: 6/8... Step: 24200... Loss: 1.9594... Val Loss: 1.7497
Epoch: 6/8... Step: 24210... Loss: 1.8000... Val Loss: 1.7488
Epoch: 6/8... Step: 24220... Loss: 1.8413... Val Loss: 1.7483
Epoch: 6/8... Step: 24230... Loss: 1.9845... Val Loss: 1.7483
Epoch: 6/8... Step: 24240... Loss: 1.6730... Val Loss: 1.7469
Epoch: 6/8... Step: 24250... Loss: 2.0480... Val Loss: 1.7485
Epoch: 6/8... Step: 24260... Loss: 1.8540... Val Loss: 1.7466
Epoch: 6

Epoch: 6/8... Step: 25440... Loss: 1.9330... Val Loss: 1.7462
Epoch: 6/8... Step: 25450... Loss: 1.8006... Val Loss: 1.7478
Epoch: 6/8... Step: 25460... Loss: 1.8241... Val Loss: 1.7472
Epoch: 6/8... Step: 25470... Loss: 1.8987... Val Loss: 1.7475
Epoch: 6/8... Step: 25480... Loss: 1.7071... Val Loss: 1.7519
Epoch: 6/8... Step: 25490... Loss: 1.7136... Val Loss: 1.7529
Epoch: 6/8... Step: 25500... Loss: 1.8059... Val Loss: 1.7495
Epoch: 6/8... Step: 25510... Loss: 1.7098... Val Loss: 1.7467
Epoch: 6/8... Step: 25520... Loss: 1.8989... Val Loss: 1.7473
Epoch: 6/8... Step: 25530... Loss: 1.8660... Val Loss: 1.7488
Epoch: 6/8... Step: 25540... Loss: 1.8534... Val Loss: 1.7491
Epoch: 6/8... Step: 25550... Loss: 1.7989... Val Loss: 1.7496
Epoch: 6/8... Step: 25560... Loss: 1.7649... Val Loss: 1.7486
Epoch: 6/8... Step: 25570... Loss: 1.7357... Val Loss: 1.7421
Epoch: 6/8... Step: 25580... Loss: 1.9274... Val Loss: 1.7415
Epoch: 6/8... Step: 25590... Loss: 1.8215... Val Loss: 1.7409
Epoch: 6

Epoch: 6/8... Step: 26770... Loss: 1.9371... Val Loss: 1.7445
Epoch: 6/8... Step: 26780... Loss: 1.7445... Val Loss: 1.7431
Epoch: 6/8... Step: 26790... Loss: 1.7492... Val Loss: 1.7425
Epoch: 6/8... Step: 26800... Loss: 1.7790... Val Loss: 1.7462
Epoch: 6/8... Step: 26810... Loss: 1.6470... Val Loss: 1.7469
Epoch: 6/8... Step: 26820... Loss: 1.8441... Val Loss: 1.7463
Epoch: 6/8... Step: 26830... Loss: 1.8860... Val Loss: 1.7460
Epoch: 6/8... Step: 26840... Loss: 1.6558... Val Loss: 1.7453
Epoch: 6/8... Step: 26850... Loss: 1.5879... Val Loss: 1.7457
Epoch: 6/8... Step: 26860... Loss: 1.7204... Val Loss: 1.7478
Epoch: 6/8... Step: 26870... Loss: 1.8227... Val Loss: 1.7471
Epoch: 6/8... Step: 26880... Loss: 1.7699... Val Loss: 1.7497
Epoch: 6/8... Step: 26890... Loss: 1.8138... Val Loss: 1.7491
Epoch: 6/8... Step: 26900... Loss: 1.6164... Val Loss: 1.7485
Epoch: 6/8... Step: 26910... Loss: 1.7652... Val Loss: 1.7468
Epoch: 6/8... Step: 26920... Loss: 1.7606... Val Loss: 1.7460
Epoch: 6

Epoch: 7/8... Step: 28100... Loss: 2.0198... Val Loss: 1.7414
Epoch: 7/8... Step: 28110... Loss: 1.6891... Val Loss: 1.7427
Epoch: 7/8... Step: 28120... Loss: 1.8279... Val Loss: 1.7431
Epoch: 7/8... Step: 28130... Loss: 1.7736... Val Loss: 1.7436
Epoch: 7/8... Step: 28140... Loss: 1.8987... Val Loss: 1.7451
Epoch: 7/8... Step: 28150... Loss: 1.7260... Val Loss: 1.7477
Epoch: 7/8... Step: 28160... Loss: 1.7979... Val Loss: 1.7454
Epoch: 7/8... Step: 28170... Loss: 1.7475... Val Loss: 1.7441
Epoch: 7/8... Step: 28180... Loss: 1.8875... Val Loss: 1.7466
Epoch: 7/8... Step: 28190... Loss: 1.7397... Val Loss: 1.7454
Epoch: 7/8... Step: 28200... Loss: 1.8891... Val Loss: 1.7436
Epoch: 7/8... Step: 28210... Loss: 1.8761... Val Loss: 1.7413
Epoch: 7/8... Step: 28220... Loss: 1.7370... Val Loss: 1.7404
Epoch: 7/8... Step: 28230... Loss: 1.8731... Val Loss: 1.7415
Epoch: 7/8... Step: 28240... Loss: 1.9399... Val Loss: 1.7458
Epoch: 7/8... Step: 28250... Loss: 1.7496... Val Loss: 1.7447
Epoch: 7

Epoch: 7/8... Step: 29430... Loss: 1.9776... Val Loss: 1.7423
Epoch: 7/8... Step: 29440... Loss: 1.7624... Val Loss: 1.7411
Epoch: 7/8... Step: 29450... Loss: 1.8399... Val Loss: 1.7379
Epoch: 7/8... Step: 29460... Loss: 1.7274... Val Loss: 1.7357
Epoch: 7/8... Step: 29470... Loss: 1.7624... Val Loss: 1.7349
Epoch: 7/8... Step: 29480... Loss: 1.8889... Val Loss: 1.7365
Epoch: 7/8... Step: 29490... Loss: 1.7545... Val Loss: 1.7378
Epoch: 7/8... Step: 29500... Loss: 1.8119... Val Loss: 1.7405
Epoch: 7/8... Step: 29510... Loss: 1.7519... Val Loss: 1.7377
Epoch: 7/8... Step: 29520... Loss: 1.6057... Val Loss: 1.7368
Epoch: 7/8... Step: 29530... Loss: 1.6922... Val Loss: 1.7368
Epoch: 7/8... Step: 29540... Loss: 1.7716... Val Loss: 1.7410
Epoch: 7/8... Step: 29550... Loss: 1.8155... Val Loss: 1.7462
Epoch: 7/8... Step: 29560... Loss: 1.8403... Val Loss: 1.7414
Epoch: 7/8... Step: 29570... Loss: 1.7407... Val Loss: 1.7414
Epoch: 7/8... Step: 29580... Loss: 1.7141... Val Loss: 1.7435
Epoch: 7

Epoch: 7/8... Step: 30760... Loss: 1.7689... Val Loss: 1.7412
Epoch: 7/8... Step: 30770... Loss: 1.8739... Val Loss: 1.7405
Epoch: 7/8... Step: 30780... Loss: 1.9728... Val Loss: 1.7392
Epoch: 7/8... Step: 30790... Loss: 1.7094... Val Loss: 1.7403
Epoch: 7/8... Step: 30800... Loss: 1.7599... Val Loss: 1.7393
Epoch: 7/8... Step: 30810... Loss: 1.7041... Val Loss: 1.7377
Epoch: 7/8... Step: 30820... Loss: 1.8331... Val Loss: 1.7355
Epoch: 7/8... Step: 30830... Loss: 1.8302... Val Loss: 1.7354
Epoch: 7/8... Step: 30840... Loss: 1.9864... Val Loss: 1.7376
Epoch: 7/8... Step: 30850... Loss: 1.8422... Val Loss: 1.7365
Epoch: 7/8... Step: 30860... Loss: 1.9152... Val Loss: 1.7370
Epoch: 7/8... Step: 30870... Loss: 1.8735... Val Loss: 1.7371
Epoch: 7/8... Step: 30880... Loss: 1.8622... Val Loss: 1.7393
Epoch: 7/8... Step: 30890... Loss: 1.8300... Val Loss: 1.7370
Epoch: 7/8... Step: 30900... Loss: 1.7753... Val Loss: 1.7327
Epoch: 7/8... Step: 30910... Loss: 1.8571... Val Loss: 1.7302
Epoch: 7

Epoch: 7/8... Step: 32090... Loss: 1.8485... Val Loss: 1.7410
Epoch: 7/8... Step: 32100... Loss: 1.8709... Val Loss: 1.7416
Epoch: 7/8... Step: 32110... Loss: 1.7403... Val Loss: 1.7374
Epoch: 7/8... Step: 32120... Loss: 1.9690... Val Loss: 1.7341
Epoch: 7/8... Step: 32130... Loss: 1.7347... Val Loss: 1.7319
Epoch: 7/8... Step: 32140... Loss: 1.6978... Val Loss: 1.7333
Epoch: 7/8... Step: 32150... Loss: 1.8407... Val Loss: 1.7320
Epoch: 7/8... Step: 32160... Loss: 1.8402... Val Loss: 1.7344
Epoch: 7/8... Step: 32170... Loss: 1.8243... Val Loss: 1.7321
Epoch: 7/8... Step: 32180... Loss: 1.9149... Val Loss: 1.7297
Epoch: 7/8... Step: 32190... Loss: 1.7715... Val Loss: 1.7284
Epoch: 7/8... Step: 32200... Loss: 1.8149... Val Loss: 1.7299
Epoch: 7/8... Step: 32210... Loss: 1.7896... Val Loss: 1.7309
Epoch: 7/8... Step: 32220... Loss: 1.6834... Val Loss: 1.7317
Epoch: 7/8... Step: 32230... Loss: 1.9711... Val Loss: 1.7296
Epoch: 7/8... Step: 32240... Loss: 1.7084... Val Loss: 1.7318
Epoch: 7

Epoch: 8/8... Step: 33420... Loss: 1.8729... Val Loss: 1.7339
Epoch: 8/8... Step: 33430... Loss: 1.6404... Val Loss: 1.7413
Epoch: 8/8... Step: 33440... Loss: 2.0103... Val Loss: 1.7371
Epoch: 8/8... Step: 33450... Loss: 1.8550... Val Loss: 1.7339
Epoch: 8/8... Step: 33460... Loss: 1.8851... Val Loss: 1.7335
Epoch: 8/8... Step: 33470... Loss: 1.7831... Val Loss: 1.7328
Epoch: 8/8... Step: 33480... Loss: 1.8241... Val Loss: 1.7297
Epoch: 8/8... Step: 33490... Loss: 1.7071... Val Loss: 1.7287
Epoch: 8/8... Step: 33500... Loss: 1.8849... Val Loss: 1.7334
Epoch: 8/8... Step: 33510... Loss: 1.6921... Val Loss: 1.7324
Epoch: 8/8... Step: 33520... Loss: 1.7594... Val Loss: 1.7316
Epoch: 8/8... Step: 33530... Loss: 1.8846... Val Loss: 1.7302
Epoch: 8/8... Step: 33540... Loss: 1.8204... Val Loss: 1.7317
Epoch: 8/8... Step: 33550... Loss: 1.8405... Val Loss: 1.7295
Epoch: 8/8... Step: 33560... Loss: 1.7584... Val Loss: 1.7304
Epoch: 8/8... Step: 33570... Loss: 1.5844... Val Loss: 1.7315
Epoch: 8

Epoch: 8/8... Step: 34750... Loss: 1.7707... Val Loss: 1.7352
Epoch: 8/8... Step: 34760... Loss: 1.8328... Val Loss: 1.7327
Epoch: 8/8... Step: 34770... Loss: 1.7687... Val Loss: 1.7296
Epoch: 8/8... Step: 34780... Loss: 1.8157... Val Loss: 1.7308
Epoch: 8/8... Step: 34790... Loss: 1.7688... Val Loss: 1.7282
Epoch: 8/8... Step: 34800... Loss: 1.9967... Val Loss: 1.7276
Epoch: 8/8... Step: 34810... Loss: 1.7060... Val Loss: 1.7285
Epoch: 8/8... Step: 34820... Loss: 1.7988... Val Loss: 1.7301
Epoch: 8/8... Step: 34830... Loss: 1.8087... Val Loss: 1.7336
Epoch: 8/8... Step: 34840... Loss: 1.8982... Val Loss: 1.7349
Epoch: 8/8... Step: 34850... Loss: 1.8451... Val Loss: 1.7324
Epoch: 8/8... Step: 34860... Loss: 1.7377... Val Loss: 1.7349
Epoch: 8/8... Step: 34870... Loss: 1.8969... Val Loss: 1.7329
Epoch: 8/8... Step: 34880... Loss: 1.9346... Val Loss: 1.7327
Epoch: 8/8... Step: 34890... Loss: 1.7558... Val Loss: 1.7321
Epoch: 8/8... Step: 34900... Loss: 1.7698... Val Loss: 1.7309
Epoch: 8

Epoch: 8/8... Step: 36080... Loss: 1.8944... Val Loss: 1.7328
Epoch: 8/8... Step: 36090... Loss: 1.8578... Val Loss: 1.7297
Epoch: 8/8... Step: 36100... Loss: 1.7548... Val Loss: 1.7321
Epoch: 8/8... Step: 36110... Loss: 1.8158... Val Loss: 1.7355
Epoch: 8/8... Step: 36120... Loss: 1.7536... Val Loss: 1.7349
Epoch: 8/8... Step: 36130... Loss: 1.6872... Val Loss: 1.7319
Epoch: 8/8... Step: 36140... Loss: 1.7818... Val Loss: 1.7320
Epoch: 8/8... Step: 36150... Loss: 1.7702... Val Loss: 1.7322
Epoch: 8/8... Step: 36160... Loss: 1.7634... Val Loss: 1.7348
Epoch: 8/8... Step: 36170... Loss: 1.7710... Val Loss: 1.7352
Epoch: 8/8... Step: 36180... Loss: 1.8890... Val Loss: 1.7323
Epoch: 8/8... Step: 36190... Loss: 1.6004... Val Loss: 1.7323
Epoch: 8/8... Step: 36200... Loss: 1.6733... Val Loss: 1.7292
Epoch: 8/8... Step: 36210... Loss: 1.8121... Val Loss: 1.7283
Epoch: 8/8... Step: 36220... Loss: 1.8011... Val Loss: 1.7286
Epoch: 8/8... Step: 36230... Loss: 1.7466... Val Loss: 1.7313
Epoch: 8

## Getting the best model

To set your hyperparameters for the best performance, watch the training and validation losses. If your training loss is much lower than the validation loss, you're overfitting: increase regularization (more dropout) or use a smaller network. If the training and validation losses are close, you're underfitting, so you can increase the size of the network.
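
One rough way to apply this rule of thumb is to compare averages of recently logged losses. The sketch below is only illustrative: `diagnose_fit` and its `tolerance` threshold are hypothetical and not part of this notebook, and the loss values are copied from the last four lines of the training log above.

```python
from statistics import mean

def diagnose_fit(train_losses, val_losses, tolerance=0.1):
    """Hypothetical helper: compare average training and validation loss.

    `tolerance` is an arbitrary threshold chosen for illustration only.
    """
    gap = mean(val_losses) - mean(train_losses)
    if gap > tolerance:
        # training loss is much lower than validation loss
        return "possibly overfitting"
    # losses are close (or training loss is higher)
    return "possibly underfitting"

# last four logged (Loss, Val Loss) pairs from the run above
recent_train = [1.6733, 1.8121, 1.8011, 1.7466]
recent_val = [1.7292, 1.7283, 1.7286, 1.7313]
print(diagnose_fit(recent_train, recent_val))
```

A single threshold is a crude summary. In practice you would still look at how both curves move over the whole run.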

## Hyperparameters

Here are the hyperparameters for the network.

In defining the model:
* `n_hidden` - The number of units in the hidden layers.
* `n_layers` - Number of hidden LSTM layers to use.

In this example, we assume that the dropout probability and learning rate are kept at their defaults.

And in training:
* `batch_size` - Number of sequences running through the network in one pass.
* `seq_length` - Number of characters in each sequence the network is trained on. Larger is typically better, since the network can learn longer-range dependencies, but training takes longer. 100 is typically a good number here.
* `lr` - Learning rate for training
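
To get a feel for how `batch_size` and `seq_length` interact, the sketch below estimates how many batches one pass over the data produces. It assumes each batch holds `batch_size` non-overlapping sequences of `seq_length` characters and that leftover characters are dropped. The helper name `batches_per_epoch` and the 2,000,000-character text length are made up for illustration.

```python
def batches_per_epoch(n_chars, batch_size, seq_length):
    """Hypothetical helper: number of full batches in one pass over the text."""
    chars_per_batch = batch_size * seq_length
    return n_chars // chars_per_batch  # leftover characters are dropped

# illustrative numbers only, not taken from the notebook
n_batches = batches_per_epoch(n_chars=2_000_000, batch_size=128, seq_length=100)
print(n_batches)
```

Doubling either `batch_size` or `seq_length` halves the number of steps per epoch, which is one reason step counts in the training log depend on both settings.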

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

>### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `n_hidden` and `n_layers`. I would advise that you always use `n_layers` of either 2/3. The `n_hidden` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `n_hidden` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.
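
To compare model size with dataset size as the quoted tips suggest, you can count parameters by hand. The sketch below assumes the layout shown in the `state_dict` printed at the end of this notebook: stacked LSTM layers followed by one fully connected layer. The helper name `lstm_param_count` is hypothetical, and the values 83, 64 and 2 are read off those printed shapes.

```python
def lstm_param_count(n_chars, n_hidden, n_layers):
    """Hypothetical helper: parameters in a stacked LSTM plus a final linear layer."""
    total = 0
    for layer in range(n_layers):
        # the first layer sees one-hot characters, later layers see hidden states
        n_in = n_chars if layer == 0 else n_hidden
        total += 4 * n_hidden * (n_in + n_hidden)  # input-hidden and hidden-hidden weights (4 gates)
        total += 2 * 4 * n_hidden                  # two bias vectors per layer
    total += n_hidden * n_chars + n_chars          # fully connected output layer
    return total

n_params = lstm_param_count(n_chars=83, n_hidden=64, n_layers=2)
print(n_params)
```

You can then compare `n_params` with `len(text)` to see whether the two are within the same order of magnitude.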

## Checkpoint

After training, we'll save the model so we can load it again later if we need to. Here I'm saving the parameters needed to recreate the same architecture (the hidden layer hyperparameters and the text characters) along with the model's `state_dict`.

In [35]:
# change the name, for saving multiple files
model_name = 'rnn_x_epoch.net'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)
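
As a sanity check, a checkpoint dictionary like this can be saved and restored in memory to confirm that the weights survive the round trip. The sketch below is an assumption-laden stand-in: `TinyNet` is a hypothetical miniature model (not the notebook's `CharRNN`), and an `io.BytesIO` buffer replaces the file on disk.

```python
import io
import torch
from torch import nn

class TinyNet(nn.Module):
    """Hypothetical stand-in with the same attribute names the checkpoint uses."""
    def __init__(self, tokens, n_hidden=8, n_layers=1):
        super().__init__()
        self.chars = tokens
        self.n_hidden = n_hidden
        self.n_layers = n_layers
        self.lstm = nn.LSTM(len(tokens), n_hidden, n_layers, batch_first=True)
        self.fc = nn.Linear(n_hidden, len(tokens))

net_demo = TinyNet(tokens=('a', 'b', 'c'))
demo_checkpoint = {'n_hidden': net_demo.n_hidden,
                   'n_layers': net_demo.n_layers,
                   'state_dict': net_demo.state_dict(),
                   'tokens': net_demo.chars}

# save to an in-memory buffer instead of a file, then read it back
buffer = io.BytesIO()
torch.save(demo_checkpoint, buffer)
buffer.seek(0)
restored = torch.load(buffer)

# rebuild the architecture from the saved hyperparameters, then load the weights
rebuilt = TinyNet(restored['tokens'],
                  n_hidden=restored['n_hidden'],
                  n_layers=restored['n_layers'])
rebuilt.load_state_dict(restored['state_dict'])

# every tensor in the rebuilt model should equal the original
weights_match = all(torch.equal(a, b) for a, b in
                    zip(net_demo.state_dict().values(), rebuilt.state_dict().values()))
print(weights_match)
```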

---
## Making Predictions

Now that the model is trained, we'll want to sample from it and make predictions about next characters! To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Just keep doing this and you'll generate a bunch of text!

### A note on the `predict` function

The output of our RNN is from a fully-connected layer and it outputs a **distribution of next-character scores**.

> To actually get the next character, we apply a softmax function, which gives us a *probability* distribution that we can then sample to predict the next character.
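
Here is a minimal NumPy sketch of that step, separate from the model: raw scores go through a softmax to become probabilities, and one index is sampled from them. The four scores are made up, and the local `softmax` helper is written out by hand for illustration. The notebook itself uses `F.softmax`.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# made-up scores for a 4-character vocabulary
scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)

# sample the index of the next character according to those probabilities
rng = np.random.default_rng(0)
next_index = int(rng.choice(len(probs), p=probs))
print(probs, next_index)
```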

### Top K sampling

Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable, and the distribution easier to handle (with fewer candidates), by only considering the $K$ most probable characters. This prevents the network from giving us completely absurd characters while still allowing some noise and randomness in the sampled text. Read more about [topk, here](https://pytorch.org/docs/stable/torch.html#torch.topk).
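
The idea can be sketched on a toy distribution with NumPy. The helper `top_k_sample` and the five probabilities below are hypothetical. The notebook's `predict` function does the equivalent with `p.topk(top_k)` on a tensor.

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Sample an index from only the k most probable entries of `probs`."""
    top_idx = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    top_p = probs[top_idx]
    top_p = top_p / top_p.sum()        # renormalize so the kept probabilities sum to 1
    return int(rng.choice(top_idx, p=top_p))

# made-up distribution over a 5-character vocabulary
probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
rng = np.random.default_rng(0)
samples = [top_k_sample(probs, k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `k=2`, only the two most likely indices can ever be drawn, so the unlikely tail is cut off while the choice between the leading candidates stays random.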


In [36]:
def predict(net, char, h=None, top_k=None):
    ''' Given a character, predict the next character.
        Returns the predicted character and the hidden state.
    '''

    # one-hot encode the input character and convert it to a tensor
    x = np.array([[net.char2int[char]]])
    x = one_hot_encode(x, len(net.chars))
    inputs = torch.from_numpy(x)

    if train_on_gpu:
        inputs = inputs.cuda()

    # start from a fresh hidden state if none was passed in
    if h is None:
        h = net.init_hidden(1)

    # detach hidden state from history
    h = tuple([each.data for each in h])
    # get the output of the model
    out, h = net(inputs, h)

    # get the character probabilities
    p = F.softmax(out, dim=1).data
    if train_on_gpu:
        p = p.cpu() # move to cpu

    # get top characters
    if top_k is None:
        # consider every character
        top_ch = np.arange(len(net.chars))
    else:
        # keep only the top_k most probable characters
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()

    # sample the next character from the renormalized probabilities
    p = p.numpy().squeeze()
    char = np.random.choice(top_ch, p=p/p.sum())

    # return the predicted character (decoded from its integer) and the hidden state
    return net.int2char[char], h

### Priming and generating text 

Typically you'll want to prime the network so you can build up a hidden state. Otherwise the network will start out generating characters at random. In general, the first few characters will be a little rough, since the network hasn't yet built up a long history of characters to predict from.

In [37]:
def sample(net, size, prime='The', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
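    # First off, run through the prime characters to build up the hidden state
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    # the prediction made after the last prime character is the first generated character
    chars.append(char)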
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [41]:
print(sample(net, 1000, prime='Anna', top_k=5))

Anna had said and was him. "I thought and a some, and that the tortion was his ceesing the
town to some woman to teel the the seriad that to an tardited, and
stept to the thought this tinds of him,
and stind him her tere, and a tortents and, at the tate and harn him had that themsires as himself tryan of sately ser that this, and she was a to any one as that was him, was she had now
steldis the
towards went
of a storled worsttatadions, sone all he said, and they would strange as the course, all still, as had an tine, son, as they seet and his tind to her as a land he
sat
a late, and he were cittity, that," the talking as, and to what already were that the tand to his tine. And a selred a lud and somo all if the titoned
at his samided the doods.
""That she was all
hand. "I was at the compound, they was think thread that terry of they's a this at the decuncilished at it a thind of that any and starttant that's a chonged to say of the chicded wife was
his any him, to
him when a tas he wou

## Loading a checkpoint

In [47]:
# Here we load the checkpoint saved above, `rnn_x_epoch.net`
with open('rnn_x_epoch.net', 'rb') as f:
    checkpoint = torch.load(f)
    
loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [49]:
# Sample using a loaded model
print(sample(loaded, 2000, top_k=5, prime="And Levin said"))

And Levin said
though he could some when he said
all the mothing her, had not to him. He cut his tan tertens a man the selred
along husnatitade of her tinger,
there could that the sended, at the wimped was the whole the detalred the told to
her ask of a tind of a cossensicanding aste with the compees teer and was, had nothing of his tast too have said he cins had an treast at her
and her stice.

"I won to was a chicd of the shome to her a latal he took
that's they said was, stenlited. If a tine her hers and so there tone shere and tall he she to somether with stoming, her sher of a still a sheed of
his watenens when shy has
see him, as the thitting of his arrov to the droves as the dalking a sat, with the did not how well, his sterring it all that that
he was a little wife his the toon himself, and with the same alont
well, to all into the ceeses. "I had that he had she ciltlile ast to and had the warl, with the tasted on the condeed, was
and a shanding and that a tarled with
the stirc

In [50]:
# The notes below on state_dict for models and optimizers are included because the checkpoint above
# saves the model's state_dict and uses it to load and recreate the saved model.
# The code snippets below came from https://pytorch.org/tutorials/recipes/recipes/what_is_state_dict.html

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in net.state_dict():
    print(param_tensor, "\t", net.state_dict()[param_tensor].size())

print()

# Print optimizer's state_dict (a freshly created optimizer has taken no steps, so its `state` is empty)
opt = torch.optim.Adam(net.parameters(), lr=learning_rate)
print("Optimizer's state_dict:")
for var_name in opt.state_dict():
    print(var_name, "\t", opt.state_dict()[var_name])

Model's state_dict:
lstm.weight_ih_l0 	 torch.Size([256, 83])
lstm.weight_hh_l0 	 torch.Size([256, 64])
lstm.bias_ih_l0 	 torch.Size([256])
lstm.bias_hh_l0 	 torch.Size([256])
lstm.weight_ih_l1 	 torch.Size([256, 64])
lstm.weight_hh_l1 	 torch.Size([256, 64])
lstm.bias_ih_l1 	 torch.Size([256])
lstm.bias_hh_l1 	 torch.Size([256])
fc1.weight 	 torch.Size([83, 64])
fc1.bias 	 torch.Size([83])

Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.01, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]
