# Modeling Sequential Data Using Recurrent Neural Networks

In this chapter, we will explore **RNNs** and see their application in modeling sequential data and a specific subset of sequential data---time-series data.  We will cover the following topics:
- Introducing Sequential Data
- RNNs for modeling sequences
- Long Short-Term Memory (LSTM)
- Truncated Backpropagation through time (TBPTT)
- Implementing a multilayer RNN for sequence modeling in TensorFlow
- Project one: RNN sentiment analysis of the IMDb movie review dataset
- Project two: RNN character-level language modeling with LSTM cells, using text data from Shakespeare's Hamlet
- Using gradient clipping to avoid exploding gradients

## Modeling sequential data---Order Matters

What sets sequences apart from other data types is that the elements in a sequence appear in a certain order and are not independent of each other.  Order matters.

## Representing Sequences

Throughout this chapter, we will represent sequences as $\left( \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(T)} \right)$.  The superscript indices indicate the order of the instances, and the length of the sequence is $T$.

The following figure shows an example of time-series data where both $x$'s and $y$'s naturally follow the order according to their time axis; therefore, both $x$'s and $y$'s are sequences:

<img src="images/16_01.png" style="width:500px">

RNNs are designed for modeling sequences and are capable of remembering past information and processing new events accordingly.

## The different categories of sequence modeling

Sequence modeling has many fascinating applications, such as language translation, image captioning, and text generation.

We need to understand the different types of sequence modeling tasks to develop an appropriate model.  The following figure shows several different relationship categories of input and output data:
<img src="images/16_02.png" style="width:500px">

If either the input or output data is a sequence, the data will form one of the following three different categories:
- **Many-to-One**: The input data is a sequence, but the output is a fixed-size vector, not a sequence.  For example, in sentiment analysis, the input is text-based and the output is a class label.
- **One-to-Many**: The input data is in standard format, not a sequence, but the output is a sequence.  An example of this category is image captioning---the input is an image; the output is an English phrase.
- **Many-to-Many**: Both the input and the output arrays are sequences.  This category can be further divided based on whether the input and output are synchronized or not.  An example of a **synchronized** many-to-many modeling task is video classification, where each frame in a video is labeled.  An example of a **delayed** many-to-many modeling task is translating one language into another.  For instance, an entire English sentence must be read and processed by a machine before producing its translation into German.

## RNNs for modeling sequences

In this section, now that we understand sequences, we can look at the foundations of RNNs.  We'll start by introducing the typical structure of an RNN, and we'll see how the data flows through it with one or more hidden layers.  We'll then examine how the neuron activations are computed in a typical RNN.  This will create a context for us to discuss the common challenges in training RNNs, and explore the modern solution to these challenges---LSTM.

## Understanding the structure and flow of an RNN

<img src="images/16_03.png" style="width:500px">
In a standard feedforward network, information flows from the input to the hidden layer, and then from the hidden layer to the output layer.  On the other hand, in a recurrent network, the hidden layer gets its input from both the input layer and the hidden layer from the previous time step.

The flow of information in adjacent time steps in the hidden layer allows the network to have a memory of past events.  This flow of information is usually displayed as a loop, also known as a recurrent edge in graph notation, which is how this general architecture got its name.

In the following figure, the single hidden layer network and the multilayer network illustrate two contrasting architectures:

<img src="images/16_04.png" style="width:500px">
In order to examine the architecture of RNNs and the flow of information, a compact representation with a recurrent edge can be unfolded, which you can see in the preceding figure.

As we know, each hidden unit in a standard neural network receives only one input---the net preactivation associated with the input layer.  Now, in contrast, each hidden unit in an RNN receives two distinct sets of input---the preactivation from the input layer and the activation of the same hidden layer from the previous time step $(t-1)$.

## Computing activations in an RNN

Now that we understand the structure and general flow of information in an RNN, let's get more specific and compute the actual activations of the hidden layers as well as the output layer.  For simplicity, we'll consider just a single hidden layer; however, the same concept applies to multilayer RNNs.

Each directed edge (the connections between boxes) in the representation of an RNN that we just looked at is associated with a weight matrix.  Those weights do not depend on time $t$; therefore, they are shared across the time axis.  The different weight matrices in a single layer RNN are as follows:
- $\mathbf{W}_{xh}$: The weight matrix between the input $\mathbf{x}^{(t)}$ and the hidden layer $\mathbf{h}$.
- $\mathbf{W}_{hh}$: The weight matrix associated with the recurrent edge.
- $\mathbf{W}_{hy}$: The weight matrix between the hidden layer and the output layer.

You can see these weight matrices in the following figure:

<img src="images/16_05.png" style="width:500px">

In certain implementations, you may observe that the weight matrices $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ are concatenated to a combined matrix $\mathbf{W}_h = \left[ \mathbf{W}_{xh}; \mathbf{W}_{hh} \right]$.  Later on, we will make use of this notation as well.

Computing the activations is very similar to standard multilayer perceptrons and other types of feedforward neural networks.  For the hidden layer, the net input $\mathbf{z}_{h}$ (preactivation) is:
$$
\mathbf{z}_h^{(t)} = \mathbf{W}_{xh}\mathbf{x}^{(t)} + \mathbf{W}_{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h
$$

Then the activations of the hidden units at the time step $t$ are calculated as follows:
$$
\mathbf{h}^{(t)} = \phi_h\left( \mathbf{W}_{xh}\mathbf{x}^{(t)} + \mathbf{W}_{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h \right)
$$
where $\mathbf{b}_h$ is the bias vector for the hidden units, and $\phi_h(\cdot)$ is the activation function of the hidden layer.

The activations of the output units will be computed as follows:
$$
\mathbf{y}^{(t)} = \phi_y\left( \mathbf{W}_{hy}\mathbf{h}^{(t)} + \mathbf{b}_y \right)
$$
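
The three equations above can be traced in a few lines of NumPy.  This is only a minimal sketch: the layer sizes are arbitrary, $\phi_h$ is assumed to be tanh, $\phi_y$ is assumed to be the identity, and the helper name `rnn_forward` is made up for illustration.  The last lines check that the concatenated formulation, with $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ placed side by side, gives the same hidden activation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5   # illustrative sizes

W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hy = rng.normal(size=(n_out, n_hidden))
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def rnn_forward(xs, h0):
    """Return the hidden activations and outputs for every time step."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        z_h = W_xh @ x_t + W_hh @ h + b_h   # net input z_h^(t)
        h = np.tanh(z_h)                    # phi_h assumed to be tanh
        y = W_hy @ h + b_y                  # phi_y assumed to be the identity
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

xs = rng.normal(size=(T, n_in))
hs, ys = rnn_forward(xs, np.zeros(n_hidden))

# Concatenated formulation: W_h applied to the stacked vector [x; h]
W_h = np.hstack([W_xh, W_hh])
h1_concat = np.tanh(W_h @ np.concatenate([xs[1], hs[0]]) + b_h)
```

Note that the same three weight matrices are reused at every time step, which is what "shared across the time axis" means in practice.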

To help clarify this further, the following figure shows the process of computing these activations with both formulations:

<img src="images/16_06.png" style="width:500px">

## The challenges of learning long-range interactions

Backpropagation through time, or BPTT, which we briefly mentioned in the previous information box, introduces some new challenges.

In computing the gradients of a loss function, the so-called **vanishing** or **exploding** gradient problem arises.  This problem is explained through the examples in the following figure, which shows an RNN with only one hidden unit for simplicity:
<img src="images/16_07.png" style="width:500px">
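
To get a feel for the numbers, consider the simplest possible case: a single hidden unit with a linear activation (an assumption made only for this sketch).  The gradient that flows back over $k$ time steps is then scaled by the recurrent weight raised to the power $k$.  The helper name `gradient_factor` is made up for illustration.

```python
def gradient_factor(w_hh, steps):
    """Factor that scales a gradient after flowing back `steps` time steps
    through one linear hidden unit with recurrent weight w_hh."""
    return w_hh ** steps

for w in (0.9, 1.0, 1.1):
    factors = [gradient_factor(w, k) for k in (1, 10, 100)]
    print("w_hh = {}: {}".format(w, factors))
```

A weight slightly below 1 shrinks the signal toward zero over 100 steps, a weight slightly above 1 blows it up, and only a weight of exactly 1 leaves it unchanged.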

In practice, there are two solutions to this problem:
- Truncated Backpropagation through time (TBPTT)
- Long short-term memory (LSTM)

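TBPTT limits how many time steps the error signal is backpropagated through after each forward pass, and it is commonly paired with clipping gradients that exceed a given threshold.  While this can solve the exploding gradient problem, the truncation limits the number of steps that the gradient can effectively flow back and properly update the weights.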
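
The thresholding idea mentioned above is often realized by rescaling gradients whose norm is too large.  The following is a minimal sketch under the assumption that the gradients are available as a flat list of floats; `clip_by_norm` is a helper written for this illustration, not a call into any library.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is rescaled to norm 1
```

Rescaling keeps the direction of the update and only limits its size, which is why it helps with exploding gradients but does nothing for vanishing ones.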

On the other hand, LSTM has been more successful in modeling long-range sequences by overcoming the vanishing gradient problem.  Let us discuss the LSTM in more detail:

## LSTM units

The building block of an LSTM is a memory cell which essentially represents the hidden layer.

In each memory cell, there is a recurrent edge that has the desirable weight $w=1$.  The values associated with this recurrent edge are called the **cell state**.  The unfolded structure of a modern LSTM is shown in the following figure:

<img src="images/16_18.png" style="width:500px">

Notice that the cell state from the previous time step, $\mathbf{C}^{(t-1)}$, is modified to get the cell state at the current time step, $\mathbf{C}^{(t)}$, without being multiplied directly with any weight factor.

Notice that $\odot$ refers to the element-wise product, while $\oplus$ means element-wise summation.

In the figure, four boxes are indicated with an activation function, either the sigmoid function ($\sigma$) or the hyperbolic tangent (tanh), and a set of weights.  The units of computation with sigmoid activation functions, whose outputs are passed through $\odot$, are called **gates**.

In an LSTM cell, there are 3 different types of gates, known as the forget gate, the input gate, and the output gate:
- The **forget gate** ($\mathbf{f}_t$) allows the memory cell to reset the cell state, without growing indefinitely.  In fact, the forget gate decides which information is allowed to go through and which information to suppress.
$$
\mathbf{f}_t = \sigma\left( \mathbf{W}_{xf}\mathbf{x}^{(t)} + \mathbf{W}_{hf}\mathbf{h}^{(t-1)} + \mathbf{b}_f \right)
$$
- The input gate ($\mathbf{i}_t$) and the input node ($\mathbf{g}_t$) are responsible for updating the cell state.  They are computed as follows:
$$
\mathbf{i}_t = \sigma\left( \mathbf{W}_{xi}\mathbf{x}^{(t)} + \mathbf{W}_{hi}\mathbf{h}^{(t-1)} + \mathbf{b}_i \right)
$$
$$
\mathbf{g}_t = \tanh\left( \mathbf{W}_{xg}\mathbf{x}^{(t)} + \mathbf{W}_{hg}\mathbf{h}^{(t-1)} + \mathbf{b}_g \right)
$$

The cell state at time $t$ is then computed as follows:
$$
\mathbf{C}^{(t)} = \left( \mathbf{C}^{(t-1)}\odot\mathbf{f}_t \right)\oplus\left( \mathbf{i}_t\odot\mathbf{g}_t \right)
$$
- The output gate $\left( \mathbf{o}_t \right)$ decides how to update the values of the hidden units:
$$
\mathbf{o}_t = \sigma\left( \mathbf{W}_{xo}\mathbf{x}^{(t)} + \mathbf{W}_{ho}\mathbf{h}^{(t-1)} + \mathbf{b}_o \right)
$$
Given this, the hidden units at the current time step are computed as follows:
$$
\mathbf{h}^{(t)} = \mathbf{o}_{t}\odot\tanh\left( \mathbf{C}^{(t)} \right)
$$
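
Putting the gate equations together, one LSTM time step can be sketched in NumPy as follows.  This is an illustration rather than a reference implementation: the sizes are arbitrary, the helper names `sigmoid` and `lstm_step` and the dictionary layout of the weights are made up, and the output gate is taken to be a sigmoid like the other two gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W[k] = (W_xk, W_hk) and b[k] = b_k for k in 'figo'."""
    def pre(k):
        W_x, W_h = W[k]
        return W_x @ x_t + W_h @ h_prev + b[k]
    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    g_t = np.tanh(pre("g"))            # input node
    C_t = C_prev * f_t + i_t * g_t     # new cell state
    o_t = sigmoid(pre("o"))            # output gate (sigmoid assumed)
    h_t = o_t * np.tanh(C_t)           # new hidden activations
    return h_t, C_t

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
W = {k: (rng.normal(size=(n_hidden, n_in)),
         rng.normal(size=(n_hidden, n_hidden))) for k in "figo"}
b = {k: np.zeros(n_hidden) for k in "figo"}

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_in)):   # a toy sequence of 6 time steps
    h, C = lstm_step(x_t, h, C, W, b)
```

The cell state is only ever scaled element-wise by the forget gate and added to, which mirrors the recurrent edge with weight 1 described above.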

## Implementing a multilayer RNN for sequence modeling in TensorFlow

Now that we have introduced the underlying theory behind RNNs, we are ready to move on to the more practical part and implement RNNs in TensorFlow.  During the rest of this chapter, we will apply RNNs to 2 common problem tasks:
1. Sentiment analysis
2. Language modeling

## Project one---performing sentiment analysis of IMDb movie reviews using multilayer RNNs

In this section and the following subsections, we will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.

In the next section, we will implement a many-to-many RNN for an application of language modeling.

## Preparing the data

In the preprocessing steps in _Chapter 8_, we created a clean dataset named `movie_data.csv`, which we'll use again now.  So let us import the necessary modules and read the data into a pandas `DataFrame`, as follows:

In [1]:
import pyprind
import pandas as pd
from string import punctuation
import re
import numpy as np

In [2]:
df = pd.read_csv("movie_data.csv", encoding="utf-8")

Recall that this `df` data frame has 2 columns, named `'review'` and `'sentiment'`, where `'review'` contains the text of movie reviews and `'sentiment'` contains the `0` or `1` labels.  The text component of these movie reviews are sequences of words; therefore, we want to build an RNN model to process the words in each sequence, and at the end, classify the entire sequence to `0` or `1` classes.

To prepare the data for input to a neural network, we need to encode it into numeric values.  To do this, we first find the unique words in the dataset, which can be done using sets in Python.  However, using sets for such a large dataset is _not_ efficient.  A more efficient way is to use `Counter` from the `collections` package.  The documentation for `Counter` can be found at https://docs.python.org/3/library/collections.html#collections.Counter

In the following code, we will define a `counts` object from the `Counter` class that collects the counts of occurrence of each unique word in the text.  Note that in this particular application (and in contrast to the bag-of-words model), we are only interested in the set of unique words, and won't require the word counts, which are created as a side product.

Then, we create a mapping in the form of a dictionary that maps each unique word in our dataset to a unique integer number.  We call this dictionary `word_to_int`, which can be used to convert the entire text of a review into a list of numbers.  The unique words are sorted based on their counts, but any arbitrary order can be used without affecting the final results.  This process of converting a text into a list of numbers is performed using the following code:

In [3]:
## Preprocessing the data: Separate the words and count each word's occurrence

from collections import Counter

counts = Counter()
pbar = pyprind.ProgBar(len(df["review"]), title="Counting words occurrences")
for i, review in enumerate(df["review"]):
    text = "".join([c if c not in punctuation else " "+c+" " for c in review]).lower()
    df.loc[i, "review"] = text
    pbar.update()
    counts.update(text.split())

Counting words occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:05:37


In [4]:
print(type(counts))
print(counts.get("the"))

<class 'collections.Counter'>
667950


In [9]:
## Create a mapping:  Map each unique word to an integer
word_counts = sorted(counts, key=counts.get, reverse=True)  # descending order
print(word_counts[:10])
word_to_int = {word: j for j, word in enumerate(word_counts, 1)}
print(word_to_int)

['the', '.', ',', 'and', 'a', 'of', 'to', "'", '/', 'is']


In [6]:
a = [word for word in df.loc[1, "review"].split()]
# print(a)
# print(df.loc[1, "review"])

In [7]:
mapped_reviews = []
pbar = pyprind.ProgBar(len(df['review']), title="Map reviews to ints")
for review in df["review"]:
    mapped_reviews.append([word_to_int[word] for word in review.split()])
    pbar.update()

Map reviews to ints
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


In [8]:
print(mapped_reviews[:2])

[[15, 5646, 3, 1, 2160, 3977, 26959, 30, 4824, 1575, 29, 1133, 7, 1, 316, 19, 720, 1612, 6, 6653, 753, 3, 14781, 3, 8680, 2, 33, 1, 14782, 328, 3, 3612, 6, 2105, 3, 67, 22, 1947, 15, 1, 8755, 6, 54, 339, 4, 54, 602, 4712, 15343, 2, 1732, 19, 121, 170, 323, 3, 1, 565, 953, 25748, 30, 1464, 19622, 29, 3, 47, 10, 5, 1121, 1099, 1362, 18, 58, 3040, 15, 6002, 25, 44343, 15, 716, 2, 1544, 2, 5290, 3192, 4, 1670, 7, 21518, 3, 1103, 7, 3921, 1, 434, 26, 37, 1956, 1801, 2333, 30, 3679, 3294, 29, 26, 1, 1285, 6, 507, 5, 286, 2, 1, 5463, 10558, 4, 92, 34, 2568, 106, 3, 27, 26, 1, 1457, 6, 1, 5056, 1362, 1320, 8833, 30, 623, 12704, 29, 18, 22, 15, 2762, 6, 1, 3488, 15, 1, 1358, 8, 21, 3, 44, 1873, 1, 1795, 4, 5, 5000, 6, 661, 4, 308, 7, 999, 1, 602, 2, 12, 13, 9, 11, 12, 13, 9, 11, 20, 602, 15, 14781, 20, 10, 5, 62, 255, 24, 3, 26, 1, 307, 76, 6, 5, 602, 6, 5, 3521, 170, 174, 260, 18, 22, 2373, 45, 5, 3041, 2160, 654, 412, 22, 5, 3129, 2, 1, 976, 4, 994, 244, 347, 79, 2413, 7, 999, 1, 602, 25, 65,
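
To see what this mapping does on a small scale, here is the same idea applied to two made-up sentences.  The `toy_` names and the sentences themselves are invented for illustration and are not part of the IMDb data.

```python
from collections import Counter

toy_reviews = ["the movie was good", "the plot was bad , the acting was good"]

toy_counts = Counter()
for review in toy_reviews:
    toy_counts.update(review.split())

# The most frequent word gets index 1; index 0 stays free for padding later on
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_word_to_int = {word: j for j, word in enumerate(toy_vocab, 1)}

toy_mapped = [[toy_word_to_int[w] for w in r.split()] for r in toy_reviews]
print(toy_word_to_int)
print(toy_mapped)
```

Starting the enumeration at 1 is a deliberate choice: it reserves 0 as a value that never stands for a real word.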

There is however one issue that we still need to solve---the sequences currently have different lengths.  In order to generate input data that is compatible with our RNN architecture, we will need to make sure that all the sequences have the same length.

For this purpose, we define a parameter called `sequence_length` that we set to 200.  Sequences that have fewer than 200 words will be left-padded with zeros.  Conversely, sequences that are longer than 200 words are cut such that only the last 200 corresponding words will be used.  We can implement this preprocessing step in two steps:
1. Create a matrix of zeros, where each row corresponds to a sequence of size 200.
2. Fill the index of words in each sequence from the right-hand side of the matrix.  Thus, if a sequence has a length of 150, the first 50 elements of the corresponding row will stay zero.

These 2 steps are shown in the following figure, for a small example with 8 sequences of sizes 4, 12, 8, 11, 7, 3, 10, 13.

<img src="images/16_09.png" style="width:500px">

Note that `sequence_length` is, in fact, a hyperparameter, and can be tuned for optimal performance.  We encourage you to fine-tune this parameter with different values, such as 50, 100, 200, 250, and 300.

Check out the following code for the implementation of these steps to create sequences of the same length:

In [10]:
# Define same-length sequences.
# If the sequence length < 200:  left-pad with zeros.
# If the sequence length > 200:  use the LAST 200 elements.

sequence_length = 200  # known as T in our RNN formulas.
sequences = np.zeros((len(mapped_reviews), sequence_length), dtype=int)

for i, row in enumerate(mapped_reviews):
    review_arr = np.array(row)
    sequences[i, -1*len(row):] = review_arr[-1*sequence_length:]
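
The same padding logic can be checked on a tiny example.  The sketch below wraps it in a hypothetical helper, `pad_or_truncate`, and adds a guard for empty rows, which is an extra precaution for this illustration and is not needed for the review data used here.

```python
import numpy as np

def pad_or_truncate(rows, seq_len):
    """Left-pad short rows with zeros; keep only the last seq_len items of long rows."""
    out = np.zeros((len(rows), seq_len), dtype=int)
    for i, row in enumerate(rows):
        tail = np.array(row[-seq_len:], dtype=int)
        if len(tail) > 0:               # an empty row simply stays all zeros
            out[i, -len(tail):] = tail
    return out

toy = pad_or_truncate([[7, 8], [1, 2, 3, 4, 5, 6], []], seq_len=4)
print(toy)
```

The short row ends up right-aligned behind two zeros, and the long row keeps only its last four entries.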

After we preprocess the dataset, we can proceed with splitting the data into separate training and test sets.  Since the dataset was already shuffled, we can simply take the first half of the dataset for training, and the second half for testing, as follows:

In [11]:
X_train = sequences[:25000, :]
y_train = df["sentiment"].values[:25000]  # positional slice: exactly 25,000 labels (.loc[:25000] would include one extra row)
X_test = sequences[25000:, :]
y_test = df.loc[25000:, "sentiment"].values

Now, if we want to separate the dataset for cross-validation, we can further split the second half of the data to generate a smaller test set and a validation set for hyperparameter optimization.

Finally, we define a helper function that breaks a given dataset (which could be a training set or test set) into chunks, and returns a generator to iterate through these chunks (also known as mini-batches):

In [31]:
np.random.seed(123)  # for reproducibility

# Define a function to generate mini-batches:
def create_batch_generator(x, y=None, batch_size=64):
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    if y is not None:
        y = y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii:ii+batch_size], y[ii:ii+batch_size]
        else:
            yield x[ii:ii+batch_size]

Using generators, as we've done in this code, is a very useful technique for handling memory limitations.  This is the recommended approach for splitting the dataset into mini-batches for training a neural network, rather than creating all the data splits upfront and keeping them in memory during training.
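
One detail of this helper is easy to miss: samples that do not fill a complete mini-batch are dropped.  The toy run below makes that visible.  The function body is repeated only so that the snippet runs on its own, and the `toy_` data is made up.

```python
def create_batch_generator(x, y=None, batch_size=64):
    # Same logic as the helper above, repeated so this snippet is self-contained
    n_batches = len(x) // batch_size
    x = x[:n_batches * batch_size]
    if y is not None:
        y = y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii:ii + batch_size], y[ii:ii + batch_size]
        else:
            yield x[ii:ii + batch_size]

toy_x = list(range(10))
toy_y = [v % 2 for v in toy_x]
batches = list(create_batch_generator(toy_x, toy_y, batch_size=4))

for batch_x, batch_y in batches:
    print(batch_x, batch_y)   # the last two samples (8 and 9) never appear
```

Every batch therefore has exactly `batch_size` rows, which matters later because the placeholders in the model are declared with a fixed batch size.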

## Embedding

During the data preparation in the previous step, we generated sequences of the same length.  The elements of these sequences were integer numbers that corresponded to the _indices_ of unique words.

These word indices can be converted into input features in several different ways.  One naïve way is to apply one-hot encoding to convert indices into vectors of zeros and ones.  Then each word will be mapped to a vector whose size is the number of unique words in the entire dataset.  But a model trained on such features may suffer from the **curse of dimensionality**.  Furthermore, these features are very sparse, since all are zero except one.

A more elegant way is to map each word to a vector of fixed size with real-valued elements (not necessarily integers).  In contrast to the one-hot encoded vectors, we can use finite-sized vectors to represent an infinite number of real numbers (in theory, we can extract infinitely many real numbers from a given interval, for example $[-1,1]$).

This is the idea behind embedding, which is a feature-learning technique to automatically learn the salient features to represent the words in our dataset.  Given the number of unique words _unique\_words_, we can choose the size of the embedding vectors to be much smaller than the number of unique words to represent the entire vocabulary.

The advantages of embedding over one-hot encoding are as follows:
- A reduction in the dimensionality of the feature space to decrease the effect of the curse of dimensionality.
- The extraction of salient features since the embedding layer in a neural network is trainable.

The following schematic representation shows how embedding works by mapping vocabulary indices to a trainable embedding matrix:

<img src="images/16_10.png" style="width:500px">

TensorFlow implements an efficient function, `tf.nn.embedding_lookup`, that maps each integer that corresponds to a unique word to a row of this trainable matrix.  Because the lookup is zero-based, integer 0 (the padding value) is mapped to the first row, integer 1 to the second row, and so on.  Then, given a sequence of integers, such as $[0, 5, 3, 4, 19, 2, \ldots]$, we need to look up the corresponding rows for each element of this sequence.
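
Conceptually, the lookup is plain row indexing.  The sketch below uses NumPy as a stand-in for the TensorFlow operation, with toy sizes and a random matrix in place of a trained one, and also shows that the result matches multiplying one-hot vectors by the embedding matrix.  NumPy indexing is zero-based, so index 0 selects the first row.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embedding_size = 6, 3          # toy sizes
embedding = rng.uniform(-1, 1, size=(n_words, embedding_size))

ids = np.array([0, 5, 3, 4, 2])         # a toy sequence of word indices
embedded = embedding[ids]               # row lookup, shape (5, 3)

# The same result via one-hot vectors times the embedding matrix
one_hot = np.eye(n_words)[ids]
embedded_via_one_hot = one_hot @ embedding
```

The lookup avoids ever building the sparse one-hot vectors, which is where the efficiency comes from.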

Now, let us see how we can create an embedding layer in practice.  If we have `tf_x` as the input layer where the corresponding vocabulary indices are fed with type `tf.int32`, then creating an embedding layer can be done in two steps, as follows:
1. We start by creating a matrix of size $[\mathrm{n\_words}\times\mathrm{embedding\_size}]$ as a tensor variable, which we call embedding, and we initialise its elements randomly with floats between $[-1, 1]$:
```python
embedding = tf.Variable(tf.random_uniform(shape=(n_words, embedding_size), minval=-1, maxval=1))
```
2. Then, we use the `tf.nn.embedding_lookup` function to look up the row in the embedding matrix associated with each element of `tf_x`:
```python
embed_x = tf.nn.embedding_lookup(embedding, tf_x)
```

Notice that `tf.nn.embedding_lookup` requires 2 mandatory arguments: the embedding tensor, and the lookup IDs.

## Building an RNN model

Now we're ready to build an RNN model.  We'll implement a `SentimentRNN` class that has the following methods:
- A constructor to set all the model parameters, and then create a computation graph, and call the `self.build` method to build the multilayer RNN model.
- A `build` method that declares three placeholders for input data, input labels, and the keep probability for the dropout configuration of the hidden layer.  After declaring these, it creates an embedding layer, and builds the multilayer RNN using the embedded representation as the input.
- A `train` method that creates a TensorFlow session for launching the computation graph, iterates through the mini-batches of data, and runs for a fixed number of epochs, to minimise the cost function defined in the graph.  This method also saves the model every 10 epochs for checkpointing.
- A `predict` method that creates a new session, restores the last checkpoint saved during the training process, and carries out the predictions for the test data.

In the following code, we'll see the implementation of this class and its methods broken into separate code sections.

## The SentimentRNN class Constructor

Let us start with the constructor of our `SentimentRNN` class, which we'll code as follows:

In [32]:
import tensorflow as tf

class SentimentRNN(object):
    def __init__(self, n_words, seq_len=200, lstm_size=256, num_layers=1,
                batch_size=64, learning_rate=0.0001, embed_size=200):
        self.n_words = n_words
        self.seq_len = seq_len
        self.lstm_size = lstm_size  # the number of hidden units
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embed_size = embed_size
        
        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build()
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()
    
    """Please refer to the Markdown cell below: 'The build method'"""
    def build(self):
        # Define the placeholders
        tf_x = tf.placeholder(tf.int32, shape=(self.batch_size, self.seq_len), name="tf_x")
        tf_y = tf.placeholder(tf.float32, shape=(self.batch_size,), name="tf_y")
        tf_keepprob = tf.placeholder(tf.float32, name="tf_keepprob")
        
        # Creating the embedding layer
        embedding = tf.Variable(tf.random_uniform((self.n_words, self.embed_size), minval=-1, maxval=1),
                               name="embedding")
        embed_x = tf.nn.embedding_lookup(embedding, tf_x, name='embedded_x')
        
        # Define the LSTM cells (tf.nn.rnn_cell.LSTMCell), wrap each in dropout, and stack them
        cells = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(
        self.lstm_size), output_keep_prob=tf_keepprob) for i in range(self.num_layers)])
        
        # Define the initial state:
        self.initial_state = cells.zero_state(self.batch_size, tf.float32)
        print('  << initial state >>  ', self.initial_state)
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(cells, embed_x, initial_state=self.initial_state)
        
        # Note: lstm_outputs shape = [batch_size, max_time, cells.output_size]
        print("\n << lstm_output >> ", lstm_outputs)
        print("\n << final state >> ", self.final_state)
        
        logits = tf.layers.dense(inputs=lstm_outputs[:, -1], units=1, activation=None, name="logits")
        
        logits = tf.squeeze(logits, name="logits_squeezed")
        print("\n << logits >> ", logits)
        
        y_proba = tf.nn.sigmoid(logits, name="probabilities")
        predictions = {"probabilities": y_proba,
                      "labels": tf.cast(tf.round(y_proba), tf.int32, name="labels")}
        print("\n << predictions >> ", predictions)
        
        # Define the cost function
        cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=tf_y, logits=logits), name="cost")
        
        # Define the optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.minimize(cost, name="train_op")
    
    def train(self, X_train, y_train, num_epochs):
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            iteration = 1
            for epoch in range(num_epochs):
                state = sess.run(self.initial_state)
                
                for batch_x, batch_y in create_batch_generator(X_train, y_train, self.batch_size):
                    feed = {"tf_x:0": batch_x, "tf_y:0": batch_y, "tf_keepprob:0": 0.5, self.initial_state: state}
                    loss, _, state = sess.run(["cost:0", "train_op", self.final_state], feed_dict=feed)
                    
                    if not iteration % 20:
                        print("Epoch: {}/{} Iteration: {} | Train loss: {:.5f}".format(epoch + 1,
                                                                                       num_epochs,
                                                                                       iteration,
                                                                                       loss))
                    iteration += 1
                if not (epoch + 1) % 10:
                    self.saver.save(sess, "model/sentiment-{}.ckpt".format(epoch))
    
    def predict(self, X_data, return_proba=False):
        preds = []
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(sess, tf.train.latest_checkpoint("./model/"))
            test_state = sess.run(self.initial_state)
            for ii, batch_x in enumerate(create_batch_generator(X_data, None, batch_size=self.batch_size), 1):
                feed = {"tf_x:0": batch_x, "tf_keepprob:0": 1.0, self.initial_state: test_state}
                if return_proba:
                    pred, test_state = sess.run(["probabilities:0", self.final_state], feed_dict=feed)
                else:
                    pred, test_state = sess.run(["labels:0", self.final_state], feed_dict=feed)
                
                preds.append(pred)
        
        return np.concatenate(preds)

- `n_words` must be set equal to the number of unique words (plus 1 since we use zero to fill sequences whose size is less than 200).  It is used while creating the embedding layer along with the `embed_size` hyperparameter.
- `seq_len` variable must be set according to the length of the sequences that were created in the preprocessing steps we went through previously.
- `lstm_size` is another hyperparameter that determines the number of hidden units in each RNN layer.

## The build method

Next, let us discuss the `build` method for our `SentimentRNN` class.  This is the longest and most critical method of the class, so we'll go through it in plenty of detail.  The full code appears in the class definition above; here we'll analyze each of its main parts.

So first of all in our `build` method here, we created three placeholders, namely `tf_x`, `tf_y`, and `tf_keepprob`, which we need for feeding the input data.  Then we added the embedding layer, which builds the embedded representation `embed_x`, as we discussed earlier.

Next, in our `build` method, we built the RNN network with LSTM cells.  We did this in three steps:
1. First, we defined the multilayer RNN cells.
2. Next, we defined the initial state for these cells.
3. Finally, we created an RNN specified by the RNN cells and their initial states.

Let's break these three steps out in detail in the following three sections, so we can examine in depth how we built the RNN network in our `build` method.

### Step 1---Defining multilayer RNN cells.

To examine how we coded our `build` method to build the RNN network, the first step was to define our multilayer RNN cells.

Fortunately, TensorFlow has a very nice wrapper class to define LSTM cells, the `LSTMCell` class (a simpler `BasicLSTMCell` variant also exists), which can be stacked together to form a multilayer RNN using the `MultiRNNCell` wrapper class.  The process of stacking RNN cells with a dropout has three nested steps; these three nested steps can be described from the inside out as follows:
1. First, create the RNN cells using `tf.nn.rnn_cell.LSTMCell`, as in the code above.
2. Apply the dropout to the RNN cells using `tf.contrib.rnn.DropoutWrapper`.
3. Make a list of such cells according to the desired number of RNN layers, and pass this list to `tf.contrib.rnn.MultiRNNCell`.

In our `build` method code, this list is created using Python list comprehension.  Note that for a single layer, this list has only one cell.

### Step 2---Defining the initial states for the RNN cells.

The second step that our `build` method takes to build the RNN network is to define the initial states for the RNN cells.

You will recall from the architecture of LSTM cells, there are 3 types of inputs in an LSTM cell---input data $\mathbf{x}^{(t)}$, activations of hidden units from the previous time step $\mathbf{h}^{(t-1)}$, and the cell state from the previous time step $\mathbf{C}^{(t-1)}$.

Therefore, in our `build` method implementation, $\mathbf{x}^{(t)}$ is the embedded `embed_x` data tensor.  When we start processing a new input sequence, we initialize the cell states to zero state; then after each time step we need to store the updated state of the cells to use for the next time step.

Once our multilayer RNN object is defined (`cells` in our implementation), we define its initial state in our `build` method using the `cells.zero_state` method.

### Step 3---Creating the RNN using the RNN cells and their states.

The third step to creating the RNN in our `build` method uses the `tf.nn.dynamic_rnn` function to pull together all our components.

The `tf.nn.dynamic_rnn` function therefore pulls the embedded data, the RNN cells, and their initial states, and creates a pipeline for them according to the unrolled architecture of LSTM cells.

The `tf.nn.dynamic_rnn` function returns a tuple containing the activations of the RNN cells, `outputs`, and their final states, `state`.  The output is a three-dimensional tensor with the shape `(batch_size, num_steps, lstm_size)`.  We pass `outputs` to a fully connected layer to get logits, and we store the final state to use as the initial state of the next mini-batch of data.
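The shapes involved can be illustrated without TensorFlow.  The sketch below unrolls a plain tanh RNN in NumPy, which is a deliberate simplification and not an LSTM; the function `unroll_simple_rnn` and all weight names are hypothetical.  It only aims to show how per-step activations stack into a `(batch_size, num_steps, hidden_size)` array and how the final state relates to the last time step.

```python
import numpy as np


def unroll_simple_rnn(x, h0, W_xh, W_hh, b_h):
    """Unroll a plain tanh RNN over time (a simplification, not an LSTM).

    x:  array of shape (batch_size, num_steps, input_size)
    h0: array of shape (batch_size, hidden_size)
    """
    h = h0
    outputs = []
    for t in range(x.shape[1]):
        # The new hidden state depends on the current input and the previous state
        h = np.tanh(x[:, t, :] @ W_xh + h @ W_hh + b_h)
        outputs.append(h)
    # Stack along the time axis: (batch_size, num_steps, hidden_size)
    return np.stack(outputs, axis=1), h


rng = np.random.RandomState(123)
batch_size, num_steps, input_size, hidden_size = 4, 6, 3, 5
x = rng.randn(batch_size, num_steps, input_size)
h0 = np.zeros((batch_size, hidden_size))
W_xh = rng.randn(input_size, hidden_size) * 0.1
W_hh = rng.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)

outputs, final_state = unroll_simple_rnn(x, h0, W_xh, W_hh, b_h)
print(outputs.shape)      # (4, 6, 5)
print(final_state.shape)  # (4, 5)
```

In this simplified setting, the final state is the same as the output at the last time step, `outputs[:, -1, :]`.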

Finally, in our `build` method, after setting up the RNN components of the network, the cost function and optimization schemes can be defined like any other neural network.

## The train method

The next method in our `SentimentRNN` class is `train`.  This method is quite similar to the train methods we created in _Chapter 14_ and _Chapter 15_, except that we have an additional tensor, `state`, that we feed into our network.

The code above shows the implementation of the `train` method...

In this implementation of our `train` method, at the beginning of each epoch, we start from the zero states of RNN cells as our current state.  Running each mini-batch of data is performed by feeding the current state along with the data `batch_x` and their labels `batch_y`.  Upon finishing the execution of a mini-batch, we update the state to be the final state, which is returned by the `tf.nn.dynamic_rnn` function.  This updated state will be used toward execution of the next mini-batch.  This process is repeated and the current state is updated throughout the epoch.
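The state-passing pattern described here can be sketched in plain Python.  In the toy example below, `toy_step` is a hypothetical stand-in for running one mini-batch through the network; its update rule is arbitrary and chosen only so that the numbers are easy to follow.

```python
import numpy as np


def toy_step(batch_x, state):
    """Hypothetical stand-in for one training step: returns (loss, new_state)."""
    new_state = 0.5 * state + batch_x.mean(axis=1, keepdims=True)
    loss = float(np.mean(new_state ** 2))
    return loss, new_state


def run_epochs(batches, num_epochs, batch_size):
    history = []
    for epoch in range(num_epochs):
        # Start every epoch from the zero state
        state = np.zeros((batch_size, 1))
        for batch_x in batches:
            # The final state of this mini-batch becomes the
            # initial state of the next one
            loss, state = toy_step(batch_x, state)
            history.append(loss)
    return history, state


batches = [np.ones((2, 3)), np.ones((2, 3))]
history, last_state = run_epochs(batches, num_epochs=2, batch_size=2)
print(history)  # [1.0, 2.25, 1.0, 2.25]
```

The loss pattern repeats in the second epoch because the state is reset to zero at the start of each epoch, while within an epoch the state carries over from one mini-batch to the next.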

## The predict method

Finally, the last method in our `SentimentRNN` class is the `predict` method, which keeps updating the current state similar to the `train` method, shown in the code above...

## Instantiating the SentimentRNN class

We've now coded and examined all four parts of our `SentimentRNN` class, which were the class constructor, the `build` method, the `train` method, and the `predict` method.

We are now ready to create an object of the class `SentimentRNN`, with parameters as follows:

In [33]:
n_words = max(list(word_to_int.values())) + 1

rnn = SentimentRNN(n_words=n_words,
                  seq_len=sequence_length,
                  embed_size=256,
                  lstm_size=128,
                  num_layers=1,
                  batch_size=100,
                  learning_rate=0.001)

  << initial state >>   (LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/DropoutWrapperZeroState/LSTMCellZeroState/zeros:0' shape=(100, 128) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/DropoutWrapperZeroState/LSTMCellZeroState/zeros_1:0' shape=(100, 128) dtype=float32>),)

 << lstm_output >>  Tensor("rnn/transpose_1:0", shape=(100, 200, 128), dtype=float32)

 << final state >>  (LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_3:0' shape=(100, 128) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_4:0' shape=(100, 128) dtype=float32>),)

 << logits >>  Tensor("logits_squeezed:0", shape=(100,), dtype=float32)

 << predictions >>  {'probabilities': <tf.Tensor 'probabilities:0' shape=(100,) dtype=float32>, 'labels': <tf.Tensor 'labels:0' shape=(100,) dtype=int32>}


Notice here that we use `num_layers=1` to use a single RNN layer, although our implementation allows us to create multilayer RNNs by setting `num_layers` greater than 1.  Here we should consider the small size of our dataset: a single RNN layer may generalize better to unseen data, since it is less likely to overfit the training data.

## Training and optimizing the sentiment analysis RNN model

Next, we can train the RNN model by calling the `rnn.train` function.  In the following code, we train the model for `40` epochs, using the input from `X_train` and the corresponding class labels stored in `y_train`:

In [34]:
rnn.train(X_train, y_train, num_epochs=40)

Epoch: 1/40 Iteration: 20 | Train loss: 0.70071
Epoch: 1/40 Iteration: 40 | Train loss: 0.63536
Epoch: 1/40 Iteration: 60 | Train loss: 0.67273
Epoch: 1/40 Iteration: 80 | Train loss: 0.57852
Epoch: 1/40 Iteration: 100 | Train loss: 0.55505
Epoch: 1/40 Iteration: 120 | Train loss: 0.53757
Epoch: 1/40 Iteration: 140 | Train loss: 0.59300
Epoch: 1/40 Iteration: 160 | Train loss: 0.48948
Epoch: 1/40 Iteration: 180 | Train loss: 0.52614
Epoch: 1/40 Iteration: 200 | Train loss: 0.55652
Epoch: 1/40 Iteration: 220 | Train loss: 0.40346
Epoch: 1/40 Iteration: 240 | Train loss: 0.46770
Epoch: 2/40 Iteration: 260 | Train loss: 0.49523
Epoch: 2/40 Iteration: 280 | Train loss: 0.43171
Epoch: 2/40 Iteration: 300 | Train loss: 0.40109
Epoch: 2/40 Iteration: 320 | Train loss: 0.43319
Epoch: 2/40 Iteration: 340 | Train loss: 0.45512
Epoch: 2/40 Iteration: 360 | Train loss: 0.30557
Epoch: 2/40 Iteration: 380 | Train loss: 0.41766
Epoch: 2/40 Iteration: 400 | Train loss: 0.36661
Epoch: 2/40 Iteration: 4

Epoch: 14/40 Iteration: 3300 | Train loss: 0.00135
Epoch: 14/40 Iteration: 3320 | Train loss: 0.00058
Epoch: 14/40 Iteration: 3340 | Train loss: 0.04257
Epoch: 14/40 Iteration: 3360 | Train loss: 0.00193
Epoch: 14/40 Iteration: 3380 | Train loss: 0.05744
Epoch: 14/40 Iteration: 3400 | Train loss: 0.00159
Epoch: 14/40 Iteration: 3420 | Train loss: 0.06358
Epoch: 14/40 Iteration: 3440 | Train loss: 0.03425
Epoch: 14/40 Iteration: 3460 | Train loss: 0.05493
Epoch: 14/40 Iteration: 3480 | Train loss: 0.01376
Epoch: 14/40 Iteration: 3500 | Train loss: 0.00796
Epoch: 15/40 Iteration: 3520 | Train loss: 0.02557
Epoch: 15/40 Iteration: 3540 | Train loss: 0.00461
Epoch: 15/40 Iteration: 3560 | Train loss: 0.10629
Epoch: 15/40 Iteration: 3580 | Train loss: 0.04537
Epoch: 15/40 Iteration: 3600 | Train loss: 0.02648
Epoch: 15/40 Iteration: 3620 | Train loss: 0.04706
Epoch: 15/40 Iteration: 3640 | Train loss: 0.00507
Epoch: 15/40 Iteration: 3660 | Train loss: 0.01458
Epoch: 15/40 Iteration: 3680 | 

Epoch: 27/40 Iteration: 6520 | Train loss: 0.00005
Epoch: 27/40 Iteration: 6540 | Train loss: 0.00002
Epoch: 27/40 Iteration: 6560 | Train loss: 0.00008
Epoch: 27/40 Iteration: 6580 | Train loss: 0.00005
Epoch: 27/40 Iteration: 6600 | Train loss: 0.00001
Epoch: 27/40 Iteration: 6620 | Train loss: 0.00046
Epoch: 27/40 Iteration: 6640 | Train loss: 0.00001
Epoch: 27/40 Iteration: 6660 | Train loss: 0.00008
Epoch: 27/40 Iteration: 6680 | Train loss: 0.00003
Epoch: 27/40 Iteration: 6700 | Train loss: 0.00002
Epoch: 27/40 Iteration: 6720 | Train loss: 0.00001
Epoch: 27/40 Iteration: 6740 | Train loss: 0.00003
Epoch: 28/40 Iteration: 6760 | Train loss: 0.00002
Epoch: 28/40 Iteration: 6780 | Train loss: 0.00009
Epoch: 28/40 Iteration: 6800 | Train loss: 0.00001
Epoch: 28/40 Iteration: 6820 | Train loss: 0.00001
Epoch: 28/40 Iteration: 6840 | Train loss: 0.00010
Epoch: 28/40 Iteration: 6860 | Train loss: 0.00003
Epoch: 28/40 Iteration: 6880 | Train loss: 0.00009
Epoch: 28/40 Iteration: 6900 | 

Epoch: 39/40 Iteration: 9740 | Train loss: 0.00182
Epoch: 40/40 Iteration: 9760 | Train loss: 0.00098
Epoch: 40/40 Iteration: 9780 | Train loss: 0.00105
Epoch: 40/40 Iteration: 9800 | Train loss: 0.00419
Epoch: 40/40 Iteration: 9820 | Train loss: 0.00281
Epoch: 40/40 Iteration: 9840 | Train loss: 0.00177
Epoch: 40/40 Iteration: 9860 | Train loss: 0.00334
Epoch: 40/40 Iteration: 9880 | Train loss: 0.01384
Epoch: 40/40 Iteration: 9900 | Train loss: 0.00191
Epoch: 40/40 Iteration: 9920 | Train loss: 0.00464
Epoch: 40/40 Iteration: 9940 | Train loss: 0.00070
Epoch: 40/40 Iteration: 9960 | Train loss: 0.00126
Epoch: 40/40 Iteration: 9980 | Train loss: 0.00028
Epoch: 40/40 Iteration: 10000 | Train loss: 0.00082


The trained model is saved using TensorFlow's checkpointing system, which we discussed in Chapter 14.  Now we can use the trained model for predicting the class labels on the test set, as follows:

In [35]:
preds = rnn.predict(X_test)
y_true = y_test[:len(preds)]
print("Test accuracy:  {:.3f}.".format(np.sum(preds == y_true) / len(y_true)))

INFO:tensorflow:Restoring parameters from ./model/sentiment-39.ckpt
Test accuracy:  0.853.


The result will show an accuracy of 85 percent.  Given the small size of this dataset, this is comparable to the test prediction accuracy obtained in Chapter 8.

We can optimize this further by changing the hyperparameters of the model, such as `lstm_size`, `seq_len`, and `embed_size`, to achieve better generalization performance.  However, for hyperparameter tuning, it is recommended that we create a separate validation set and that we don't repeatedly use the test set for evaluation, to avoid introducing bias through test data leakage, which we discussed in Chapter 6.
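One simple way to hold out a validation set is to shuffle the training indices and set a fraction aside.  The following is a minimal sketch under assumptions: the helper `train_valid_split`, the 20 percent fraction, and the demo arrays are all hypothetical, and other splitting strategies (for example, stratified splits) may be preferable in practice.

```python
import numpy as np


def train_valid_split(X, y, valid_fraction=0.2, seed=123):
    """Shuffle the row indices and hold out a fraction for validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    n_valid = int(len(X) * valid_fraction)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]


# Small demo arrays standing in for the real training data
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

X_tr, y_tr, X_va, y_va = train_valid_split(X_demo, y_demo)
print(X_tr.shape, X_va.shape)  # (8, 2) (2, 2)
```

Hyperparameters would then be compared on the validation portion, leaving the test set for a single final evaluation.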

Also, if you're interested in the prediction probabilities on the test set rather than the class labels, then you can set `return_proba=True` as follows:

In [36]:
proba = rnn.predict(X_test, return_proba=True)
print(proba)

INFO:tensorflow:Restoring parameters from ./model/sentiment-39.ckpt
[5.57116550e-07 9.99999046e-01 1.85828488e-02 ... 2.16272201e-06
 1.14251845e-04 9.95571196e-01]
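If class labels are needed from such probabilities, they can be thresholded.  The sketch below assumes a cutoff of 0.5, which is an assumption made here for illustration, and uses the six values visible in the printed output as a hypothetical array `proba_demo`.

```python
import numpy as np

# The six probabilities visible in the printed output above
proba_demo = np.array([5.57116550e-07, 9.99999046e-01, 1.85828488e-02,
                       2.16272201e-06, 1.14251845e-04, 9.95571196e-01])

# Assumed cutoff: probabilities of at least 0.5 map to class 1
labels_demo = (proba_demo >= 0.5).astype(np.int32)
print(labels_demo)  # [0 1 0 0 0 1]
```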


So this was our first RNN model for sentiment analysis.  We'll now go further and create an RNN for character-by-character language modeling in TensorFlow, as another popular application of sequence modeling.

## Project two---Implementing an RNN for character-level language modeling in TensorFlow

Language modeling is a fascinating application that enables machines to perform human-language-related tasks, such as generating English sentences.  One of the most interesting efforts in this area is the work done by Sutskever, Martens, and Hinton (_Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens, and Geoffrey E. Hinton, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011_ https://pdfs.semanticscholar.org/93c2/0e38c85b69fc2d2eb314b3c1217913f7db11.pdf).

In the model that we'll build now, the input is a text document, and our goal is to develop a model that can generate new text similar to the input document.  Examples of such input include a book or a computer program in a specific programming language.

In character-level language modeling, the input is broken down into a sequence of characters that are fed into our network one character at a time.  The network will process each new character in conjunction with the memory of the previously seen characters to predict the next character.  The following figure shows an example of character-level language modeling.

<img src="images/16_11.png" style="width:500px">

We can break this implementation down into three separate steps---preparing the data, building the RNN model, and performing next-character prediction and sampling to generate new text.

If you recall from the previous sections of this chapter, we mentioned the exploding gradient problem.  In this application, we'll also get a chance to play with a gradient clipping technique to avoid this exploding gradient problem.

## Preparing the Data

In this section, we prepare the data for character-level language modeling.

To get the input data, visit the Project Gutenberg website at https://www.gutenberg.org/, which provides thousands of free e-books.  For our example, we can get the book _The Tragedie of Hamlet_ by William Shakespeare in plain text format from http://www.gutenberg.org/cache/epub/2265/pg2265.txt.

Note that this link will directly take you to the download page.  If you are using macOS or a Linux operating system, you can download the file with the following command in the Terminal:  
`curl https://www.gutenberg.org/cache/epub/2265/pg2265.txt > pg2265.txt`

If this resource becomes unavailable in the future, a copy of this text is also included in this chapter's code directory in the book's code repository at https://github.com/rasbt/python-machine-learning-book-2nd-edition.

Once we have some data, we can read it into a Python session as plain text.  In the following code, the Python variable `chars` represents the set of unique characters observed in this text.  We then create a dictionary that maps each character to an integer, `char2int`, and a dictionary that performs the reverse mapping, that is, mapping integers to those unique characters---`int2char`.  Using the `char2int` dictionary, we convert the text into a NumPy array of integers.  The following figure shows an example of converting characters into integers and the reverse for the words `"Hello"` and `"world"`:

<img src="images/16_12.png" style="width:500px">

The code below reads the text from the downloaded file and constructs the dictionaries based on the text.  The commented-out slice shows how the beginning portion of the text, which contains some legal description of Project Gutenberg, can be removed:

In [47]:
import numpy as np

# Reading and processing the text
with open("1257.txt", "r", encoding="utf-8") as f:
    text = f.read()
# text = text[15858:]
chars = sorted(set(text))  # sorted so that the character-to-integer mapping is reproducible across sessions

char2int = {ch: i for i, ch in enumerate(chars)}
int2char = dict(enumerate(chars))
text_ints = np.array([char2int[ch] for ch in text], dtype=np.int32)
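As a small worked example of these mappings, the following sketch encodes and decodes the string `"Hello world"`.  The `demo_` names are hypothetical, and the characters are sorted here only so that the integer assignment is the same on every run.

```python
import numpy as np

demo_text = "Hello world"

# Sorted so that the integer assignment is deterministic
demo_chars = sorted(set(demo_text))
demo_char2int = {ch: i for i, ch in enumerate(demo_chars)}
demo_int2char = dict(enumerate(demo_chars))

# Characters -> integers
encoded = np.array([demo_char2int[ch] for ch in demo_text], dtype=np.int32)
# Integers -> characters (the reverse mapping)
decoded = "".join(demo_int2char[int(i)] for i in encoded)

print(encoded)
print(decoded)
```

Decoding the integer array recovers the original string, which confirms that the two dictionaries are inverses of each other.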

Now, we should reshape the data into batches of sequences, the most important step in preparing the data.  As we know, the goal is to predict the next character, based on the sequence of characters that we have observed so far.  Therefore, we shift the input $(\mathbf{x})$ and output $(\mathbf{y})$ of the neural network by one character.  The following figure shows the preprocessing steps, starting from a text corpus to generating data arrays for $\mathbf{x}$ and $\mathbf{y}$:

<img src="images/16_13.png" style="width:500px">

As you can see in this figure, the training arrays $\mathbf{x}$ and $\mathbf{y}$ have the same shapes or dimensions, where the number of rows is equal to the _batch size_ and the number of columns is equal to _number of batches$\times$number of steps_.

Given the input array `sequence` that contains the integers that correspond to the characters in the text corpus, the following function will generate `x` and `y` with the same structure shown in the previous figure:

In [48]:
def reshape_data(sequence, batch_size, num_steps):
    tot_batch_length = batch_size * num_steps
    num_batches = int(len(sequence) / tot_batch_length)
    if num_batches*tot_batch_length + 1 > len(sequence):
        num_batches = num_batches - 1
    
    # Truncate the sequence at the end to get rid of the remaining characters that do not make a full batch
    x = sequence[0: num_batches*tot_batch_length]
    y = sequence[1: num_batches*tot_batch_length + 1]
    
    # Split x and y into a list of batches of sequences:
    x_batch_splits = np.split(x, batch_size)
    y_batch_splits = np.split(y, batch_size)
    
    # Stack the batches together batch_size x tot_batch_length
    x = np.stack(x_batch_splits)
    y = np.stack(y_batch_splits)
    
    return x, y
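To check the behavior of this reshaping on something small, the sketch below applies a condensed restatement of `reshape_data` to a toy sequence of consecutive integers.  The toy sequence and parameter values are assumptions chosen for illustration; with consecutive integers, the one-character shift shows up as `y == x + 1`.

```python
import numpy as np


def reshape_data(sequence, batch_size, num_steps):
    # Same logic as the function above, condensed so this snippet runs on its own
    tot_batch_length = batch_size * num_steps
    num_batches = len(sequence) // tot_batch_length
    if num_batches * tot_batch_length + 1 > len(sequence):
        num_batches -= 1
    n = num_batches * tot_batch_length
    # y is x shifted by one position in the original sequence
    x = np.stack(np.split(sequence[0:n], batch_size))
    y = np.stack(np.split(sequence[1:n + 1], batch_size))
    return x, y


toy_sequence = np.arange(50)
x, y = reshape_data(toy_sequence, batch_size=2, num_steps=5)
print(x.shape, y.shape)  # (2, 20) (2, 20)
print(x[0, :6])          # [0 1 2 3 4 5]
print(y[0, :6])          # [1 2 3 4 5 6]
```

Here the 50-element sequence is truncated to 40 elements, because one extra element is needed at the end for the shifted target array.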

The next step is to split the arrays $\mathbf{x}$ and $\mathbf{y}$ into mini-batches, where each row is a sequence with length equal to the _number of steps_.  The process of splitting the data array $\mathbf{x}$ is shown in the following figure:

<img src="images/16_14.png" style="width:500px">

In the following code, we define a function named `create_batch_generator` that splits the data arrays $\mathbf{x}$ and $\mathbf{y}$, as shown in the previous figure, and outputs a batch generator.  Later, we will use this generator to iterate through the mini-batches during the training of our network:

In [49]:
def create_batch_generator(data_x, data_y, num_steps):
    batch_size, tot_batch_length = data_x.shape
    num_batches = int(tot_batch_length / num_steps)
    for b in range(num_batches):
        yield (data_x[:, b*num_steps:(b+1)*num_steps],
               data_y[:, b*num_steps:(b+1)*num_steps])
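The generator can be tried out on small arrays to see how the columns are sliced.  In the sketch below, the generator is restated so that the snippet runs on its own, and `toy_x` and `toy_y` are hypothetical arrays with 2 rows and 20 columns.

```python
import numpy as np


def create_batch_generator(data_x, data_y, num_steps):
    # Restated from the definition above so this snippet runs on its own
    batch_size, tot_batch_length = data_x.shape
    num_batches = int(tot_batch_length / num_steps)
    for b in range(num_batches):
        yield (data_x[:, b*num_steps:(b+1)*num_steps],
               data_y[:, b*num_steps:(b+1)*num_steps])


toy_x = np.arange(40).reshape(2, 20)
toy_y = toy_x + 1

batches = list(create_batch_generator(toy_x, toy_y, num_steps=5))
print(len(batches))         # 4
print(batches[0][0].shape)  # (2, 5)
print(batches[1][0])        # second mini-batch of x
```

Each mini-batch keeps all rows and takes the next `num_steps` columns, so consecutive mini-batches continue each row's sequence where the previous one stopped.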

At this point, we've now completed the data preprocessing steps, and we have the data in the proper format.  In the next section, we'll implement the RNN model for character-level language modeling.

## Building a character-level RNN model

To build a character-level neural network, we'll implement a class called `CharRNN` that constructs the graph of the RNN in order to predict the next character, after observing a given sequence of characters.  From the classification perspective, the number of classes is the total number of unique characters that exist in the text corpus.  The `CharRNN` class has four methods, as follows:
- A constructor that sets up the learning parameters, creates a computation graph, and calls the `build` method to construct the graph based on the sampling mode versus the training mode.
- A `build` method that defines the placeholders for feeding the data, constructs the RNN using LSTM cells, and defines the output of the network, the cost function, and the optimizer.
- A `train` method to iterate through the mini-batches, and train the network for the specified number of epochs.
- A `sample` method to start from a given string, calculate the probabilities for the next character, and choose a character randomly according to these probabilities.  This process will be repeated, and the sampled characters will be concatenated together to form a string.  Once the size of this string reaches the specified length, it will return the string.

We'll break these four methods into separate code sections and explain each one.  Note that implementing the RNN part of this model is very similar to the implementation in the _Project one---Performing sentiment analysis of IMDb movie reviews using multilayer RNNs_ section.  So, we'll skip the description of building the RNN components here.

## Some code that needs to be defined now; this is explained at the end.

In this `get_top_char` function, the probabilities are first sorted, then the `top_n` probabilities are passed to the `numpy.random.choice` function to randomly select one out of these top probabilities.  The implementation of the `get_top_char` function is as follows.

Note, of course, that this function should be defined _before_ the definition of the `CharRNN` class.

In [50]:
def get_top_char(probas, char_size, top_n=5):
    p = np.squeeze(probas)
    p[np.argsort(p)[:-top_n]] = 0.0
    p = p / np.sum(p)
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id
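A quick way to see the effect of `top_n` is to sample repeatedly from a fixed probability vector.  In the sketch below, the function is restated so that the snippet runs on its own, and `toy_probas` is a hypothetical vector over six characters.  Note that `np.squeeze` returns a view, so the function zeroes entries of the array it receives; the demo therefore passes a copy on each call.

```python
import numpy as np


def get_top_char(probas, char_size, top_n=5):
    # Restated from the cell above
    p = np.squeeze(probas)
    p[np.argsort(p)[:-top_n]] = 0.0   # zero out everything except the top_n entries
    p = p / np.sum(p)                 # renormalize so the probabilities sum to 1
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id


np.random.seed(123)
toy_probas = np.array([[0.02, 0.40, 0.03, 0.30, 0.05, 0.20]])

# Pass a copy because the function modifies its input in place
sampled = {int(get_top_char(toy_probas.copy(), char_size=6, top_n=3))
           for _ in range(200)}
print(sorted(sampled))
```

With `top_n=3`, only the three most probable indices (1, 3, and 5 in this toy vector) can ever be drawn.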

## The constructor

In contrast to our previous implementation for sentiment analysis, where the same computation graph was used for both training and prediction modes, this time our computation graph is going to be different for the training versus sampling mode.

Therefore, we need to add a new Boolean type argument to the constructor, to determine whether we're building the model for the training mode or the sampling mode.  The following code shows the implementation of the constructor enclosed in the class definition, amongst others...

As we planned earlier, the Boolean `sampling` argument is used to determine whether the instance of `CharRNN` is for building the graph in the training mode (`sampling=False`), or the sampling mode (`sampling=True`).

In addition to the `sampling` argument, we've introduced a new argument called `grad_clip`, which is used for clipping the gradients to avoid the exploding gradient problem that we mentioned earlier.

Then, similar to the previous implementation, the constructor creates a computation graph, sets the graph-level random seed for consistent output, and builds the graph by calling the `build` method.

## The build method

The next method of the `CharRNN` class is `build`, which is very similar to the `build` method in the _Project one---performing sentiment analysis of IMDb movie reviews using multilayer RNNs_ section, except for some minor differences.  The `build` method first defines two local variables, `batch_size`, and `num_steps`, based on the mode, as follows:

$$
\mathrm{in\ sampling\ mode}:
\begin{cases}
    \mathrm{batch\_size} = 1 \\
    \mathrm{num\_steps} = 1
\end{cases}
$$

$$
\mathrm{in\ training\ mode}:
\begin{cases}
    \mathrm{batch\_size} = \mathrm{self.batch\_size} \\
    \mathrm{num\_steps} = \mathrm{self.num\_steps}
\end{cases}
$$

Recall that in the sentiment analysis implementation, we used an embedding layer to create a salient representation for the unique words in the dataset.  In contrast, here we are using the one-hot encoding scheme for both $x$ and $y$ with `depth=num_classes`, where `num_classes` is in fact the total number of unique characters in the text corpus.
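The effect of one-hot encoding on a batch of integer character IDs can be sketched in NumPy by indexing into an identity matrix.  The values of `num_classes` and `batch_ids` below are hypothetical and chosen only to keep the output small.

```python
import numpy as np

num_classes = 5
# Integer character IDs with shape (batch_size, num_steps)
batch_ids = np.array([[0, 2, 4],
                      [1, 1, 3]])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[batch_ids]
print(one_hot.shape)  # (2, 3, 5)
print(one_hot[0, 1])  # [0. 0. 1. 0. 0.]
```

The encoding adds a trailing dimension of size `num_classes`, so a `(batch_size, num_steps)` array of IDs becomes a `(batch_size, num_steps, num_classes)` array.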

Building a multilayer RNN component of the model is exactly the same as in our sentiment analysis implementation, using the `tf.nn.dynamic_rnn` function.  However, `outputs` from the `tf.nn.dynamic_rnn` function is a three-dimensional tensor with this shape---`batch_size, num_steps, lstm_size`.  Next, this tensor will be reshaped into a two-dimensional tensor with the `batch_size*num_steps, lstm_size` shape, which is passed to the `tf.layers.dense` function to make a fully connected layer and obtain `logits` (net inputs).  Finally, the probabilities for the next batch of characters are obtained and the cost function is defined.  In addition, here, we apply gradient clipping using the `tf.clip_by_global_norm` function to avoid the exploding gradient problem.
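The idea behind clipping by global norm can be sketched in NumPy: compute one norm over all gradient arrays together, and if it exceeds the threshold, scale every array by the same factor.  The function `clip_by_global_norm_np` below is a hypothetical re-implementation for illustration, not TensorFlow's code, and the gradient values are made up so that the arithmetic is easy to check.

```python
import numpy as np


def clip_by_global_norm_np(grads, clip_norm):
    """Scale all gradients jointly so their global norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 when the global norm is already within the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm = clip_by_global_norm_np(grads, clip_norm=5.0)

print(norm)  # 13.0, since sqrt(9 + 16 + 144) = 13
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(new_norm)
```

Because all arrays are scaled by the same factor, the direction of the overall gradient is preserved and only its magnitude is reduced.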

## The train method

The next method of the `CharRNN` class is the `train` method, which is very similar to the `train` method described in the _Project one---performing sentiment analysis of IMDb movie reviews using multilayer RNNs_ section.

## The sample method

The final method in our `CharRNN` class is the `sample` method.  The behavior of this `sample` method is similar to that of the `predict` method we implemented in the _Project one---performing sentiment analysis of IMDb movie reviews using multilayer RNNs_ section.  However, the difference here is that we calculate the probabilities for the next character from an observed sequence---`observed_seq`.  Then, these probabilities are passed to a function named `get_top_char`, which randomly selects one character according to the obtained probabilities.

Initially, the observed sequence starts from `starter_seq`, which is provided as an argument.  When new characters are sampled according to their predicted probabilities, they are appended to the observed sequence, and the new observed sequence is used for predicting the next character.

So here, the `sample` method calls the `get_top_char` function to choose a character ID randomly (`ch_id`) according to the obtained probabilities.

In this `get_top_char` function, the probabilities are first sorted, then the `top_n` probabilities are passed to the `numpy.random.choice` function to randomly select one out of these probabilities.  

In [51]:
import tensorflow as tf
import os

class CharRNN(object):
    def __init__(self, num_classes, batch_size=64, num_steps=100, lstm_size=128,
                num_layers=1, learning_rate=0.001, keep_prob=0.5, grad_clip=5, sampling=False):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate
        self.keep_prob = keep_prob
        self.grad_clip = grad_clip
        
        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            
            self.build(sampling=sampling)
            
            self.saver = tf.train.Saver()
            
            self.init_op = tf.global_variables_initializer()
    
    def build(self, sampling):
        if sampling:
            batch_size, num_steps = 1, 1
        else:
            batch_size = self.batch_size
            num_steps = self.num_steps
        
        tf_x = tf.placeholder(tf.int32, shape=[batch_size, num_steps], name="tf_x")
        tf_y = tf.placeholder(tf.int32, shape=[batch_size, num_steps], name="tf_y")
        tf_keepprob = tf.placeholder(tf.float32, name="tf_keepprob")
        
        # One-hot encoding
        x_onehot = tf.one_hot(tf_x, depth=self.num_classes)
        y_onehot = tf.one_hot(tf_y, depth=self.num_classes)
        
        # Build the multi-layer RNN cells
        cells = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(
        self.lstm_size), output_keep_prob=tf_keepprob) for _ in range(self.num_layers)])
        
        # Define the initial state
        self.initial_state = cells.zero_state(batch_size, tf.float32)
        
        # Run each sequence step through the RNN
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(cells, x_onehot, initial_state=self.initial_state)
        
        print("  << lstm_outputs >>", lstm_outputs)
        
        seq_output_reshaped = tf.reshape(lstm_outputs, shape=[-1, self.lstm_size], name="seq_out_reshaped")
        
        logits = tf.layers.dense(inputs=seq_output_reshaped, units=self.num_classes,
                                 activation=None, name="logits")
        
        proba = tf.nn.softmax(logits, name="probabilities")
        
        y_reshaped = tf.reshape(y_onehot, shape=[-1, self.num_classes], name="y_reshaped")
        
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped),
                             name="cost")
        
        # Gradient clipping to avoid "exploding gradient"
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), self.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.apply_gradients(zip(grads, tvars), name="train_op")
    
    def train(self, train_x, train_y, num_epochs, ckpt_dir="./model/"):
        # Create the checkpoint directory if it does not exist
        if not os.path.exists(ckpt_dir):
            os.mkdir(ckpt_dir)
        
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            
            n_batches = int(train_x.shape[1]/self.num_steps)
            iterations = n_batches * num_epochs
            for epoch in range(num_epochs):
                
                # Train the network
                new_state = sess.run(self.initial_state)
                loss = 0
                # Mini-batch generator
                bgen = create_batch_generator(train_x, train_y, self.num_steps)
                for b, (batch_x, batch_y) in enumerate(bgen, 1):
                    iteration = epoch * n_batches + b
                    
                    feed = {"tf_x:0": batch_x, "tf_y:0": batch_y, "tf_keepprob:0": self.keep_prob,
                           self.initial_state: new_state}
                    batch_cost, _, new_state = sess.run(["cost:0", "train_op", self.final_state], feed_dict=feed)
                    if not iteration % 10:
                        print("Epoch {}/{} Iteration {} | Training Loss: {:.4f}.".format(epoch + 1, num_epochs,
                                                                                        iteration, batch_cost))
                
                # Save the trained model
                self.saver.save(sess, os.path.join(ckpt_dir, "language_modeling.ckpt"))
    
    def sample(self, output_length, ckpt_dir, starter_seq="Men "):
        observed_seq = [ch for ch in starter_seq]
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
            
            # 1:  Run the model using the starter seq
            new_state = sess.run(self.initial_state)
            for ch in starter_seq:
                x = np.zeros((1, 1))
                x[0, 0] = char2int[ch]
                feed = {"tf_x:0": x, "tf_keepprob:0": 1.0, self.initial_state: new_state}
                proba, new_state = sess.run(["probabilities:0", self.final_state], feed_dict=feed)
            
            ch_id = get_top_char(proba, len(chars))
            observed_seq.append(int2char[ch_id])
            
            # 2:  Run the model using the updated observed_seq
            for i in range(output_length):
                x[0, 0] = ch_id
                feed = {"tf_x:0": x, "tf_keepprob:0": 1.0, self.initial_state: new_state}
                proba, new_state = sess.run(["probabilities:0", self.final_state], feed_dict=feed)
                
                ch_id = get_top_char(proba, len(chars))
                observed_seq.append(int2char[ch_id])
        return "".join(observed_seq)

## Creating and Training the CharRNN Model

Now we are ready to create an instance of the `CharRNN` class to build the RNN model, and to train it with the following configurations:

In [52]:
batch_size = 64
num_steps = 100
train_x, train_y = reshape_data(text_ints, batch_size, num_steps)

rnn = CharRNN(num_classes=len(chars), batch_size=batch_size)
rnn.train(train_x, train_y, num_epochs=500, ckpt_dir="./model-500/")

  << lstm_outputs >> Tensor("rnn/transpose_1:0", shape=(64, 100, 128), dtype=float32)
Epoch 1/500 Iteration 10 | Training Loss: 3.9287.
Epoch 1/500 Iteration 20 | Training Loss: 3.4434.
Epoch 1/500 Iteration 30 | Training Loss: 3.3455.
Epoch 1/500 Iteration 40 | Training Loss: 3.2452.
Epoch 1/500 Iteration 50 | Training Loss: 3.2319.
Epoch 1/500 Iteration 60 | Training Loss: 3.2077.
Epoch 1/500 Iteration 70 | Training Loss: 3.1810.
Epoch 1/500 Iteration 80 | Training Loss: 3.1811.
Epoch 1/500 Iteration 90 | Training Loss: 3.1635.
Epoch 1/500 Iteration 100 | Training Loss: 3.1477.
Epoch 1/500 Iteration 110 | Training Loss: 3.1109.
Epoch 1/500 Iteration 120 | Training Loss: 3.1414.
Epoch 1/500 Iteration 130 | Training Loss: 3.0823.
Epoch 1/500 Iteration 140 | Training Loss: 3.0749.
Epoch 1/500 Iteration 150 | Training Loss: 3.0285.
Epoch 1/500 Iteration 160 | Training Loss: 3.0779.
Epoch 1/500 Iteration 170 | Training Loss: 2.9796.
Epoch 1/500 Iteration 180 | Training Loss: 2.9406.
Epoch

Epoch 8/500 Iteration 1590 | Training Loss: 2.0520.
Epoch 8/500 Iteration 1600 | Training Loss: 2.0289.
Epoch 8/500 Iteration 1610 | Training Loss: 2.0188.
Epoch 8/500 Iteration 1620 | Training Loss: 2.0283.
Epoch 8/500 Iteration 1630 | Training Loss: 2.0323.
Epoch 9/500 Iteration 1640 | Training Loss: 1.9939.
Epoch 9/500 Iteration 1650 | Training Loss: 1.9372.
Epoch 9/500 Iteration 1660 | Training Loss: 1.9978.
Epoch 9/500 Iteration 1670 | Training Loss: 1.9892.
Epoch 9/500 Iteration 1680 | Training Loss: 2.0404.
Epoch 9/500 Iteration 1690 | Training Loss: 1.9815.
Epoch 9/500 Iteration 1700 | Training Loss: 2.0077.
Epoch 9/500 Iteration 1710 | Training Loss: 2.0754.
Epoch 9/500 Iteration 1720 | Training Loss: 1.9865.
Epoch 9/500 Iteration 1730 | Training Loss: 1.9849.
Epoch 9/500 Iteration 1740 | Training Loss: 1.9847.
Epoch 9/500 Iteration 1750 | Training Loss: 1.9424.
Epoch 9/500 Iteration 1760 | Training Loss: 1.9873.
Epoch 9/500 Iteration 1770 | Training Loss: 1.9932.
Epoch 9/500 

Epoch 16/500 Iteration 3150 | Training Loss: 1.7852.
Epoch 16/500 Iteration 3160 | Training Loss: 1.7855.
Epoch 16/500 Iteration 3170 | Training Loss: 1.7465.
Epoch 16/500 Iteration 3180 | Training Loss: 1.7764.
Epoch 16/500 Iteration 3190 | Training Loss: 1.7605.
Epoch 16/500 Iteration 3200 | Training Loss: 1.7830.
Epoch 16/500 Iteration 3210 | Training Loss: 1.7968.
Epoch 16/500 Iteration 3220 | Training Loss: 1.8351.
Epoch 16/500 Iteration 3230 | Training Loss: 1.7830.
Epoch 16/500 Iteration 3240 | Training Loss: 1.7855.
Epoch 16/500 Iteration 3250 | Training Loss: 1.8077.
Epoch 16/500 Iteration 3260 | Training Loss: 1.8112.
Epoch 17/500 Iteration 3270 | Training Loss: 1.8045.
Epoch 17/500 Iteration 3280 | Training Loss: 1.7947.
Epoch 17/500 Iteration 3290 | Training Loss: 1.7890.
Epoch 17/500 Iteration 3300 | Training Loss: 1.7635.
Epoch 17/500 Iteration 3310 | Training Loss: 1.8201.
Epoch 17/500 Iteration 3320 | Training Loss: 1.7686.
Epoch 17/500 Iteration 3330 | Training Loss: 1

Epoch 24/500 Iteration 4700 | Training Loss: 1.6788.
Epoch 24/500 Iteration 4710 | Training Loss: 1.6191.
Epoch 24/500 Iteration 4720 | Training Loss: 1.6954.
Epoch 24/500 Iteration 4730 | Training Loss: 1.6932.
Epoch 24/500 Iteration 4740 | Training Loss: 1.7583.
Epoch 24/500 Iteration 4750 | Training Loss: 1.6929.
Epoch 24/500 Iteration 4760 | Training Loss: 1.7391.
Epoch 24/500 Iteration 4770 | Training Loss: 1.7522.
Epoch 24/500 Iteration 4780 | Training Loss: 1.6975.
Epoch 24/500 Iteration 4790 | Training Loss: 1.6887.
Epoch 24/500 Iteration 4800 | Training Loss: 1.6882.
Epoch 24/500 Iteration 4810 | Training Loss: 1.6324.
Epoch 24/500 Iteration 4820 | Training Loss: 1.7009.
Epoch 24/500 Iteration 4830 | Training Loss: 1.7017.
Epoch 24/500 Iteration 4840 | Training Loss: 1.6450.
Epoch 24/500 Iteration 4850 | Training Loss: 1.6985.
Epoch 24/500 Iteration 4860 | Training Loss: 1.7043.
Epoch 24/500 Iteration 4870 | Training Loss: 1.6890.
...

Epoch 31/500 Iteration 6250 | Training Loss: 1.5915.
Epoch 31/500 Iteration 6260 | Training Loss: 1.6144.
Epoch 31/500 Iteration 6270 | Training Loss: 1.6392.
Epoch 31/500 Iteration 6280 | Training Loss: 1.6704.
Epoch 31/500 Iteration 6290 | Training Loss: 1.6262.
Epoch 31/500 Iteration 6300 | Training Loss: 1.6363.
Epoch 31/500 Iteration 6310 | Training Loss: 1.6296.
Epoch 31/500 Iteration 6320 | Training Loss: 1.6295.
Epoch 32/500 Iteration 6330 | Training Loss: 1.6538.
Epoch 32/500 Iteration 6340 | Training Loss: 1.6223.
Epoch 32/500 Iteration 6350 | Training Loss: 1.6233.
Epoch 32/500 Iteration 6360 | Training Loss: 1.5982.
Epoch 32/500 Iteration 6370 | Training Loss: 1.6420.
Epoch 32/500 Iteration 6380 | Training Loss: 1.5985.
Epoch 32/500 Iteration 6390 | Training Loss: 1.6369.
Epoch 32/500 Iteration 6400 | Training Loss: 1.6757.
Epoch 32/500 Iteration 6410 | Training Loss: 1.6202.
Epoch 32/500 Iteration 6420 | Training Loss: 1.6067.
...

Epoch 39/500 Iteration 7800 | Training Loss: 1.6525.
Epoch 39/500 Iteration 7810 | Training Loss: 1.5801.
Epoch 39/500 Iteration 7820 | Training Loss: 1.6209.
Epoch 39/500 Iteration 7830 | Training Loss: 1.6291.
Epoch 39/500 Iteration 7840 | Training Loss: 1.5935.
Epoch 39/500 Iteration 7850 | Training Loss: 1.5782.
Epoch 39/500 Iteration 7860 | Training Loss: 1.5907.
Epoch 39/500 Iteration 7870 | Training Loss: 1.5186.
Epoch 39/500 Iteration 7880 | Training Loss: 1.5910.
Epoch 39/500 Iteration 7890 | Training Loss: 1.5947.
Epoch 39/500 Iteration 7900 | Training Loss: 1.5325.
Epoch 39/500 Iteration 7910 | Training Loss: 1.5824.
Epoch 39/500 Iteration 7920 | Training Loss: 1.5940.
Epoch 39/500 Iteration 7930 | Training Loss: 1.5832.
Epoch 39/500 Iteration 7940 | Training Loss: 1.5601.
Epoch 39/500 Iteration 7950 | Training Loss: 1.5769.
Epoch 40/500 Iteration 7960 | Training Loss: 1.6248.
Epoch 40/500 Iteration 7970 | Training Loss: 1.5819.
...

Epoch 46/500 Iteration 9350 | Training Loss: 1.5438.
Epoch 46/500 Iteration 9360 | Training Loss: 1.5596.
Epoch 46/500 Iteration 9370 | Training Loss: 1.5326.
Epoch 46/500 Iteration 9380 | Training Loss: 1.5569.
Epoch 47/500 Iteration 9390 | Training Loss: 1.5738.
Epoch 47/500 Iteration 9400 | Training Loss: 1.5588.
Epoch 47/500 Iteration 9410 | Training Loss: 1.5438.
Epoch 47/500 Iteration 9420 | Training Loss: 1.5274.
Epoch 47/500 Iteration 9430 | Training Loss: 1.5698.
Epoch 47/500 Iteration 9440 | Training Loss: 1.5243.
Epoch 47/500 Iteration 9450 | Training Loss: 1.5461.
Epoch 47/500 Iteration 9460 | Training Loss: 1.5691.
Epoch 47/500 Iteration 9470 | Training Loss: 1.5456.
Epoch 47/500 Iteration 9480 | Training Loss: 1.5358.
Epoch 47/500 Iteration 9490 | Training Loss: 1.5445.
Epoch 47/500 Iteration 9500 | Training Loss: 1.5556.
Epoch 47/500 Iteration 9510 | Training Loss: 1.5417.
Epoch 47/500 Iteration 9520 | Training Loss: 1.5493.
...

Epoch 54/500 Iteration 10880 | Training Loss: 1.5565.
Epoch 54/500 Iteration 10890 | Training Loss: 1.5709.
Epoch 54/500 Iteration 10900 | Training Loss: 1.5451.
Epoch 54/500 Iteration 10910 | Training Loss: 1.5129.
Epoch 54/500 Iteration 10920 | Training Loss: 1.5347.
Epoch 54/500 Iteration 10930 | Training Loss: 1.4765.
Epoch 54/500 Iteration 10940 | Training Loss: 1.5319.
Epoch 54/500 Iteration 10950 | Training Loss: 1.5266.
Epoch 54/500 Iteration 10960 | Training Loss: 1.4643.
Epoch 54/500 Iteration 10970 | Training Loss: 1.5256.
Epoch 54/500 Iteration 10980 | Training Loss: 1.5301.
Epoch 54/500 Iteration 10990 | Training Loss: 1.5257.
Epoch 54/500 Iteration 11000 | Training Loss: 1.5007.
Epoch 54/500 Iteration 11010 | Training Loss: 1.5407.
Epoch 55/500 Iteration 11020 | Training Loss: 1.5887.
Epoch 55/500 Iteration 11030 | Training Loss: 1.5088.
Epoch 55/500 Iteration 11040 | Training Loss: 1.5483.
Epoch 55/500 Iteration 11050 | Training Loss: 1.5164.
...

Epoch 61/500 Iteration 12400 | Training Loss: 1.5350.
Epoch 61/500 Iteration 12410 | Training Loss: 1.5041.
Epoch 61/500 Iteration 12420 | Training Loss: 1.5177.
Epoch 61/500 Iteration 12430 | Training Loss: 1.4938.
Epoch 61/500 Iteration 12440 | Training Loss: 1.5146.
Epoch 62/500 Iteration 12450 | Training Loss: 1.5277.
Epoch 62/500 Iteration 12460 | Training Loss: 1.5025.
Epoch 62/500 Iteration 12470 | Training Loss: 1.4924.
Epoch 62/500 Iteration 12480 | Training Loss: 1.4895.
Epoch 62/500 Iteration 12490 | Training Loss: 1.5097.
Epoch 62/500 Iteration 12500 | Training Loss: 1.4911.
Epoch 62/500 Iteration 12510 | Training Loss: 1.5171.
Epoch 62/500 Iteration 12520 | Training Loss: 1.5392.
Epoch 62/500 Iteration 12530 | Training Loss: 1.5108.
Epoch 62/500 Iteration 12540 | Training Loss: 1.4880.
Epoch 62/500 Iteration 12550 | Training Loss: 1.4910.
Epoch 62/500 Iteration 12560 | Training Loss: 1.5114.
Epoch 62/500 Iteration 12570 | Training Loss: 1.4952.
...

Epoch 69/500 Iteration 13920 | Training Loss: 1.5544.
Epoch 69/500 Iteration 13930 | Training Loss: 1.4934.
Epoch 69/500 Iteration 13940 | Training Loss: 1.5198.
Epoch 69/500 Iteration 13950 | Training Loss: 1.5238.
Epoch 69/500 Iteration 13960 | Training Loss: 1.5031.
Epoch 69/500 Iteration 13970 | Training Loss: 1.4683.
Epoch 69/500 Iteration 13980 | Training Loss: 1.4863.
Epoch 69/500 Iteration 13990 | Training Loss: 1.4249.
Epoch 69/500 Iteration 14000 | Training Loss: 1.4837.
Epoch 69/500 Iteration 14010 | Training Loss: 1.4900.
Epoch 69/500 Iteration 14020 | Training Loss: 1.4444.
Epoch 69/500 Iteration 14030 | Training Loss: 1.4961.
Epoch 69/500 Iteration 14040 | Training Loss: 1.4796.
Epoch 69/500 Iteration 14050 | Training Loss: 1.4957.
Epoch 69/500 Iteration 14060 | Training Loss: 1.4598.
Epoch 69/500 Iteration 14070 | Training Loss: 1.4879.
Epoch 70/500 Iteration 14080 | Training Loss: 1.5521.
Epoch 70/500 Iteration 14090 | Training Loss: 1.4874.
...

Epoch 76/500 Iteration 15440 | Training Loss: 1.4676.
Epoch 76/500 Iteration 15450 | Training Loss: 1.4716.
Epoch 76/500 Iteration 15460 | Training Loss: 1.5241.
Epoch 76/500 Iteration 15470 | Training Loss: 1.4643.
Epoch 76/500 Iteration 15480 | Training Loss: 1.4962.
Epoch 76/500 Iteration 15490 | Training Loss: 1.4681.
Epoch 76/500 Iteration 15500 | Training Loss: 1.4779.
Epoch 77/500 Iteration 15510 | Training Loss: 1.4906.
Epoch 77/500 Iteration 15520 | Training Loss: 1.4874.
Epoch 77/500 Iteration 15530 | Training Loss: 1.4600.
Epoch 77/500 Iteration 15540 | Training Loss: 1.4449.
Epoch 77/500 Iteration 15550 | Training Loss: 1.4745.
Epoch 77/500 Iteration 15560 | Training Loss: 1.4546.
Epoch 77/500 Iteration 15570 | Training Loss: 1.4792.
Epoch 77/500 Iteration 15580 | Training Loss: 1.4979.
Epoch 77/500 Iteration 15590 | Training Loss: 1.4646.
Epoch 77/500 Iteration 15600 | Training Loss: 1.4653.
Epoch 77/500 Iteration 15610 | Training Loss: 1.4596.
...

Epoch 84/500 Iteration 16960 | Training Loss: 1.4614.
Epoch 84/500 Iteration 16970 | Training Loss: 1.4482.
Epoch 84/500 Iteration 16980 | Training Loss: 1.5380.
Epoch 84/500 Iteration 16990 | Training Loss: 1.4577.
Epoch 84/500 Iteration 17000 | Training Loss: 1.4850.
Epoch 84/500 Iteration 17010 | Training Loss: 1.4975.
Epoch 84/500 Iteration 17020 | Training Loss: 1.4914.
Epoch 84/500 Iteration 17030 | Training Loss: 1.4558.
Epoch 84/500 Iteration 17040 | Training Loss: 1.4629.
Epoch 84/500 Iteration 17050 | Training Loss: 1.4034.
Epoch 84/500 Iteration 17060 | Training Loss: 1.4555.
Epoch 84/500 Iteration 17070 | Training Loss: 1.4570.
Epoch 84/500 Iteration 17080 | Training Loss: 1.4042.
Epoch 84/500 Iteration 17090 | Training Loss: 1.4696.
Epoch 84/500 Iteration 17100 | Training Loss: 1.4712.
Epoch 84/500 Iteration 17110 | Training Loss: 1.4625.
Epoch 84/500 Iteration 17120 | Training Loss: 1.4499.
Epoch 84/500 Iteration 17130 | Training Loss: 1.4787.
...

Epoch 91/500 Iteration 18480 | Training Loss: 1.4415.
Epoch 91/500 Iteration 18490 | Training Loss: 1.4405.
Epoch 91/500 Iteration 18500 | Training Loss: 1.4491.
Epoch 91/500 Iteration 18510 | Training Loss: 1.4540.
Epoch 91/500 Iteration 18520 | Training Loss: 1.4852.
Epoch 91/500 Iteration 18530 | Training Loss: 1.4358.
Epoch 91/500 Iteration 18540 | Training Loss: 1.4798.
Epoch 91/500 Iteration 18550 | Training Loss: 1.4486.
Epoch 91/500 Iteration 18560 | Training Loss: 1.4646.
Epoch 92/500 Iteration 18570 | Training Loss: 1.4754.
Epoch 92/500 Iteration 18580 | Training Loss: 1.4690.
Epoch 92/500 Iteration 18590 | Training Loss: 1.4373.
Epoch 92/500 Iteration 18600 | Training Loss: 1.4313.
Epoch 92/500 Iteration 18610 | Training Loss: 1.4530.
Epoch 92/500 Iteration 18620 | Training Loss: 1.4328.
Epoch 92/500 Iteration 18630 | Training Loss: 1.4624.
Epoch 92/500 Iteration 18640 | Training Loss: 1.4772.
Epoch 92/500 Iteration 18650 | Training Loss: 1.4482.
...

Epoch 99/500 Iteration 20000 | Training Loss: 1.4545.
Epoch 99/500 Iteration 20010 | Training Loss: 1.3748.
Epoch 99/500 Iteration 20020 | Training Loss: 1.4558.
Epoch 99/500 Iteration 20030 | Training Loss: 1.4426.
Epoch 99/500 Iteration 20040 | Training Loss: 1.5257.
Epoch 99/500 Iteration 20050 | Training Loss: 1.4409.
Epoch 99/500 Iteration 20060 | Training Loss: 1.4656.
Epoch 99/500 Iteration 20070 | Training Loss: 1.4738.
Epoch 99/500 Iteration 20080 | Training Loss: 1.4452.
Epoch 99/500 Iteration 20090 | Training Loss: 1.4313.
Epoch 99/500 Iteration 20100 | Training Loss: 1.4452.
Epoch 99/500 Iteration 20110 | Training Loss: 1.3882.
Epoch 99/500 Iteration 20120 | Training Loss: 1.4411.
Epoch 99/500 Iteration 20130 | Training Loss: 1.4423.
Epoch 99/500 Iteration 20140 | Training Loss: 1.3864.
Epoch 99/500 Iteration 20150 | Training Loss: 1.4451.
Epoch 99/500 Iteration 20160 | Training Loss: 1.4336.
Epoch 99/500 Iteration 20170 | Training Loss: 1.4396.
...

Epoch 106/500 Iteration 21500 | Training Loss: 1.3946.
Epoch 106/500 Iteration 21510 | Training Loss: 1.4172.
Epoch 106/500 Iteration 21520 | Training Loss: 1.4315.
Epoch 106/500 Iteration 21530 | Training Loss: 1.3739.
Epoch 106/500 Iteration 21540 | Training Loss: 1.4289.
Epoch 106/500 Iteration 21550 | Training Loss: 1.4177.
Epoch 106/500 Iteration 21560 | Training Loss: 1.4241.
Epoch 106/500 Iteration 21570 | Training Loss: 1.4386.
Epoch 106/500 Iteration 21580 | Training Loss: 1.4615.
Epoch 106/500 Iteration 21590 | Training Loss: 1.4324.
Epoch 106/500 Iteration 21600 | Training Loss: 1.4574.
Epoch 106/500 Iteration 21610 | Training Loss: 1.4268.
Epoch 106/500 Iteration 21620 | Training Loss: 1.4488.
Epoch 107/500 Iteration 21630 | Training Loss: 1.4539.
Epoch 107/500 Iteration 21640 | Training Loss: 1.4470.
Epoch 107/500 Iteration 21650 | Training Loss: 1.4135.
Epoch 107/500 Iteration 21660 | Training Loss: 1.4099.
Epoch 107/500 Iteration 21670 | Training Loss: 1.4398.
...

Epoch 113/500 Iteration 22990 | Training Loss: 1.4123.
Epoch 113/500 Iteration 23000 | Training Loss: 1.4061.
Epoch 113/500 Iteration 23010 | Training Loss: 1.4695.
Epoch 113/500 Iteration 23020 | Training Loss: 1.4217.
Epoch 113/500 Iteration 23030 | Training Loss: 1.4433.
Epoch 113/500 Iteration 23040 | Training Loss: 1.4523.
Epoch 113/500 Iteration 23050 | Training Loss: 1.4867.
Epoch 114/500 Iteration 23060 | Training Loss: 1.4299.
Epoch 114/500 Iteration 23070 | Training Loss: 1.3578.
Epoch 114/500 Iteration 23080 | Training Loss: 1.4274.
Epoch 114/500 Iteration 23090 | Training Loss: 1.4200.
Epoch 114/500 Iteration 23100 | Training Loss: 1.5031.
Epoch 114/500 Iteration 23110 | Training Loss: 1.4326.
Epoch 114/500 Iteration 23120 | Training Loss: 1.4516.
Epoch 114/500 Iteration 23130 | Training Loss: 1.4601.
Epoch 114/500 Iteration 23140 | Training Loss: 1.4358.
Epoch 114/500 Iteration 23150 | Training Loss: 1.4269.
Epoch 114/500 Iteration 23160 | Training Loss: 1.4306.
...

Epoch 120/500 Iteration 24480 | Training Loss: 1.4470.
Epoch 121/500 Iteration 24490 | Training Loss: 1.4311.
Epoch 121/500 Iteration 24500 | Training Loss: 1.4062.
Epoch 121/500 Iteration 24510 | Training Loss: 1.3936.
Epoch 121/500 Iteration 24520 | Training Loss: 1.4509.
Epoch 121/500 Iteration 24530 | Training Loss: 1.4128.
Epoch 121/500 Iteration 24540 | Training Loss: 1.4353.
Epoch 121/500 Iteration 24550 | Training Loss: 1.4386.
Epoch 121/500 Iteration 24560 | Training Loss: 1.3898.
Epoch 121/500 Iteration 24570 | Training Loss: 1.4082.
Epoch 121/500 Iteration 24580 | Training Loss: 1.4258.
Epoch 121/500 Iteration 24590 | Training Loss: 1.3690.
Epoch 121/500 Iteration 24600 | Training Loss: 1.4217.
Epoch 121/500 Iteration 24610 | Training Loss: 1.4029.
Epoch 121/500 Iteration 24620 | Training Loss: 1.4262.
Epoch 121/500 Iteration 24630 | Training Loss: 1.4157.
Epoch 121/500 Iteration 24640 | Training Loss: 1.4585.
Epoch 121/500 Iteration 24650 | Training Loss: 1.4179.
...

Epoch 128/500 Iteration 25970 | Training Loss: 1.4003.
Epoch 128/500 Iteration 25980 | Training Loss: 1.3933.
Epoch 128/500 Iteration 25990 | Training Loss: 1.4432.
Epoch 128/500 Iteration 26000 | Training Loss: 1.4100.
Epoch 128/500 Iteration 26010 | Training Loss: 1.3681.
Epoch 128/500 Iteration 26020 | Training Loss: 1.3846.
Epoch 128/500 Iteration 26030 | Training Loss: 1.4062.
Epoch 128/500 Iteration 26040 | Training Loss: 1.3740.
Epoch 128/500 Iteration 26050 | Training Loss: 1.3915.
Epoch 128/500 Iteration 26060 | Training Loss: 1.3858.
Epoch 128/500 Iteration 26070 | Training Loss: 1.4640.
Epoch 128/500 Iteration 26080 | Training Loss: 1.4044.
Epoch 128/500 Iteration 26090 | Training Loss: 1.4270.
Epoch 128/500 Iteration 26100 | Training Loss: 1.4452.
Epoch 128/500 Iteration 26110 | Training Loss: 1.4701.
Epoch 129/500 Iteration 26120 | Training Loss: 1.4114.
Epoch 129/500 Iteration 26130 | Training Loss: 1.3488.
Epoch 129/500 Iteration 26140 | Training Loss: 1.4095.
...

Epoch 135/500 Iteration 27460 | Training Loss: 1.4169.
Epoch 135/500 Iteration 27470 | Training Loss: 1.4104.
Epoch 135/500 Iteration 27480 | Training Loss: 1.4025.
Epoch 135/500 Iteration 27490 | Training Loss: 1.4188.
Epoch 135/500 Iteration 27500 | Training Loss: 1.3998.
Epoch 135/500 Iteration 27510 | Training Loss: 1.4607.
Epoch 135/500 Iteration 27520 | Training Loss: 1.4118.
Epoch 135/500 Iteration 27530 | Training Loss: 1.3937.
Epoch 135/500 Iteration 27540 | Training Loss: 1.4451.
Epoch 136/500 Iteration 27550 | Training Loss: 1.4431.
Epoch 136/500 Iteration 27560 | Training Loss: 1.3947.
Epoch 136/500 Iteration 27570 | Training Loss: 1.3884.
Epoch 136/500 Iteration 27580 | Training Loss: 1.4470.
Epoch 136/500 Iteration 27590 | Training Loss: 1.4079.
Epoch 136/500 Iteration 27600 | Training Loss: 1.4325.
Epoch 136/500 Iteration 27610 | Training Loss: 1.4247.
Epoch 136/500 Iteration 27620 | Training Loss: 1.3603.
Epoch 136/500 Iteration 27630 | Training Loss: 1.3860.
...

Epoch 142/500 Iteration 28950 | Training Loss: 1.4304.
Epoch 142/500 Iteration 28960 | Training Loss: 1.4570.
Epoch 143/500 Iteration 28970 | Training Loss: 1.4376.
Epoch 143/500 Iteration 28980 | Training Loss: 1.4347.
Epoch 143/500 Iteration 28990 | Training Loss: 1.4571.
Epoch 143/500 Iteration 29000 | Training Loss: 1.3738.
Epoch 143/500 Iteration 29010 | Training Loss: 1.4180.
Epoch 143/500 Iteration 29020 | Training Loss: 1.4020.
Epoch 143/500 Iteration 29030 | Training Loss: 1.3937.
Epoch 143/500 Iteration 29040 | Training Loss: 1.3768.
Epoch 143/500 Iteration 29050 | Training Loss: 1.4338.
Epoch 143/500 Iteration 29060 | Training Loss: 1.4047.
Epoch 143/500 Iteration 29070 | Training Loss: 1.3599.
Epoch 143/500 Iteration 29080 | Training Loss: 1.3871.
Epoch 143/500 Iteration 29090 | Training Loss: 1.4027.
Epoch 143/500 Iteration 29100 | Training Loss: 1.3678.
Epoch 143/500 Iteration 29110 | Training Loss: 1.3806.
Epoch 143/500 Iteration 29120 | Training Loss: 1.3702.
...

Epoch 150/500 Iteration 30440 | Training Loss: 1.3893.
Epoch 150/500 Iteration 30450 | Training Loss: 1.4351.
Epoch 150/500 Iteration 30460 | Training Loss: 1.3992.
Epoch 150/500 Iteration 30470 | Training Loss: 1.4051.
Epoch 150/500 Iteration 30480 | Training Loss: 1.3986.
Epoch 150/500 Iteration 30490 | Training Loss: 1.3676.
Epoch 150/500 Iteration 30500 | Training Loss: 1.3847.
Epoch 150/500 Iteration 30510 | Training Loss: 1.3904.
Epoch 150/500 Iteration 30520 | Training Loss: 1.4141.
Epoch 150/500 Iteration 30530 | Training Loss: 1.4160.
Epoch 150/500 Iteration 30540 | Training Loss: 1.3720.
Epoch 150/500 Iteration 30550 | Training Loss: 1.4081.
Epoch 150/500 Iteration 30560 | Training Loss: 1.4032.
Epoch 150/500 Iteration 30570 | Training Loss: 1.4335.
Epoch 150/500 Iteration 30580 | Training Loss: 1.4038.
Epoch 150/500 Iteration 30590 | Training Loss: 1.3782.
Epoch 150/500 Iteration 30600 | Training Loss: 1.4379.
Epoch 151/500 Iteration 30610 | Training Loss: 1.4109.
...

Epoch 157/500 Iteration 31930 | Training Loss: 1.3815.
Epoch 157/500 Iteration 31940 | Training Loss: 1.4216.
Epoch 157/500 Iteration 31950 | Training Loss: 1.3876.
Epoch 157/500 Iteration 31960 | Training Loss: 1.4025.
Epoch 157/500 Iteration 31970 | Training Loss: 1.3584.
Epoch 157/500 Iteration 31980 | Training Loss: 1.4242.
Epoch 157/500 Iteration 31990 | Training Loss: 1.3997.
Epoch 157/500 Iteration 32000 | Training Loss: 1.4240.
Epoch 157/500 Iteration 32010 | Training Loss: 1.4252.
Epoch 157/500 Iteration 32020 | Training Loss: 1.4642.
Epoch 158/500 Iteration 32030 | Training Loss: 1.4338.
Epoch 158/500 Iteration 32040 | Training Loss: 1.4241.
Epoch 158/500 Iteration 32050 | Training Loss: 1.4445.
Epoch 158/500 Iteration 32060 | Training Loss: 1.3794.
Epoch 158/500 Iteration 32070 | Training Loss: 1.3970.
Epoch 158/500 Iteration 32080 | Training Loss: 1.3953.
Epoch 158/500 Iteration 32090 | Training Loss: 1.3867.
Epoch 158/500 Iteration 32100 | Training Loss: 1.3752.
...

Epoch 164/500 Iteration 33420 | Training Loss: 1.3938.
Epoch 164/500 Iteration 33430 | Training Loss: 1.3940.
Epoch 164/500 Iteration 33440 | Training Loss: 1.3750.
Epoch 164/500 Iteration 33450 | Training Loss: 1.4034.
Epoch 165/500 Iteration 33460 | Training Loss: 1.4391.
Epoch 165/500 Iteration 33470 | Training Loss: 1.3865.
Epoch 165/500 Iteration 33480 | Training Loss: 1.4243.
Epoch 165/500 Iteration 33490 | Training Loss: 1.3919.
Epoch 165/500 Iteration 33500 | Training Loss: 1.3836.
Epoch 165/500 Iteration 33510 | Training Loss: 1.4350.
Epoch 165/500 Iteration 33520 | Training Loss: 1.3819.
Epoch 165/500 Iteration 33530 | Training Loss: 1.3891.
Epoch 165/500 Iteration 33540 | Training Loss: 1.3949.
Epoch 165/500 Iteration 33550 | Training Loss: 1.3697.
Epoch 165/500 Iteration 33560 | Training Loss: 1.3726.
Epoch 165/500 Iteration 33570 | Training Loss: 1.3917.
Epoch 165/500 Iteration 33580 | Training Loss: 1.3963.
Epoch 165/500 Iteration 33590 | Training Loss: 1.3961.
...

Epoch 172/500 Iteration 34910 | Training Loss: 1.3766.
Epoch 172/500 Iteration 34920 | Training Loss: 1.3867.
Epoch 172/500 Iteration 34930 | Training Loss: 1.3838.
Epoch 172/500 Iteration 34940 | Training Loss: 1.3814.
Epoch 172/500 Iteration 34950 | Training Loss: 1.3999.
Epoch 172/500 Iteration 34960 | Training Loss: 1.4318.
Epoch 172/500 Iteration 34970 | Training Loss: 1.3930.
Epoch 172/500 Iteration 34980 | Training Loss: 1.3868.
Epoch 172/500 Iteration 34990 | Training Loss: 1.3923.
Epoch 172/500 Iteration 35000 | Training Loss: 1.3957.
Epoch 172/500 Iteration 35010 | Training Loss: 1.3700.
Epoch 172/500 Iteration 35020 | Training Loss: 1.3956.
Epoch 172/500 Iteration 35030 | Training Loss: 1.3593.
Epoch 172/500 Iteration 35040 | Training Loss: 1.4199.
Epoch 172/500 Iteration 35050 | Training Loss: 1.3930.
Epoch 172/500 Iteration 35060 | Training Loss: 1.4151.
Epoch 172/500 Iteration 35070 | Training Loss: 1.4234.
Epoch 172/500 Iteration 35080 | Training Loss: 1.4520.
...

Epoch 179/500 Iteration 36400 | Training Loss: 1.4049.
Epoch 179/500 Iteration 36410 | Training Loss: 1.3825.
Epoch 179/500 Iteration 36420 | Training Loss: 1.3834.
Epoch 179/500 Iteration 36430 | Training Loss: 1.3455.
Epoch 179/500 Iteration 36440 | Training Loss: 1.3932.
Epoch 179/500 Iteration 36450 | Training Loss: 1.3977.
Epoch 179/500 Iteration 36460 | Training Loss: 1.3432.
Epoch 179/500 Iteration 36470 | Training Loss: 1.3983.
Epoch 179/500 Iteration 36480 | Training Loss: 1.3829.
Epoch 179/500 Iteration 36490 | Training Loss: 1.3852.
Epoch 179/500 Iteration 36500 | Training Loss: 1.3839.
Epoch 179/500 Iteration 36510 | Training Loss: 1.4098.
Epoch 180/500 Iteration 36520 | Training Loss: 1.4417.
Epoch 180/500 Iteration 36530 | Training Loss: 1.3904.
Epoch 180/500 Iteration 36540 | Training Loss: 1.4099.
Epoch 180/500 Iteration 36550 | Training Loss: 1.3899.
Epoch 180/500 Iteration 36560 | Training Loss: 1.3746.
Epoch 180/500 Iteration 36570 | Training Loss: 1.4296.
...

Epoch 186/500 Iteration 37890 | Training Loss: 1.3903.
Epoch 186/500 Iteration 37900 | Training Loss: 1.4265.
Epoch 186/500 Iteration 37910 | Training Loss: 1.3825.
Epoch 186/500 Iteration 37920 | Training Loss: 1.4169.
Epoch 186/500 Iteration 37930 | Training Loss: 1.3943.
Epoch 186/500 Iteration 37940 | Training Loss: 1.3985.
Epoch 187/500 Iteration 37950 | Training Loss: 1.3892.
Epoch 187/500 Iteration 37960 | Training Loss: 1.4057.
Epoch 187/500 Iteration 37970 | Training Loss: 1.3794.
Epoch 187/500 Iteration 37980 | Training Loss: 1.3711.
Epoch 187/500 Iteration 37990 | Training Loss: 1.3852.
Epoch 187/500 Iteration 38000 | Training Loss: 1.3669.
Epoch 187/500 Iteration 38010 | Training Loss: 1.3921.
Epoch 187/500 Iteration 38020 | Training Loss: 1.4034.
Epoch 187/500 Iteration 38030 | Training Loss: 1.3866.
Epoch 187/500 Iteration 38040 | Training Loss: 1.3664.
Epoch 187/500 Iteration 38050 | Training Loss: 1.3883.
Epoch 187/500 Iteration 38060 | Training Loss: 1.3864.
...

Epoch 194/500 Iteration 39380 | Training Loss: 1.3955.
Epoch 194/500 Iteration 39390 | Training Loss: 1.3094.
Epoch 194/500 Iteration 39400 | Training Loss: 1.3865.
Epoch 194/500 Iteration 39410 | Training Loss: 1.3695.
Epoch 194/500 Iteration 39420 | Training Loss: 1.4503.
Epoch 194/500 Iteration 39430 | Training Loss: 1.3807.
Epoch 194/500 Iteration 39440 | Training Loss: 1.3976.
Epoch 194/500 Iteration 39450 | Training Loss: 1.4166.
Epoch 194/500 Iteration 39460 | Training Loss: 1.4024.
Epoch 194/500 Iteration 39470 | Training Loss: 1.3845.
Epoch 194/500 Iteration 39480 | Training Loss: 1.3890.
Epoch 194/500 Iteration 39490 | Training Loss: 1.3196.
Epoch 194/500 Iteration 39500 | Training Loss: 1.3859.
Epoch 194/500 Iteration 39510 | Training Loss: 1.3908.
Epoch 194/500 Iteration 39520 | Training Loss: 1.3282.
Epoch 194/500 Iteration 39530 | Training Loss: 1.4031.
Epoch 194/500 Iteration 39540 | Training Loss: 1.3716.
Epoch 194/500 Iteration 39550 | Training Loss: 1.3746.
...

Epoch 201/500 Iteration 40870 | Training Loss: 1.4055.
Epoch 201/500 Iteration 40880 | Training Loss: 1.3483.
Epoch 201/500 Iteration 40890 | Training Loss: 1.3677.
Epoch 201/500 Iteration 40900 | Training Loss: 1.3791.
Epoch 201/500 Iteration 40910 | Training Loss: 1.3265.
Epoch 201/500 Iteration 40920 | Training Loss: 1.3775.
Epoch 201/500 Iteration 40930 | Training Loss: 1.3651.
Epoch 201/500 Iteration 40940 | Training Loss: 1.3841.
Epoch 201/500 Iteration 40950 | Training Loss: 1.3880.
Epoch 201/500 Iteration 40960 | Training Loss: 1.4162.
Epoch 201/500 Iteration 40970 | Training Loss: 1.3755.
Epoch 201/500 Iteration 40980 | Training Loss: 1.4192.
Epoch 201/500 Iteration 40990 | Training Loss: 1.3743.
Epoch 201/500 Iteration 41000 | Training Loss: 1.4017.
Epoch 202/500 Iteration 41010 | Training Loss: 1.3870.
Epoch 202/500 Iteration 41020 | Training Loss: 1.3885.
Epoch 202/500 Iteration 41030 | Training Loss: 1.3655.
Epoch 202/500 Iteration 41040 | Training Loss: 1.3748.
...

Epoch 208/500 Iteration 42360 | Training Loss: 1.3402.
Epoch 208/500 Iteration 42370 | Training Loss: 1.3521.
Epoch 208/500 Iteration 42380 | Training Loss: 1.3556.
Epoch 208/500 Iteration 42390 | Training Loss: 1.4101.
Epoch 208/500 Iteration 42400 | Training Loss: 1.3723.
Epoch 208/500 Iteration 42410 | Training Loss: 1.3933.
Epoch 208/500 Iteration 42420 | Training Loss: 1.3973.
Epoch 208/500 Iteration 42430 | Training Loss: 1.4374.
Epoch 209/500 Iteration 42440 | Training Loss: 1.3925.
Epoch 209/500 Iteration 42450 | Training Loss: 1.3070.
Epoch 209/500 Iteration 42460 | Training Loss: 1.3768.
Epoch 209/500 Iteration 42470 | Training Loss: 1.3823.
Epoch 209/500 Iteration 42480 | Training Loss: 1.4482.
Epoch 209/500 Iteration 42490 | Training Loss: 1.3908.
Epoch 209/500 Iteration 42500 | Training Loss: 1.3914.
Epoch 209/500 Iteration 42510 | Training Loss: 1.4265.
Epoch 209/500 Iteration 42520 | Training Loss: 1.3917.
Epoch 209/500 Iteration 42530 | Training Loss: 1.3700.
...

Epoch 215/500 Iteration 43850 | Training Loss: 1.3520.
Epoch 215/500 Iteration 43860 | Training Loss: 1.4152.
Epoch 216/500 Iteration 43870 | Training Loss: 1.3958.
Epoch 216/500 Iteration 43880 | Training Loss: 1.3665.
Epoch 216/500 Iteration 43890 | Training Loss: 1.3404.
Epoch 216/500 Iteration 43900 | Training Loss: 1.4167.
Epoch 216/500 Iteration 43910 | Training Loss: 1.3937.
Epoch 216/500 Iteration 43920 | Training Loss: 1.3937.
Epoch 216/500 Iteration 43930 | Training Loss: 1.3905.
Epoch 216/500 Iteration 43940 | Training Loss: 1.3462.
Epoch 216/500 Iteration 43950 | Training Loss: 1.3722.
Epoch 216/500 Iteration 43960 | Training Loss: 1.3713.
Epoch 216/500 Iteration 43970 | Training Loss: 1.3419.
Epoch 216/500 Iteration 43980 | Training Loss: 1.3687.
Epoch 216/500 Iteration 43990 | Training Loss: 1.3661.
Epoch 216/500 Iteration 44000 | Training Loss: 1.3720.
Epoch 216/500 Iteration 44010 | Training Loss: 1.3848.
Epoch 216/500 Iteration 44020 | Training Loss: 1.4093.
...

Epoch 223/500 Iteration 45340 | Training Loss: 1.3723.
Epoch 223/500 Iteration 45350 | Training Loss: 1.3553.
Epoch 223/500 Iteration 45360 | Training Loss: 1.3607.
Epoch 223/500 Iteration 45370 | Training Loss: 1.3993.
Epoch 223/500 Iteration 45380 | Training Loss: 1.3765.
Epoch 223/500 Iteration 45390 | Training Loss: 1.3145.
Epoch 223/500 Iteration 45400 | Training Loss: 1.3422.
Epoch 223/500 Iteration 45410 | Training Loss: 1.3602.
Epoch 223/500 Iteration 45420 | Training Loss: 1.3309.
Epoch 223/500 Iteration 45430 | Training Loss: 1.3430.
Epoch 223/500 Iteration 45440 | Training Loss: 1.3402.
Epoch 223/500 Iteration 45450 | Training Loss: 1.4008.
Epoch 223/500 Iteration 45460 | Training Loss: 1.3694.
Epoch 223/500 Iteration 45470 | Training Loss: 1.4008.
Epoch 223/500 Iteration 45480 | Training Loss: 1.4116.
Epoch 223/500 Iteration 45490 | Training Loss: 1.4218.
Epoch 224/500 Iteration 45500 | Training Loss: 1.3886.
Epoch 224/500 Iteration 45510 | Training Loss: 1.3160.
...

Epoch 230/500 Iteration 46830 | Training Loss: 1.3558.
Epoch 230/500 Iteration 46840 | Training Loss: 1.3799.
Epoch 230/500 Iteration 46850 | Training Loss: 1.3782.
Epoch 230/500 Iteration 46860 | Training Loss: 1.3545.
Epoch 230/500 Iteration 46870 | Training Loss: 1.3699.
Epoch 230/500 Iteration 46880 | Training Loss: 1.3769.
Epoch 230/500 Iteration 46890 | Training Loss: 1.4061.
Epoch 230/500 Iteration 46900 | Training Loss: 1.3605.
Epoch 230/500 Iteration 46910 | Training Loss: 1.3474.
Epoch 230/500 Iteration 46920 | Training Loss: 1.4000.
Epoch 231/500 Iteration 46930 | Training Loss: 1.3858.
Epoch 231/500 Iteration 46940 | Training Loss: 1.3512.
Epoch 231/500 Iteration 46950 | Training Loss: 1.3459.
Epoch 231/500 Iteration 46960 | Training Loss: 1.3983.
Epoch 231/500 Iteration 46970 | Training Loss: 1.3629.
Epoch 231/500 Iteration 46980 | Training Loss: 1.3866.
Epoch 231/500 Iteration 46990 | Training Loss: 1.3788.
Epoch 231/500 Iteration 47000 | Training Loss: 1.3335.
...

Epoch 237/500 Iteration 48320 | Training Loss: 1.4002.
Epoch 237/500 Iteration 48330 | Training Loss: 1.4020.
Epoch 237/500 Iteration 48340 | Training Loss: 1.4286.
Epoch 238/500 Iteration 48350 | Training Loss: 1.4065.
Epoch 238/500 Iteration 48360 | Training Loss: 1.3948.
Epoch 238/500 Iteration 48370 | Training Loss: 1.4226.
Epoch 238/500 Iteration 48380 | Training Loss: 1.3532.
Epoch 238/500 Iteration 48390 | Training Loss: 1.3792.
Epoch 238/500 Iteration 48400 | Training Loss: 1.3710.
Epoch 238/500 Iteration 48410 | Training Loss: 1.3435.
Epoch 238/500 Iteration 48420 | Training Loss: 1.3638.
Epoch 238/500 Iteration 48430 | Training Loss: 1.4030.
Epoch 238/500 Iteration 48440 | Training Loss: 1.3736.
Epoch 238/500 Iteration 48450 | Training Loss: 1.3139.
Epoch 238/500 Iteration 48460 | Training Loss: 1.3362.
Epoch 238/500 Iteration 48470 | Training Loss: 1.3655.
Epoch 238/500 Iteration 48480 | Training Loss: 1.3303.
Epoch 238/500 Iteration 48490 | Training Loss: 1.3472.
...

Epoch 245/500 Iteration 49810 | Training Loss: 1.3647.
Epoch 245/500 Iteration 49820 | Training Loss: 1.3643.
Epoch 245/500 Iteration 49830 | Training Loss: 1.4166.
Epoch 245/500 Iteration 49840 | Training Loss: 1.3637.
Epoch 245/500 Iteration 49850 | Training Loss: 1.3605.
Epoch 245/500 Iteration 49860 | Training Loss: 1.3721.
Epoch 245/500 Iteration 49870 | Training Loss: 1.3533.
Epoch 245/500 Iteration 49880 | Training Loss: 1.3374.
Epoch 245/500 Iteration 49890 | Training Loss: 1.3627.
Epoch 245/500 Iteration 49900 | Training Loss: 1.3871.
Epoch 245/500 Iteration 49910 | Training Loss: 1.3784.
Epoch 245/500 Iteration 49920 | Training Loss: 1.3490.
Epoch 245/500 Iteration 49930 | Training Loss: 1.3734.
Epoch 245/500 Iteration 49940 | Training Loss: 1.3651.
Epoch 245/500 Iteration 49950 | Training Loss: 1.4175.
Epoch 245/500 Iteration 49960 | Training Loss: 1.3595.
Epoch 245/500 Iteration 49970 | Training Loss: 1.3637.
Epoch 245/500 Iteration 49980 | Training Loss: 1.4049.
...

Epoch 252/500 Iteration 51300 | Training Loss: 1.3706.
Epoch 252/500 Iteration 51310 | Training Loss: 1.3531.
Epoch 252/500 Iteration 51320 | Training Loss: 1.3810.
Epoch 252/500 Iteration 51330 | Training Loss: 1.3608.
Epoch 252/500 Iteration 51340 | Training Loss: 1.3572.
Epoch 252/500 Iteration 51350 | Training Loss: 1.3300.
Epoch 252/500 Iteration 51360 | Training Loss: 1.3869.
Epoch 252/500 Iteration 51370 | Training Loss: 1.3824.
Epoch 252/500 Iteration 51380 | Training Loss: 1.3971.
Epoch 252/500 Iteration 51390 | Training Loss: 1.3860.
Epoch 252/500 Iteration 51400 | Training Loss: 1.4390.
Epoch 253/500 Iteration 51410 | Training Loss: 1.4066.
Epoch 253/500 Iteration 51420 | Training Loss: 1.3987.
Epoch 253/500 Iteration 51430 | Training Loss: 1.4148.
Epoch 253/500 Iteration 51440 | Training Loss: 1.3495.
Epoch 253/500 Iteration 51450 | Training Loss: 1.3729.
Epoch 253/500 Iteration 51460 | Training Loss: 1.3759.
Epoch 253/500 Iteration 51470 | Training Loss: 1.3558.
...

Epoch 259/500 Iteration 52790 | Training Loss: 1.3770.
Epoch 259/500 Iteration 52800 | Training Loss: 1.3689.
Epoch 259/500 Iteration 52810 | Training Loss: 1.3537.
Epoch 259/500 Iteration 52820 | Training Loss: 1.3516.
Epoch 259/500 Iteration 52830 | Training Loss: 1.3950.
Epoch 260/500 Iteration 52840 | Training Loss: 1.4184.
Epoch 260/500 Iteration 52850 | Training Loss: 1.3498.
Epoch 260/500 Iteration 52860 | Training Loss: 1.3998.
Epoch 260/500 Iteration 52870 | Training Loss: 1.3493.
Epoch 260/500 Iteration 52880 | Training Loss: 1.3703.
Epoch 260/500 Iteration 52890 | Training Loss: 1.4069.
Epoch 260/500 Iteration 52900 | Training Loss: 1.3577.
Epoch 260/500 Iteration 52910 | Training Loss: 1.3660.
Epoch 260/500 Iteration 52920 | Training Loss: 1.3748.
Epoch 260/500 Iteration 52930 | Training Loss: 1.3411.
Epoch 260/500 Iteration 52940 | Training Loss: 1.3487.
Epoch 260/500 Iteration 52950 | Training Loss: 1.3637.
Epoch 260/500 Iteration 52960 | Training Loss: 1.3784.
Epoch 260/

Epoch 267/500 Iteration 54280 | Training Loss: 1.3882.
Epoch 267/500 Iteration 54290 | Training Loss: 1.3647.
Epoch 267/500 Iteration 54300 | Training Loss: 1.3376.
Epoch 267/500 Iteration 54310 | Training Loss: 1.3537.
Epoch 267/500 Iteration 54320 | Training Loss: 1.3501.
Epoch 267/500 Iteration 54330 | Training Loss: 1.3696.
Epoch 267/500 Iteration 54340 | Training Loss: 1.3904.
Epoch 267/500 Iteration 54350 | Training Loss: 1.3763.
Epoch 267/500 Iteration 54360 | Training Loss: 1.3653.
Epoch 267/500 Iteration 54370 | Training Loss: 1.3608.
Epoch 267/500 Iteration 54380 | Training Loss: 1.3818.
Epoch 267/500 Iteration 54390 | Training Loss: 1.3597.
Epoch 267/500 Iteration 54400 | Training Loss: 1.3683.
Epoch 267/500 Iteration 54410 | Training Loss: 1.3250.
Epoch 267/500 Iteration 54420 | Training Loss: 1.4052.
Epoch 267/500 Iteration 54430 | Training Loss: 1.3740.
Epoch 267/500 Iteration 54440 | Training Loss: 1.3921.
Epoch 267/500 Iteration 54450 | Training Loss: 1.3978.
Epoch 267/

Epoch 274/500 Iteration 55770 | Training Loss: 1.3912.
Epoch 274/500 Iteration 55780 | Training Loss: 1.3855.
Epoch 274/500 Iteration 55790 | Training Loss: 1.3574.
Epoch 274/500 Iteration 55800 | Training Loss: 1.3597.
Epoch 274/500 Iteration 55810 | Training Loss: 1.3149.
Epoch 274/500 Iteration 55820 | Training Loss: 1.3703.
Epoch 274/500 Iteration 55830 | Training Loss: 1.3641.
Epoch 274/500 Iteration 55840 | Training Loss: 1.3196.
Epoch 274/500 Iteration 55850 | Training Loss: 1.3731.
Epoch 274/500 Iteration 55860 | Training Loss: 1.3678.
Epoch 274/500 Iteration 55870 | Training Loss: 1.3513.
Epoch 274/500 Iteration 55880 | Training Loss: 1.3505.
Epoch 274/500 Iteration 55890 | Training Loss: 1.3793.
Epoch 275/500 Iteration 55900 | Training Loss: 1.4233.
Epoch 275/500 Iteration 55910 | Training Loss: 1.3575.
Epoch 275/500 Iteration 55920 | Training Loss: 1.3969.
Epoch 275/500 Iteration 55930 | Training Loss: 1.3546.
Epoch 275/500 Iteration 55940 | Training Loss: 1.3600.
Epoch 275/

Epoch 281/500 Iteration 57260 | Training Loss: 1.3723.
Epoch 281/500 Iteration 57270 | Training Loss: 1.3722.
Epoch 281/500 Iteration 57280 | Training Loss: 1.4044.
Epoch 281/500 Iteration 57290 | Training Loss: 1.3621.
Epoch 281/500 Iteration 57300 | Training Loss: 1.4126.
Epoch 281/500 Iteration 57310 | Training Loss: 1.3841.
Epoch 281/500 Iteration 57320 | Training Loss: 1.3864.
Epoch 282/500 Iteration 57330 | Training Loss: 1.3850.
Epoch 282/500 Iteration 57340 | Training Loss: 1.3768.
Epoch 282/500 Iteration 57350 | Training Loss: 1.3473.
Epoch 282/500 Iteration 57360 | Training Loss: 1.3473.
Epoch 282/500 Iteration 57370 | Training Loss: 1.3761.
Epoch 282/500 Iteration 57380 | Training Loss: 1.3587.
Epoch 282/500 Iteration 57390 | Training Loss: 1.3702.
Epoch 282/500 Iteration 57400 | Training Loss: 1.3839.
Epoch 282/500 Iteration 57410 | Training Loss: 1.3753.
Epoch 282/500 Iteration 57420 | Training Loss: 1.3585.
Epoch 282/500 Iteration 57430 | Training Loss: 1.3588.
Epoch 282/

Epoch 288/500 Iteration 58750 | Training Loss: 1.4257.
Epoch 289/500 Iteration 58760 | Training Loss: 1.3531.
Epoch 289/500 Iteration 58770 | Training Loss: 1.3009.
Epoch 289/500 Iteration 58780 | Training Loss: 1.3624.
Epoch 289/500 Iteration 58790 | Training Loss: 1.3562.
Epoch 289/500 Iteration 58800 | Training Loss: 1.4286.
Epoch 289/500 Iteration 58810 | Training Loss: 1.3594.
Epoch 289/500 Iteration 58820 | Training Loss: 1.3709.
Epoch 289/500 Iteration 58830 | Training Loss: 1.4027.
Epoch 289/500 Iteration 58840 | Training Loss: 1.3731.
Epoch 289/500 Iteration 58850 | Training Loss: 1.3591.
Epoch 289/500 Iteration 58860 | Training Loss: 1.3587.
Epoch 289/500 Iteration 58870 | Training Loss: 1.3146.
Epoch 289/500 Iteration 58880 | Training Loss: 1.3706.
Epoch 289/500 Iteration 58890 | Training Loss: 1.3655.
Epoch 289/500 Iteration 58900 | Training Loss: 1.3106.
Epoch 289/500 Iteration 58910 | Training Loss: 1.3739.
Epoch 289/500 Iteration 58920 | Training Loss: 1.3567.
Epoch 289/

Epoch 296/500 Iteration 60240 | Training Loss: 1.3847.
Epoch 296/500 Iteration 60250 | Training Loss: 1.3723.
Epoch 296/500 Iteration 60260 | Training Loss: 1.3149.
Epoch 296/500 Iteration 60270 | Training Loss: 1.3563.
Epoch 296/500 Iteration 60280 | Training Loss: 1.3528.
Epoch 296/500 Iteration 60290 | Training Loss: 1.3125.
Epoch 296/500 Iteration 60300 | Training Loss: 1.3561.
Epoch 296/500 Iteration 60310 | Training Loss: 1.3419.
Epoch 296/500 Iteration 60320 | Training Loss: 1.3661.
Epoch 296/500 Iteration 60330 | Training Loss: 1.3654.
Epoch 296/500 Iteration 60340 | Training Loss: 1.4082.
Epoch 296/500 Iteration 60350 | Training Loss: 1.3688.
Epoch 296/500 Iteration 60360 | Training Loss: 1.3992.
Epoch 296/500 Iteration 60370 | Training Loss: 1.3597.
Epoch 296/500 Iteration 60380 | Training Loss: 1.3981.
Epoch 297/500 Iteration 60390 | Training Loss: 1.3661.
Epoch 297/500 Iteration 60400 | Training Loss: 1.3748.
Epoch 297/500 Iteration 60410 | Training Loss: 1.3540.
Epoch 297/

Epoch 303/500 Iteration 61730 | Training Loss: 1.3456.
Epoch 303/500 Iteration 61740 | Training Loss: 1.3065.
Epoch 303/500 Iteration 61750 | Training Loss: 1.3327.
Epoch 303/500 Iteration 61760 | Training Loss: 1.3286.
Epoch 303/500 Iteration 61770 | Training Loss: 1.4059.
Epoch 303/500 Iteration 61780 | Training Loss: 1.3627.
Epoch 303/500 Iteration 61790 | Training Loss: 1.3702.
Epoch 303/500 Iteration 61800 | Training Loss: 1.3980.
Epoch 303/500 Iteration 61810 | Training Loss: 1.4131.
Epoch 304/500 Iteration 61820 | Training Loss: 1.3683.
Epoch 304/500 Iteration 61830 | Training Loss: 1.2813.
Epoch 304/500 Iteration 61840 | Training Loss: 1.3662.
Epoch 304/500 Iteration 61850 | Training Loss: 1.3532.
Epoch 304/500 Iteration 61860 | Training Loss: 1.4292.
Epoch 304/500 Iteration 61870 | Training Loss: 1.3629.
Epoch 304/500 Iteration 61880 | Training Loss: 1.3610.
Epoch 304/500 Iteration 61890 | Training Loss: 1.4093.
Epoch 304/500 Iteration 61900 | Training Loss: 1.3745.
Epoch 304/

Epoch 310/500 Iteration 63220 | Training Loss: 1.3375.
Epoch 310/500 Iteration 63230 | Training Loss: 1.3372.
Epoch 310/500 Iteration 63240 | Training Loss: 1.3987.
Epoch 311/500 Iteration 63250 | Training Loss: 1.3660.
Epoch 311/500 Iteration 63260 | Training Loss: 1.3375.
Epoch 311/500 Iteration 63270 | Training Loss: 1.3325.
Epoch 311/500 Iteration 63280 | Training Loss: 1.3841.
Epoch 311/500 Iteration 63290 | Training Loss: 1.3546.
Epoch 311/500 Iteration 63300 | Training Loss: 1.3909.
Epoch 311/500 Iteration 63310 | Training Loss: 1.3634.
Epoch 311/500 Iteration 63320 | Training Loss: 1.3174.
Epoch 311/500 Iteration 63330 | Training Loss: 1.3604.
Epoch 311/500 Iteration 63340 | Training Loss: 1.3500.
Epoch 311/500 Iteration 63350 | Training Loss: 1.3071.
Epoch 311/500 Iteration 63360 | Training Loss: 1.3566.
Epoch 311/500 Iteration 63370 | Training Loss: 1.3410.
Epoch 311/500 Iteration 63380 | Training Loss: 1.3630.
Epoch 311/500 Iteration 63390 | Training Loss: 1.3533.
Epoch 311/

Epoch 318/500 Iteration 64710 | Training Loss: 1.3646.
Epoch 318/500 Iteration 64720 | Training Loss: 1.3653.
Epoch 318/500 Iteration 64730 | Training Loss: 1.3386.
Epoch 318/500 Iteration 64740 | Training Loss: 1.3445.
Epoch 318/500 Iteration 64750 | Training Loss: 1.3852.
Epoch 318/500 Iteration 64760 | Training Loss: 1.3530.
Epoch 318/500 Iteration 64770 | Training Loss: 1.3041.
Epoch 318/500 Iteration 64780 | Training Loss: 1.3352.
Epoch 318/500 Iteration 64790 | Training Loss: 1.3536.
Epoch 318/500 Iteration 64800 | Training Loss: 1.3193.
Epoch 318/500 Iteration 64810 | Training Loss: 1.3381.
Epoch 318/500 Iteration 64820 | Training Loss: 1.3429.
Epoch 318/500 Iteration 64830 | Training Loss: 1.3833.
Epoch 318/500 Iteration 64840 | Training Loss: 1.3470.
Epoch 318/500 Iteration 64850 | Training Loss: 1.3780.
Epoch 318/500 Iteration 64860 | Training Loss: 1.3842.
Epoch 318/500 Iteration 64870 | Training Loss: 1.4040.
Epoch 319/500 Iteration 64880 | Training Loss: 1.3650.
Epoch 319/

Epoch 325/500 Iteration 66200 | Training Loss: 1.3296.
Epoch 325/500 Iteration 66210 | Training Loss: 1.3506.
Epoch 325/500 Iteration 66220 | Training Loss: 1.3652.
Epoch 325/500 Iteration 66230 | Training Loss: 1.3720.
Epoch 325/500 Iteration 66240 | Training Loss: 1.3229.
Epoch 325/500 Iteration 66250 | Training Loss: 1.3453.
Epoch 325/500 Iteration 66260 | Training Loss: 1.3503.
Epoch 325/500 Iteration 66270 | Training Loss: 1.3981.
Epoch 325/500 Iteration 66280 | Training Loss: 1.3469.
Epoch 325/500 Iteration 66290 | Training Loss: 1.3353.
Epoch 325/500 Iteration 66300 | Training Loss: 1.3835.
Epoch 326/500 Iteration 66310 | Training Loss: 1.3732.
Epoch 326/500 Iteration 66320 | Training Loss: 1.3344.
Epoch 326/500 Iteration 66330 | Training Loss: 1.3348.
Epoch 326/500 Iteration 66340 | Training Loss: 1.3812.
Epoch 326/500 Iteration 66350 | Training Loss: 1.3594.
Epoch 326/500 Iteration 66360 | Training Loss: 1.3723.
Epoch 326/500 Iteration 66370 | Training Loss: 1.3746.
Epoch 326/

Epoch 332/500 Iteration 67690 | Training Loss: 1.3589.
Epoch 332/500 Iteration 67700 | Training Loss: 1.3855.
Epoch 332/500 Iteration 67710 | Training Loss: 1.3849.
Epoch 332/500 Iteration 67720 | Training Loss: 1.4339.
Epoch 333/500 Iteration 67730 | Training Loss: 1.4005.
Epoch 333/500 Iteration 67740 | Training Loss: 1.3856.
Epoch 333/500 Iteration 67750 | Training Loss: 1.4141.
Epoch 333/500 Iteration 67760 | Training Loss: 1.3349.
Epoch 333/500 Iteration 67770 | Training Loss: 1.3836.
Epoch 333/500 Iteration 67780 | Training Loss: 1.3507.
Epoch 333/500 Iteration 67790 | Training Loss: 1.3270.
Epoch 333/500 Iteration 67800 | Training Loss: 1.3467.
Epoch 333/500 Iteration 67810 | Training Loss: 1.3974.
Epoch 333/500 Iteration 67820 | Training Loss: 1.3538.
Epoch 333/500 Iteration 67830 | Training Loss: 1.3106.
Epoch 333/500 Iteration 67840 | Training Loss: 1.3149.
Epoch 333/500 Iteration 67850 | Training Loss: 1.3645.
Epoch 333/500 Iteration 67860 | Training Loss: 1.3043.
Epoch 333/

Epoch 340/500 Iteration 69180 | Training Loss: 1.3693.
Epoch 340/500 Iteration 69190 | Training Loss: 1.3454.
Epoch 340/500 Iteration 69200 | Training Loss: 1.3479.
Epoch 340/500 Iteration 69210 | Training Loss: 1.3907.
Epoch 340/500 Iteration 69220 | Training Loss: 1.3426.
Epoch 340/500 Iteration 69230 | Training Loss: 1.3625.
Epoch 340/500 Iteration 69240 | Training Loss: 1.3413.
Epoch 340/500 Iteration 69250 | Training Loss: 1.3273.
Epoch 340/500 Iteration 69260 | Training Loss: 1.3323.
Epoch 340/500 Iteration 69270 | Training Loss: 1.3514.
Epoch 340/500 Iteration 69280 | Training Loss: 1.3636.
Epoch 340/500 Iteration 69290 | Training Loss: 1.3616.
Epoch 340/500 Iteration 69300 | Training Loss: 1.3213.
Epoch 340/500 Iteration 69310 | Training Loss: 1.3615.
Epoch 340/500 Iteration 69320 | Training Loss: 1.3656.
Epoch 340/500 Iteration 69330 | Training Loss: 1.3885.
Epoch 340/500 Iteration 69340 | Training Loss: 1.3559.
Epoch 340/500 Iteration 69350 | Training Loss: 1.3348.
Epoch 340/

Epoch 347/500 Iteration 70670 | Training Loss: 1.3601.
Epoch 347/500 Iteration 70680 | Training Loss: 1.3421.
Epoch 347/500 Iteration 70690 | Training Loss: 1.3456.
Epoch 347/500 Iteration 70700 | Training Loss: 1.3724.
Epoch 347/500 Iteration 70710 | Training Loss: 1.3405.
Epoch 347/500 Iteration 70720 | Training Loss: 1.3419.
Epoch 347/500 Iteration 70730 | Training Loss: 1.3126.
Epoch 347/500 Iteration 70740 | Training Loss: 1.3712.
Epoch 347/500 Iteration 70750 | Training Loss: 1.3538.
Epoch 347/500 Iteration 70760 | Training Loss: 1.3701.
Epoch 347/500 Iteration 70770 | Training Loss: 1.3719.
Epoch 347/500 Iteration 70780 | Training Loss: 1.4382.
Epoch 348/500 Iteration 70790 | Training Loss: 1.3922.
Epoch 348/500 Iteration 70800 | Training Loss: 1.3922.
Epoch 348/500 Iteration 70810 | Training Loss: 1.3969.
Epoch 348/500 Iteration 70820 | Training Loss: 1.3249.
Epoch 348/500 Iteration 70830 | Training Loss: 1.3603.
Epoch 348/500 Iteration 70840 | Training Loss: 1.3475.
Epoch 348/

Epoch 354/500 Iteration 72160 | Training Loss: 1.2938.
Epoch 354/500 Iteration 72170 | Training Loss: 1.3714.
Epoch 354/500 Iteration 72180 | Training Loss: 1.3371.
Epoch 354/500 Iteration 72190 | Training Loss: 1.3539.
Epoch 354/500 Iteration 72200 | Training Loss: 1.3356.
Epoch 354/500 Iteration 72210 | Training Loss: 1.3676.
Epoch 355/500 Iteration 72220 | Training Loss: 1.4025.
Epoch 355/500 Iteration 72230 | Training Loss: 1.3274.
Epoch 355/500 Iteration 72240 | Training Loss: 1.3835.
Epoch 355/500 Iteration 72250 | Training Loss: 1.3444.
Epoch 355/500 Iteration 72260 | Training Loss: 1.3385.
Epoch 355/500 Iteration 72270 | Training Loss: 1.3991.
Epoch 355/500 Iteration 72280 | Training Loss: 1.3463.
Epoch 355/500 Iteration 72290 | Training Loss: 1.3424.
Epoch 355/500 Iteration 72300 | Training Loss: 1.3549.
Epoch 355/500 Iteration 72310 | Training Loss: 1.3270.
Epoch 355/500 Iteration 72320 | Training Loss: 1.3271.
Epoch 355/500 Iteration 72330 | Training Loss: 1.3494.
Epoch 355/

Epoch 362/500 Iteration 73650 | Training Loss: 1.3618.
Epoch 362/500 Iteration 73660 | Training Loss: 1.3562.
Epoch 362/500 Iteration 73670 | Training Loss: 1.3403.
Epoch 362/500 Iteration 73680 | Training Loss: 1.3433.
Epoch 362/500 Iteration 73690 | Training Loss: 1.3530.
Epoch 362/500 Iteration 73700 | Training Loss: 1.3307.
Epoch 362/500 Iteration 73710 | Training Loss: 1.3536.
Epoch 362/500 Iteration 73720 | Training Loss: 1.3622.
Epoch 362/500 Iteration 73730 | Training Loss: 1.3573.
Epoch 362/500 Iteration 73740 | Training Loss: 1.3504.
Epoch 362/500 Iteration 73750 | Training Loss: 1.3486.
Epoch 362/500 Iteration 73760 | Training Loss: 1.3760.
Epoch 362/500 Iteration 73770 | Training Loss: 1.3452.
Epoch 362/500 Iteration 73780 | Training Loss: 1.3480.
Epoch 362/500 Iteration 73790 | Training Loss: 1.3235.
Epoch 362/500 Iteration 73800 | Training Loss: 1.3783.
Epoch 362/500 Iteration 73810 | Training Loss: 1.3533.
Epoch 362/500 Iteration 73820 | Training Loss: 1.3775.
Epoch 362/

Epoch 369/500 Iteration 75140 | Training Loss: 1.3497.
Epoch 369/500 Iteration 75150 | Training Loss: 1.3758.
Epoch 369/500 Iteration 75160 | Training Loss: 1.3681.
Epoch 369/500 Iteration 75170 | Training Loss: 1.3469.
Epoch 369/500 Iteration 75180 | Training Loss: 1.3567.
Epoch 369/500 Iteration 75190 | Training Loss: 1.3047.
Epoch 369/500 Iteration 75200 | Training Loss: 1.3576.
Epoch 369/500 Iteration 75210 | Training Loss: 1.3530.
Epoch 369/500 Iteration 75220 | Training Loss: 1.2938.
Epoch 369/500 Iteration 75230 | Training Loss: 1.3722.
Epoch 369/500 Iteration 75240 | Training Loss: 1.3597.
Epoch 369/500 Iteration 75250 | Training Loss: 1.3468.
Epoch 369/500 Iteration 75260 | Training Loss: 1.3304.
Epoch 369/500 Iteration 75270 | Training Loss: 1.3598.
Epoch 370/500 Iteration 75280 | Training Loss: 1.3974.
Epoch 370/500 Iteration 75290 | Training Loss: 1.3408.
Epoch 370/500 Iteration 75300 | Training Loss: 1.3713.
Epoch 370/500 Iteration 75310 | Training Loss: 1.3467.
Epoch 370/

Epoch 376/500 Iteration 76630 | Training Loss: 1.3353.
Epoch 376/500 Iteration 76640 | Training Loss: 1.3543.
Epoch 376/500 Iteration 76650 | Training Loss: 1.3531.
Epoch 376/500 Iteration 76660 | Training Loss: 1.3907.
Epoch 376/500 Iteration 76670 | Training Loss: 1.3586.
Epoch 376/500 Iteration 76680 | Training Loss: 1.3748.
Epoch 376/500 Iteration 76690 | Training Loss: 1.3570.
Epoch 376/500 Iteration 76700 | Training Loss: 1.3714.
Epoch 377/500 Iteration 76710 | Training Loss: 1.3684.
Epoch 377/500 Iteration 76720 | Training Loss: 1.3657.
Epoch 377/500 Iteration 76730 | Training Loss: 1.3257.
Epoch 377/500 Iteration 76740 | Training Loss: 1.3272.
Epoch 377/500 Iteration 76750 | Training Loss: 1.3481.
Epoch 377/500 Iteration 76760 | Training Loss: 1.3246.
Epoch 377/500 Iteration 76770 | Training Loss: 1.3361.
Epoch 377/500 Iteration 76780 | Training Loss: 1.3593.
Epoch 377/500 Iteration 76790 | Training Loss: 1.3432.
Epoch 377/500 Iteration 76800 | Training Loss: 1.3432.
Epoch 377/

Epoch 383/500 Iteration 78120 | Training Loss: 1.3770.
Epoch 383/500 Iteration 78130 | Training Loss: 1.4008.
Epoch 384/500 Iteration 78140 | Training Loss: 1.3575.
Epoch 384/500 Iteration 78150 | Training Loss: 1.2735.
Epoch 384/500 Iteration 78160 | Training Loss: 1.3482.
Epoch 384/500 Iteration 78170 | Training Loss: 1.3410.
Epoch 384/500 Iteration 78180 | Training Loss: 1.4154.
Epoch 384/500 Iteration 78190 | Training Loss: 1.3631.
Epoch 384/500 Iteration 78200 | Training Loss: 1.3604.
Epoch 384/500 Iteration 78210 | Training Loss: 1.3876.
Epoch 384/500 Iteration 78220 | Training Loss: 1.3505.
Epoch 384/500 Iteration 78230 | Training Loss: 1.3495.
Epoch 384/500 Iteration 78240 | Training Loss: 1.3403.
Epoch 384/500 Iteration 78250 | Training Loss: 1.3015.
Epoch 384/500 Iteration 78260 | Training Loss: 1.3437.
Epoch 384/500 Iteration 78270 | Training Loss: 1.3430.
Epoch 384/500 Iteration 78280 | Training Loss: 1.2942.
Epoch 384/500 Iteration 78290 | Training Loss: 1.3703.
Epoch 384/

Epoch 391/500 Iteration 79610 | Training Loss: 1.3426.
Epoch 391/500 Iteration 79620 | Training Loss: 1.3675.
Epoch 391/500 Iteration 79630 | Training Loss: 1.3658.
Epoch 391/500 Iteration 79640 | Training Loss: 1.3086.
Epoch 391/500 Iteration 79650 | Training Loss: 1.3379.
Epoch 391/500 Iteration 79660 | Training Loss: 1.3570.
Epoch 391/500 Iteration 79670 | Training Loss: 1.3046.
Epoch 391/500 Iteration 79680 | Training Loss: 1.3468.
Epoch 391/500 Iteration 79690 | Training Loss: 1.3371.
Epoch 391/500 Iteration 79700 | Training Loss: 1.3482.
Epoch 391/500 Iteration 79710 | Training Loss: 1.3397.
Epoch 391/500 Iteration 79720 | Training Loss: 1.3841.
Epoch 391/500 Iteration 79730 | Training Loss: 1.3564.
Epoch 391/500 Iteration 79740 | Training Loss: 1.3786.
Epoch 391/500 Iteration 79750 | Training Loss: 1.3596.
Epoch 391/500 Iteration 79760 | Training Loss: 1.3639.
Epoch 392/500 Iteration 79770 | Training Loss: 1.3559.
Epoch 392/500 Iteration 79780 | Training Loss: 1.3580.
Epoch 392/

Epoch 398/500 Iteration 81100 | Training Loss: 1.3175.
Epoch 398/500 Iteration 81110 | Training Loss: 1.3458.
Epoch 398/500 Iteration 81120 | Training Loss: 1.3078.
Epoch 398/500 Iteration 81130 | Training Loss: 1.3245.
Epoch 398/500 Iteration 81140 | Training Loss: 1.3307.
Epoch 398/500 Iteration 81150 | Training Loss: 1.3828.
Epoch 398/500 Iteration 81160 | Training Loss: 1.3491.
Epoch 398/500 Iteration 81170 | Training Loss: 1.3559.
Epoch 398/500 Iteration 81180 | Training Loss: 1.3863.
Epoch 398/500 Iteration 81190 | Training Loss: 1.4097.
Epoch 399/500 Iteration 81200 | Training Loss: 1.3537.
Epoch 399/500 Iteration 81210 | Training Loss: 1.2737.
Epoch 399/500 Iteration 81220 | Training Loss: 1.3678.
Epoch 399/500 Iteration 81230 | Training Loss: 1.3498.
Epoch 399/500 Iteration 81240 | Training Loss: 1.4230.
Epoch 399/500 Iteration 81250 | Training Loss: 1.3625.
Epoch 399/500 Iteration 81260 | Training Loss: 1.3648.
Epoch 399/500 Iteration 81270 | Training Loss: 1.3867.
Epoch 399/

Epoch 405/500 Iteration 82590 | Training Loss: 1.3890.
Epoch 405/500 Iteration 82600 | Training Loss: 1.3520.
Epoch 405/500 Iteration 82610 | Training Loss: 1.3405.
Epoch 405/500 Iteration 82620 | Training Loss: 1.3736.
Epoch 406/500 Iteration 82630 | Training Loss: 1.3594.
Epoch 406/500 Iteration 82640 | Training Loss: 1.3340.
Epoch 406/500 Iteration 82650 | Training Loss: 1.3389.
Epoch 406/500 Iteration 82660 | Training Loss: 1.3785.
Epoch 406/500 Iteration 82670 | Training Loss: 1.3399.
Epoch 406/500 Iteration 82680 | Training Loss: 1.3514.
Epoch 406/500 Iteration 82690 | Training Loss: 1.3651.
Epoch 406/500 Iteration 82700 | Training Loss: 1.3014.
Epoch 406/500 Iteration 82710 | Training Loss: 1.3492.
Epoch 406/500 Iteration 82720 | Training Loss: 1.3429.
Epoch 406/500 Iteration 82730 | Training Loss: 1.3065.
Epoch 406/500 Iteration 82740 | Training Loss: 1.3400.
Epoch 406/500 Iteration 82750 | Training Loss: 1.3294.
Epoch 406/500 Iteration 82760 | Training Loss: 1.3495.
Epoch 406/

Epoch 413/500 Iteration 84080 | Training Loss: 1.3398.
Epoch 413/500 Iteration 84090 | Training Loss: 1.3546.
Epoch 413/500 Iteration 84100 | Training Loss: 1.3518.
Epoch 413/500 Iteration 84110 | Training Loss: 1.3428.
Epoch 413/500 Iteration 84120 | Training Loss: 1.3324.
Epoch 413/500 Iteration 84130 | Training Loss: 1.3804.
Epoch 413/500 Iteration 84140 | Training Loss: 1.3420.
Epoch 413/500 Iteration 84150 | Training Loss: 1.3013.
Epoch 413/500 Iteration 84160 | Training Loss: 1.3169.
Epoch 413/500 Iteration 84170 | Training Loss: 1.3489.
Epoch 413/500 Iteration 84180 | Training Loss: 1.3098.
Epoch 413/500 Iteration 84190 | Training Loss: 1.3269.
Epoch 413/500 Iteration 84200 | Training Loss: 1.3199.
Epoch 413/500 Iteration 84210 | Training Loss: 1.3791.
Epoch 413/500 Iteration 84220 | Training Loss: 1.3468.
Epoch 413/500 Iteration 84230 | Training Loss: 1.3616.
Epoch 413/500 Iteration 84240 | Training Loss: 1.3689.
Epoch 413/500 Iteration 84250 | Training Loss: 1.4008.
Epoch 414/

Epoch 420/500 Iteration 85570 | Training Loss: 1.3222.
Epoch 420/500 Iteration 85580 | Training Loss: 1.3175.
Epoch 420/500 Iteration 85590 | Training Loss: 1.3407.
Epoch 420/500 Iteration 85600 | Training Loss: 1.3617.
Epoch 420/500 Iteration 85610 | Training Loss: 1.3626.
Epoch 420/500 Iteration 85620 | Training Loss: 1.3137.
Epoch 420/500 Iteration 85630 | Training Loss: 1.3392.
Epoch 420/500 Iteration 85640 | Training Loss: 1.3557.
Epoch 420/500 Iteration 85650 | Training Loss: 1.3806.
Epoch 420/500 Iteration 85660 | Training Loss: 1.3515.
Epoch 420/500 Iteration 85670 | Training Loss: 1.3392.
Epoch 420/500 Iteration 85680 | Training Loss: 1.3723.
Epoch 421/500 Iteration 85690 | Training Loss: 1.3629.
Epoch 421/500 Iteration 85700 | Training Loss: 1.3252.
Epoch 421/500 Iteration 85710 | Training Loss: 1.3319.
Epoch 421/500 Iteration 85720 | Training Loss: 1.3719.
Epoch 421/500 Iteration 85730 | Training Loss: 1.3493.
Epoch 421/500 Iteration 85740 | Training Loss: 1.3579.
Epoch 421/

Epoch 427/500 Iteration 87060 | Training Loss: 1.3758.
Epoch 427/500 Iteration 87070 | Training Loss: 1.3407.
Epoch 427/500 Iteration 87080 | Training Loss: 1.3648.
Epoch 427/500 Iteration 87090 | Training Loss: 1.3731.
Epoch 427/500 Iteration 87100 | Training Loss: 1.4290.
Epoch 428/500 Iteration 87110 | Training Loss: 1.3869.
Epoch 428/500 Iteration 87120 | Training Loss: 1.3715.
Epoch 428/500 Iteration 87130 | Training Loss: 1.4087.
Epoch 428/500 Iteration 87140 | Training Loss: 1.3185.
Epoch 428/500 Iteration 87150 | Training Loss: 1.3538.
Epoch 428/500 Iteration 87160 | Training Loss: 1.3461.
Epoch 428/500 Iteration 87170 | Training Loss: 1.3397.
Epoch 428/500 Iteration 87180 | Training Loss: 1.3389.
Epoch 428/500 Iteration 87190 | Training Loss: 1.3747.
Epoch 428/500 Iteration 87200 | Training Loss: 1.3450.
Epoch 428/500 Iteration 87210 | Training Loss: 1.2918.
Epoch 428/500 Iteration 87220 | Training Loss: 1.3131.
Epoch 428/500 Iteration 87230 | Training Loss: 1.3311.
Epoch 428/

Epoch 435/500 Iteration 88550 | Training Loss: 1.3296.
Epoch 435/500 Iteration 88560 | Training Loss: 1.3681.
Epoch 435/500 Iteration 88570 | Training Loss: 1.3258.
Epoch 435/500 Iteration 88580 | Training Loss: 1.3379.
Epoch 435/500 Iteration 88590 | Training Loss: 1.3888.
Epoch 435/500 Iteration 88600 | Training Loss: 1.3401.
Epoch 435/500 Iteration 88610 | Training Loss: 1.3503.
Epoch 435/500 Iteration 88620 | Training Loss: 1.3422.
Epoch 435/500 Iteration 88630 | Training Loss: 1.3199.
Epoch 435/500 Iteration 88640 | Training Loss: 1.3223.
Epoch 435/500 Iteration 88650 | Training Loss: 1.3426.
Epoch 435/500 Iteration 88660 | Training Loss: 1.3518.
Epoch 435/500 Iteration 88670 | Training Loss: 1.3658.
Epoch 435/500 Iteration 88680 | Training Loss: 1.3225.
Epoch 435/500 Iteration 88690 | Training Loss: 1.3514.
Epoch 435/500 Iteration 88700 | Training Loss: 1.3411.
Epoch 435/500 Iteration 88710 | Training Loss: 1.3755.
Epoch 435/500 Iteration 88720 | Training Loss: 1.3400.
Epoch 435/

Epoch 442/500 Iteration 90040 | Training Loss: 1.3558.
Epoch 442/500 Iteration 90050 | Training Loss: 1.3444.
Epoch 442/500 Iteration 90060 | Training Loss: 1.3448.
Epoch 442/500 Iteration 90070 | Training Loss: 1.3339.
Epoch 442/500 Iteration 90080 | Training Loss: 1.3634.
Epoch 442/500 Iteration 90090 | Training Loss: 1.3349.
Epoch 442/500 Iteration 90100 | Training Loss: 1.3475.
Epoch 442/500 Iteration 90110 | Training Loss: 1.2953.
Epoch 442/500 Iteration 90120 | Training Loss: 1.3817.
Epoch 442/500 Iteration 90130 | Training Loss: 1.3529.
Epoch 442/500 Iteration 90140 | Training Loss: 1.3808.
Epoch 442/500 Iteration 90150 | Training Loss: 1.3612.
Epoch 442/500 Iteration 90160 | Training Loss: 1.4159.
Epoch 443/500 Iteration 90170 | Training Loss: 1.3862.
Epoch 443/500 Iteration 90180 | Training Loss: 1.3640.
Epoch 443/500 Iteration 90190 | Training Loss: 1.3775.
Epoch 443/500 Iteration 90200 | Training Loss: 1.3170.
Epoch 443/500 Iteration 90210 | Training Loss: 1.3549.
Epoch 443/

Epoch 449/500 Iteration 91530 | Training Loss: 1.3523.
Epoch 449/500 Iteration 91540 | Training Loss: 1.2932.
Epoch 449/500 Iteration 91550 | Training Loss: 1.3614.
Epoch 449/500 Iteration 91560 | Training Loss: 1.3399.
Epoch 449/500 Iteration 91570 | Training Loss: 1.3302.
Epoch 449/500 Iteration 91580 | Training Loss: 1.3435.
Epoch 449/500 Iteration 91590 | Training Loss: 1.3599.
Epoch 450/500 Iteration 91600 | Training Loss: 1.3838.
Epoch 450/500 Iteration 91610 | Training Loss: 1.3312.
Epoch 450/500 Iteration 91620 | Training Loss: 1.3718.
Epoch 450/500 Iteration 91630 | Training Loss: 1.3365.
Epoch 450/500 Iteration 91640 | Training Loss: 1.3384.
Epoch 450/500 Iteration 91650 | Training Loss: 1.3563.
Epoch 450/500 Iteration 91660 | Training Loss: 1.3333.
Epoch 450/500 Iteration 91670 | Training Loss: 1.3421.
Epoch 450/500 Iteration 91680 | Training Loss: 1.3464.
Epoch 450/500 Iteration 91690 | Training Loss: 1.3056.
Epoch 450/500 Iteration 91700 | Training Loss: 1.3218.
Epoch 450/

Epoch 456/500 Iteration 93020 | Training Loss: 1.3457.
Epoch 457/500 Iteration 93030 | Training Loss: 1.3596.
Epoch 457/500 Iteration 93040 | Training Loss: 1.3598.
Epoch 457/500 Iteration 93050 | Training Loss: 1.3300.
Epoch 457/500 Iteration 93060 | Training Loss: 1.3235.
Epoch 457/500 Iteration 93070 | Training Loss: 1.3442.
Epoch 457/500 Iteration 93080 | Training Loss: 1.3259.
Epoch 457/500 Iteration 93090 | Training Loss: 1.3447.
Epoch 457/500 Iteration 93100 | Training Loss: 1.3680.
Epoch 457/500 Iteration 93110 | Training Loss: 1.3399.
Epoch 457/500 Iteration 93120 | Training Loss: 1.3284.
Epoch 457/500 Iteration 93130 | Training Loss: 1.3425.
Epoch 457/500 Iteration 93140 | Training Loss: 1.3605.
Epoch 457/500 Iteration 93150 | Training Loss: 1.3299.
Epoch 457/500 Iteration 93160 | Training Loss: 1.3421.
Epoch 457/500 Iteration 93170 | Training Loss: 1.2889.
Epoch 457/500 Iteration 93180 | Training Loss: 1.3781.
Epoch 457/500 Iteration 93190 | Training Loss: 1.3488.
Epoch 457/

Epoch 464/500 Iteration 94510 | Training Loss: 1.3620.
Epoch 464/500 Iteration 94520 | Training Loss: 1.3516.
Epoch 464/500 Iteration 94530 | Training Loss: 1.3834.
Epoch 464/500 Iteration 94540 | Training Loss: 1.3792.
Epoch 464/500 Iteration 94550 | Training Loss: 1.3288.
Epoch 464/500 Iteration 94560 | Training Loss: 1.3475.
Epoch 464/500 Iteration 94570 | Training Loss: 1.2894.
Epoch 464/500 Iteration 94580 | Training Loss: 1.3403.
Epoch 464/500 Iteration 94590 | Training Loss: 1.3427.
Epoch 464/500 Iteration 94600 | Training Loss: 1.2944.
Epoch 464/500 Iteration 94610 | Training Loss: 1.3595.
Epoch 464/500 Iteration 94620 | Training Loss: 1.3357.
Epoch 464/500 Iteration 94630 | Training Loss: 1.3360.
Epoch 464/500 Iteration 94640 | Training Loss: 1.3303.
Epoch 464/500 Iteration 94650 | Training Loss: 1.3545.
Epoch 465/500 Iteration 94660 | Training Loss: 1.3750.
Epoch 465/500 Iteration 94670 | Training Loss: 1.3337.
Epoch 465/500 Iteration 94680 | Training Loss: 1.3655.
Epoch 465/

Epoch 471/500 Iteration 96000 | Training Loss: 1.3468.
Epoch 471/500 Iteration 96010 | Training Loss: 1.3242.
Epoch 471/500 Iteration 96020 | Training Loss: 1.3427.
Epoch 471/500 Iteration 96030 | Training Loss: 1.3422.
Epoch 471/500 Iteration 96040 | Training Loss: 1.3827.
Epoch 471/500 Iteration 96050 | Training Loss: 1.3441.
Epoch 471/500 Iteration 96060 | Training Loss: 1.3802.
Epoch 471/500 Iteration 96070 | Training Loss: 1.3558.
Epoch 471/500 Iteration 96080 | Training Loss: 1.3690.
Epoch 472/500 Iteration 96090 | Training Loss: 1.3526.
Epoch 472/500 Iteration 96100 | Training Loss: 1.3552.
Epoch 472/500 Iteration 96110 | Training Loss: 1.3265.
Epoch 472/500 Iteration 96120 | Training Loss: 1.3393.
Epoch 472/500 Iteration 96130 | Training Loss: 1.3411.
Epoch 472/500 Iteration 96140 | Training Loss: 1.3181.
Epoch 472/500 Iteration 96150 | Training Loss: 1.3381.
Epoch 472/500 Iteration 96160 | Training Loss: 1.3626.
Epoch 472/500 Iteration 96170 | Training Loss: 1.3369.
Epoch 472/

Epoch 478/500 Iteration 97490 | Training Loss: 1.3516.
Epoch 478/500 Iteration 97500 | Training Loss: 1.3725.
Epoch 478/500 Iteration 97510 | Training Loss: 1.3916.
Epoch 479/500 Iteration 97520 | Training Loss: 1.3541.
Epoch 479/500 Iteration 97530 | Training Loss: 1.2838.
Epoch 479/500 Iteration 97540 | Training Loss: 1.3454.
Epoch 479/500 Iteration 97550 | Training Loss: 1.3400.
Epoch 479/500 Iteration 97560 | Training Loss: 1.4238.
Epoch 479/500 Iteration 97570 | Training Loss: 1.3518.
Epoch 479/500 Iteration 97580 | Training Loss: 1.3530.
Epoch 479/500 Iteration 97590 | Training Loss: 1.3708.
Epoch 479/500 Iteration 97600 | Training Loss: 1.3588.
Epoch 479/500 Iteration 97610 | Training Loss: 1.3383.
Epoch 479/500 Iteration 97620 | Training Loss: 1.3448.
Epoch 479/500 Iteration 97630 | Training Loss: 1.2971.
Epoch 479/500 Iteration 97640 | Training Loss: 1.3435.
Epoch 479/500 Iteration 97650 | Training Loss: 1.3567.
Epoch 479/500 Iteration 97660 | Training Loss: 1.2831.
Epoch 479/

Epoch 486/500 Iteration 98980 | Training Loss: 1.3765.
Epoch 486/500 Iteration 98990 | Training Loss: 1.3330.
Epoch 486/500 Iteration 99000 | Training Loss: 1.3630.
Epoch 486/500 Iteration 99010 | Training Loss: 1.3485.
Epoch 486/500 Iteration 99020 | Training Loss: 1.3169.
Epoch 486/500 Iteration 99030 | Training Loss: 1.3390.
Epoch 486/500 Iteration 99040 | Training Loss: 1.3534.
Epoch 486/500 Iteration 99050 | Training Loss: 1.2864.
Epoch 486/500 Iteration 99060 | Training Loss: 1.3380.
Epoch 486/500 Iteration 99070 | Training Loss: 1.3326.
Epoch 486/500 Iteration 99080 | Training Loss: 1.3504.
Epoch 486/500 Iteration 99090 | Training Loss: 1.3384.
Epoch 486/500 Iteration 99100 | Training Loss: 1.3846.
Epoch 486/500 Iteration 99110 | Training Loss: 1.3424.
Epoch 486/500 Iteration 99120 | Training Loss: 1.3818.
Epoch 486/500 Iteration 99130 | Training Loss: 1.3490.
Epoch 486/500 Iteration 99140 | Training Loss: 1.3735.
Epoch 487/500 Iteration 99150 | Training Loss: 1.3544.
Epoch 487/

Epoch 493/500 Iteration 100470 | Training Loss: 1.2937.
Epoch 493/500 Iteration 100480 | Training Loss: 1.3088.
Epoch 493/500 Iteration 100490 | Training Loss: 1.3363.
Epoch 493/500 Iteration 100500 | Training Loss: 1.2971.
Epoch 493/500 Iteration 100510 | Training Loss: 1.3201.
Epoch 493/500 Iteration 100520 | Training Loss: 1.3218.
Epoch 493/500 Iteration 100530 | Training Loss: 1.3801.
Epoch 493/500 Iteration 100540 | Training Loss: 1.3290.
Epoch 493/500 Iteration 100550 | Training Loss: 1.3372.
Epoch 493/500 Iteration 100560 | Training Loss: 1.3680.
Epoch 493/500 Iteration 100570 | Training Loss: 1.3856.
Epoch 494/500 Iteration 100580 | Training Loss: 1.3524.
Epoch 494/500 Iteration 100590 | Training Loss: 1.2745.
Epoch 494/500 Iteration 100600 | Training Loss: 1.3358.
Epoch 494/500 Iteration 100610 | Training Loss: 1.3315.
Epoch 494/500 Iteration 100620 | Training Loss: 1.4162.
Epoch 494/500 Iteration 100630 | Training Loss: 1.3404.
Epoch 494/500 Iteration 100640 | Training Loss: 

Epoch 500/500 Iteration 101940 | Training Loss: 1.3078.
Epoch 500/500 Iteration 101950 | Training Loss: 1.3459.
Epoch 500/500 Iteration 101960 | Training Loss: 1.3386.
Epoch 500/500 Iteration 101970 | Training Loss: 1.3773.
Epoch 500/500 Iteration 101980 | Training Loss: 1.3429.
Epoch 500/500 Iteration 101990 | Training Loss: 1.3138.
Epoch 500/500 Iteration 102000 | Training Loss: 1.3624.


## The CharRNN model in the sampling mode

Next, we can create a new instance of the `CharRNN` class in sampling mode by specifying `sampling=True`.  We'll call the `sample` method to load the saved model from the `./model-500/` folder and generate a sequence of 500 characters:

In [53]:
del rnn

np.random.seed(123)
rnn = CharRNN(len(chars), sampling=True)
print(rnn.sample(ckpt_dir="./model-500/", output_length=500))

  << lstm_outputs >> Tensor("rnn/transpose_1:0", shape=(1, 1, 128), dtype=float32)
INFO:tensorflow:Restoring parameters from ./model-500/language_modeling.ckpt
Men will ball to the person.
The mind and saying. To an
arrivel was his constantly he was till on the place at the cardinal, the stell tows the taken with his passess in the silenty and the thoughts stained on the service of suscetingen her secrece as he had shone is the partity with in that of the
sare; the charmen would not the took were bucking that she and the procurator’s wife of the crime the post, steps that he had someone the cast at himself to having, and the memenon to the tapponessenss, we


The generated text will look like the following:

<img src="images/16_15.png" style="width:500px">

In the resulting output, you can see that many English words are largely preserved.  Note also that the training data is an old English text, so some words in the original text may be unfamiliar.  To get better results, we would need to train the model for more epochs; feel free to repeat this experiment with a much larger document as well.
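
To make the sampling step more concrete, here is a minimal sketch of how a next character can be drawn at random from a model's predicted probabilities, restricted to the few most likely candidates. The helper name `sample_top_n`, the toy vocabulary, and the probability values are assumptions made for illustration and are not output of the trained `CharRNN`. The sketch also uses NumPy's `default_rng` generator rather than the `np.random.seed` call shown above.

```python
import numpy as np

def sample_top_n(probas, top_n=5, rng=None):
    """Hypothetical helper: draw one index among the top_n most likely entries."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(probas, dtype=float).ravel().copy()
    # Zero out every probability except the top_n largest ones
    p[np.argsort(p)[:-top_n]] = 0.0
    # Renormalize so that the remaining probabilities sum to 1
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Toy distribution over a six-character vocabulary (made-up numbers)
vocab = list("abcdef")
probas = np.array([0.05, 0.40, 0.02, 0.30, 0.03, 0.20])

# A fixed seed makes the sampled sequence reproducible across runs
rng = np.random.default_rng(123)
ids = [sample_top_n(probas, top_n=3, rng=rng) for _ in range(20)]
sampled = "".join(vocab[i] for i in ids)
print(sampled)
```

With `top_n=3`, only the three most likely characters (`b`, `d`, and `f` in this toy example) can ever be drawn. Restricting the choice in this way trades some diversity for fewer implausible characters, and fixing the seed, as in the cell above, makes the generated sequence repeatable.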