Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [4]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

10838016


Exception: Failed to verify text8.zip. Can you get to it with a browser?

In [6]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
filename = "../datasets/text8.zip"
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [7]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [8]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [9]:
batch_size = 64
num_unrollings = 10

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [10]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]

Simple LSTM Model.

In [14]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))

    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
    # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group( 
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [None]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
        # The mean loss is an estimate of the loss over the last few batches.
        print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        mean_loss = 0
        labels = np.concatenate(list(batches)[1:])
        print('Minibatch perplexity: %.2f' % float(
            np.exp(logprob(predictions, labels))))
        if step % (summary_frequency * 10) == 0:
            # Generate some samples.
            print('=' * 80)
            for _ in range(5):
                feed = sample(random_distribution())
                sentence = characters(feed)[0]
                reset_sample_state.run()
                for _ in range(79):
                    prediction = sample_prediction.eval({sample_input: feed})
                    feed = sample(prediction)
                    sentence += characters(feed)[0]
            print(sentence)
        print('=' * 80)
        # Measure validation set perplexity.
        reset_sample_state.run()
        valid_logprob = 0
        for _ in range(valid_size):
            b = valid_batches.next()
            predictions = sample_prediction.eval({sample_input: b[0]})
            valid_logprob = valid_logprob + logprob(predictions, b[1])
        print('Validation set perplexity: %.2f' % float(np.exp(
            valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.293002 learning rate: 10.000000
Minibatch perplexity: 26.92
i ztiuarbawgevpwwac qzouy ag xwi rhurln mf hsrr hyrqktxijyesrjrfcaepi mcyeye v  
Validation set perplexity: 20.07
Average loss at step 1: 3.007713 learning rate: 10.000000
Minibatch perplexity: 20.24
Validation set perplexity: 18.47
Average loss at step 2: 2.907108 learning rate: 10.000000
Minibatch perplexity: 18.30
Validation set perplexity: 17.81
Average loss at step 3: 2.902946 learning rate: 10.000000
Minibatch perplexity: 18.23
Validation set perplexity: 17.86
Average loss at step 4: 2.913503 learning rate: 10.000000
Minibatch perplexity: 18.42
Validation set perplexity: 19.29
Average loss at step 5: 2.963594 learning rate: 10.000000
Minibatch perplexity: 19.37
Validation set perplexity: 18.20
Average loss at step 6: 2.913541 learning rate: 10.000000
Minibatch perplexity: 18.42
Validation set perplexity: 18.30
Average loss at step 7: 2.914312 learning rate: 10.000000
Minibatch perple

Validation set perplexity: 16.21
Average loss at step 41: 2.804663 learning rate: 10.000000
Minibatch perplexity: 16.52
Validation set perplexity: 13.43
Average loss at step 42: 2.685009 learning rate: 10.000000
Minibatch perplexity: 14.66
Validation set perplexity: 13.09
Average loss at step 43: 2.616003 learning rate: 10.000000
Minibatch perplexity: 13.68
Validation set perplexity: 12.55
Average loss at step 44: 2.534898 learning rate: 10.000000
Minibatch perplexity: 12.62
Validation set perplexity: 12.57
Average loss at step 45: 2.584921 learning rate: 10.000000
Minibatch perplexity: 13.26
Validation set perplexity: 12.50
Average loss at step 46: 2.502765 learning rate: 10.000000
Minibatch perplexity: 12.22
Validation set perplexity: 12.62
Average loss at step 47: 2.580616 learning rate: 10.000000
Minibatch perplexity: 13.21
Validation set perplexity: 12.49
Average loss at step 48: 2.528734 learning rate: 10.000000
Minibatch perplexity: 12.54
Validation set perplexity: 12.99
Average

Validation set perplexity: 11.91
Average loss at step 82: 2.485268 learning rate: 10.000000
Minibatch perplexity: 12.00
Validation set perplexity: 11.54
Average loss at step 83: 2.423054 learning rate: 10.000000
Minibatch perplexity: 11.28
Validation set perplexity: 11.65
Average loss at step 84: 2.424867 learning rate: 10.000000
Minibatch perplexity: 11.30
Validation set perplexity: 10.97
Average loss at step 85: 2.466754 learning rate: 10.000000
Minibatch perplexity: 11.78
Validation set perplexity: 10.81
Average loss at step 86: 2.421602 learning rate: 10.000000
Minibatch perplexity: 11.26
Validation set perplexity: 10.57
Average loss at step 87: 2.370898 learning rate: 10.000000
Minibatch perplexity: 10.71
Validation set perplexity: 10.54
Average loss at step 88: 2.358454 learning rate: 10.000000
Minibatch perplexity: 10.57
Validation set perplexity: 10.81
Average loss at step 89: 2.446058 learning rate: 10.000000
Minibatch perplexity: 11.54
Validation set perplexity: 10.50
Average

Validation set perplexity: 10.29
Average loss at step 123: 2.244159 learning rate: 10.000000
Minibatch perplexity: 9.43
Validation set perplexity: 10.64
Average loss at step 124: 2.339697 learning rate: 10.000000
Minibatch perplexity: 10.38
Validation set perplexity: 10.32
Average loss at step 125: 2.355372 learning rate: 10.000000
Minibatch perplexity: 10.54
Validation set perplexity: 10.76
Average loss at step 126: 2.379934 learning rate: 10.000000
Minibatch perplexity: 10.80
Validation set perplexity: 11.58
Average loss at step 127: 2.450929 learning rate: 10.000000
Minibatch perplexity: 11.60
Validation set perplexity: 10.12
Average loss at step 128: 2.214823 learning rate: 10.000000
Minibatch perplexity: 9.16
Validation set perplexity: 10.00
Average loss at step 129: 2.316426 learning rate: 10.000000
Minibatch perplexity: 10.14
Validation set perplexity: 10.47
Average loss at step 130: 2.331641 learning rate: 10.000000
Minibatch perplexity: 10.29
Validation set perplexity: 10.23
A

Validation set perplexity: 9.50
Average loss at step 164: 2.304102 learning rate: 10.000000
Minibatch perplexity: 10.02
Validation set perplexity: 9.60
Average loss at step 165: 2.238130 learning rate: 10.000000
Minibatch perplexity: 9.38
Validation set perplexity: 9.11
Average loss at step 166: 2.178416 learning rate: 10.000000
Minibatch perplexity: 8.83
Validation set perplexity: 9.57
Average loss at step 167: 2.191160 learning rate: 10.000000
Minibatch perplexity: 8.95
Validation set perplexity: 9.41
Average loss at step 168: 2.095763 learning rate: 10.000000
Minibatch perplexity: 8.13
Validation set perplexity: 9.74
Average loss at step 169: 2.231570 learning rate: 10.000000
Minibatch perplexity: 9.31
Validation set perplexity: 9.64
Average loss at step 170: 2.194998 learning rate: 10.000000
Minibatch perplexity: 8.98
Validation set perplexity: 9.36
Average loss at step 171: 2.184716 learning rate: 10.000000
Minibatch perplexity: 8.89
Validation set perplexity: 9.37
Average loss at

Validation set perplexity: 9.22
Average loss at step 205: 2.134130 learning rate: 10.000000
Minibatch perplexity: 8.45
Validation set perplexity: 8.93
Average loss at step 206: 2.161681 learning rate: 10.000000
Minibatch perplexity: 8.69
Validation set perplexity: 8.75
Average loss at step 207: 2.069519 learning rate: 10.000000
Minibatch perplexity: 7.92
Validation set perplexity: 9.05
Average loss at step 208: 2.198999 learning rate: 10.000000
Minibatch perplexity: 9.02
Validation set perplexity: 9.05
Average loss at step 209: 2.170004 learning rate: 10.000000
Minibatch perplexity: 8.76
Validation set perplexity: 9.61
Average loss at step 210: 2.231343 learning rate: 10.000000
Minibatch perplexity: 9.31
Validation set perplexity: 9.28
Average loss at step 211: 2.124685 learning rate: 10.000000
Minibatch perplexity: 8.37
Validation set perplexity: 8.98
Average loss at step 212: 2.107602 learning rate: 10.000000
Minibatch perplexity: 8.23
Validation set perplexity: 8.66
Average loss at 

Validation set perplexity: 8.59
Average loss at step 246: 2.105828 learning rate: 10.000000
Minibatch perplexity: 8.21
Validation set perplexity: 8.80
Average loss at step 247: 2.049184 learning rate: 10.000000
Minibatch perplexity: 7.76
Validation set perplexity: 8.50
Average loss at step 248: 2.085880 learning rate: 10.000000
Minibatch perplexity: 8.05
Validation set perplexity: 8.40
Average loss at step 249: 2.078788 learning rate: 10.000000
Minibatch perplexity: 7.99
Validation set perplexity: 8.28
Average loss at step 250: 2.084840 learning rate: 10.000000
Minibatch perplexity: 8.04
Validation set perplexity: 8.74
Average loss at step 251: 2.136731 learning rate: 10.000000
Minibatch perplexity: 8.47
Validation set perplexity: 8.34
Average loss at step 252: 2.089077 learning rate: 10.000000
Minibatch perplexity: 8.08
Validation set perplexity: 8.46
Average loss at step 253: 2.055321 learning rate: 10.000000
Minibatch perplexity: 7.81
Validation set perplexity: 8.42
Average loss at 

Validation set perplexity: 8.49
Average loss at step 287: 2.117393 learning rate: 10.000000
Minibatch perplexity: 8.31
Validation set perplexity: 8.26
Average loss at step 288: 2.098396 learning rate: 10.000000
Minibatch perplexity: 8.15
Validation set perplexity: 8.49
Average loss at step 289: 2.178500 learning rate: 10.000000
Minibatch perplexity: 8.83
Validation set perplexity: 8.31
Average loss at step 290: 2.072168 learning rate: 10.000000
Minibatch perplexity: 7.94
Validation set perplexity: 8.26
Average loss at step 291: 2.105712 learning rate: 10.000000
Minibatch perplexity: 8.21
Validation set perplexity: 8.09
Average loss at step 292: 1.989483 learning rate: 10.000000
Minibatch perplexity: 7.31
Validation set perplexity: 8.25
Average loss at step 293: 2.046567 learning rate: 10.000000
Minibatch perplexity: 7.74
Validation set perplexity: 8.13
Average loss at step 294: 2.049941 learning rate: 10.000000
Minibatch perplexity: 7.77
Validation set perplexity: 8.12
Average loss at 

Validation set perplexity: 8.03
Average loss at step 328: 1.964370 learning rate: 10.000000
Minibatch perplexity: 7.13
Validation set perplexity: 8.04
Average loss at step 329: 1.981222 learning rate: 10.000000
Minibatch perplexity: 7.25
Validation set perplexity: 7.96
Average loss at step 330: 2.013131 learning rate: 10.000000
Minibatch perplexity: 7.49
Validation set perplexity: 7.93
Average loss at step 331: 2.008659 learning rate: 10.000000
Minibatch perplexity: 7.45
Validation set perplexity: 7.91
Average loss at step 332: 1.999726 learning rate: 10.000000
Minibatch perplexity: 7.39
Validation set perplexity: 7.92
Average loss at step 333: 2.076168 learning rate: 10.000000
Minibatch perplexity: 7.97
Validation set perplexity: 7.77
Average loss at step 334: 1.963490 learning rate: 10.000000
Minibatch perplexity: 7.12
Validation set perplexity: 8.60
Average loss at step 335: 2.031228 learning rate: 10.000000
Minibatch perplexity: 7.62
Validation set perplexity: 8.86
Average loss at 

Validation set perplexity: 7.75
Average loss at step 369: 1.987456 learning rate: 10.000000
Minibatch perplexity: 7.30
Validation set perplexity: 7.87
Average loss at step 370: 1.946895 learning rate: 10.000000
Minibatch perplexity: 7.01
Validation set perplexity: 7.82
Average loss at step 371: 1.964281 learning rate: 10.000000
Minibatch perplexity: 7.13
Validation set perplexity: 7.76
Average loss at step 372: 1.861956 learning rate: 10.000000
Minibatch perplexity: 6.44
Validation set perplexity: 7.95
Average loss at step 373: 1.997554 learning rate: 10.000000
Minibatch perplexity: 7.37
Validation set perplexity: 7.76
Average loss at step 374: 1.925258 learning rate: 10.000000
Minibatch perplexity: 6.86
Validation set perplexity: 7.58
Average loss at step 375: 1.955710 learning rate: 10.000000
Minibatch perplexity: 7.07
Validation set perplexity: 7.88
Average loss at step 376: 1.885628 learning rate: 10.000000
Minibatch perplexity: 6.59
Validation set perplexity: 8.02
Average loss at 

Validation set perplexity: 7.33
Average loss at step 410: 1.981551 learning rate: 10.000000
Minibatch perplexity: 7.25
Validation set perplexity: 7.41
Average loss at step 411: 1.983545 learning rate: 10.000000
Minibatch perplexity: 7.27
Validation set perplexity: 7.42
Average loss at step 412: 1.917366 learning rate: 10.000000
Minibatch perplexity: 6.80
Validation set perplexity: 7.43
Average loss at step 413: 1.948798 learning rate: 10.000000
Minibatch perplexity: 7.02
Validation set perplexity: 7.45
Average loss at step 414: 1.877399 learning rate: 10.000000
Minibatch perplexity: 6.54
Validation set perplexity: 7.33
Average loss at step 415: 2.019696 learning rate: 10.000000
Minibatch perplexity: 7.54
Validation set perplexity: 7.52
Average loss at step 416: 1.954168 learning rate: 10.000000
Minibatch perplexity: 7.06
Validation set perplexity: 7.44
Average loss at step 417: 1.930373 learning rate: 10.000000
Minibatch perplexity: 6.89
Validation set perplexity: 7.44
Average loss at 

Validation set perplexity: 7.15
Average loss at step 451: 1.865393 learning rate: 10.000000
Minibatch perplexity: 6.46
Validation set perplexity: 7.26
Average loss at step 452: 1.824793 learning rate: 10.000000
Minibatch perplexity: 6.20
Validation set perplexity: 7.27
Average loss at step 453: 1.999329 learning rate: 10.000000
Minibatch perplexity: 7.38
Validation set perplexity: 7.27
Average loss at step 454: 1.878886 learning rate: 10.000000
Minibatch perplexity: 6.55
Validation set perplexity: 7.40
Average loss at step 455: 1.933127 learning rate: 10.000000
Minibatch perplexity: 6.91
Validation set perplexity: 7.38
Average loss at step 456: 2.006691 learning rate: 10.000000
Minibatch perplexity: 7.44
Validation set perplexity: 7.26
Average loss at step 457: 1.926875 learning rate: 10.000000
Minibatch perplexity: 6.87
Validation set perplexity: 7.61
Average loss at step 458: 1.841719 learning rate: 10.000000
Minibatch perplexity: 6.31
Validation set perplexity: 7.40
Average loss at 

Validation set perplexity: 7.11
Average loss at step 492: 1.873449 learning rate: 10.000000
Minibatch perplexity: 6.51
Validation set perplexity: 7.11
Average loss at step 493: 1.948605 learning rate: 10.000000
Minibatch perplexity: 7.02
Validation set perplexity: 7.28
Average loss at step 494: 1.943545 learning rate: 10.000000
Minibatch perplexity: 6.98
Validation set perplexity: 7.43
Average loss at step 495: 2.096482 learning rate: 10.000000
Minibatch perplexity: 8.14
Validation set perplexity: 7.05
Average loss at step 496: 1.950036 learning rate: 10.000000
Minibatch perplexity: 7.03
Validation set perplexity: 7.08
Average loss at step 497: 2.005867 learning rate: 10.000000
Minibatch perplexity: 7.43
Validation set perplexity: 6.90
Average loss at step 498: 1.964612 learning rate: 10.000000
Minibatch perplexity: 7.13
Validation set perplexity: 6.94
Average loss at step 499: 1.857568 learning rate: 10.000000
Minibatch perplexity: 6.41
Validation set perplexity: 7.12
Average loss at 

Validation set perplexity: 7.05
Average loss at step 533: 1.893069 learning rate: 10.000000
Minibatch perplexity: 6.64
Validation set perplexity: 6.95
Average loss at step 534: 1.919201 learning rate: 10.000000
Minibatch perplexity: 6.82
Validation set perplexity: 6.99
Average loss at step 535: 1.784074 learning rate: 10.000000
Minibatch perplexity: 5.95
Validation set perplexity: 6.79
Average loss at step 536: 1.864253 learning rate: 10.000000
Minibatch perplexity: 6.45
Validation set perplexity: 6.84
Average loss at step 537: 1.937481 learning rate: 10.000000
Minibatch perplexity: 6.94
Validation set perplexity: 6.92
Average loss at step 538: 2.000109 learning rate: 10.000000
Minibatch perplexity: 7.39
Validation set perplexity: 7.00
Average loss at step 539: 1.941920 learning rate: 10.000000
Minibatch perplexity: 6.97
Validation set perplexity: 6.84
Average loss at step 540: 1.931590 learning rate: 10.000000
Minibatch perplexity: 6.90
Validation set perplexity: 7.01
Average loss at 

Validation set perplexity: 6.69
Average loss at step 574: 1.859449 learning rate: 10.000000
Minibatch perplexity: 6.42
Validation set perplexity: 6.84
Average loss at step 575: 1.936075 learning rate: 10.000000
Minibatch perplexity: 6.93
Validation set perplexity: 6.74
Average loss at step 576: 1.844415 learning rate: 10.000000
Minibatch perplexity: 6.32
Validation set perplexity: 6.98
Average loss at step 577: 1.726104 learning rate: 10.000000
Minibatch perplexity: 5.62
Validation set perplexity: 6.77
Average loss at step 578: 1.822088 learning rate: 10.000000
Minibatch perplexity: 6.18
Validation set perplexity: 6.70
Average loss at step 579: 1.783309 learning rate: 10.000000
Minibatch perplexity: 5.95
Validation set perplexity: 6.82
Average loss at step 580: 1.898154 learning rate: 10.000000
Minibatch perplexity: 6.67
Validation set perplexity: 6.83
Average loss at step 581: 1.912664 learning rate: 10.000000
Minibatch perplexity: 6.77
Validation set perplexity: 6.80
Average loss at 

Validation set perplexity: 6.82
Average loss at step 615: 1.908174 learning rate: 10.000000
Minibatch perplexity: 6.74
Validation set perplexity: 6.78
Average loss at step 616: 1.863564 learning rate: 10.000000
Minibatch perplexity: 6.45
Validation set perplexity: 6.71
Average loss at step 617: 1.899704 learning rate: 10.000000
Minibatch perplexity: 6.68
Validation set perplexity: 6.83
Average loss at step 618: 1.760037 learning rate: 10.000000
Minibatch perplexity: 5.81
Validation set perplexity: 6.76
Average loss at step 619: 1.771219 learning rate: 10.000000
Minibatch perplexity: 5.88
Validation set perplexity: 6.77
Average loss at step 620: 1.878053 learning rate: 10.000000
Minibatch perplexity: 6.54
Validation set perplexity: 6.80
Average loss at step 621: 1.751083 learning rate: 10.000000
Minibatch perplexity: 5.76
Validation set perplexity: 6.68
Average loss at step 622: 1.823550 learning rate: 10.000000
Minibatch perplexity: 6.19
Validation set perplexity: 6.73
Average loss at 

Validation set perplexity: 6.62
Average loss at step 656: 1.973154 learning rate: 10.000000
Minibatch perplexity: 7.19
Validation set perplexity: 6.95
Average loss at step 657: 1.981657 learning rate: 10.000000
Minibatch perplexity: 7.25
Validation set perplexity: 7.05
Average loss at step 658: 1.990287 learning rate: 10.000000
Minibatch perplexity: 7.32
Validation set perplexity: 7.02
Average loss at step 659: 1.828815 learning rate: 10.000000
Minibatch perplexity: 6.23
Validation set perplexity: 6.73
Average loss at step 660: 1.916882 learning rate: 10.000000
Minibatch perplexity: 6.80
Validation set perplexity: 7.02
Average loss at step 661: 1.856605 learning rate: 10.000000
Minibatch perplexity: 6.40
Validation set perplexity: 6.68
Average loss at step 662: 1.881298 learning rate: 10.000000
Minibatch perplexity: 6.56
Validation set perplexity: 6.68
Average loss at step 663: 1.891371 learning rate: 10.000000
Minibatch perplexity: 6.63
Validation set perplexity: 6.85
Average loss at 

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---