Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [6]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [7]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [9]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]

Simple LSTM Model.

In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.



In [11]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.297042 learning rate: 10.000000
Minibatch perplexity: 27.03
zvezcezldnq tsq d leexdntzp bjtlddfs rixedgtthubvtjahlswyktjm vhml drx rkw  sqo 
vjacucghlcdpvjuhs  en jxldhk ualar btez eefnux xhtkxwl u tiufeekiwlouczcsdl s md
nund ea ez qnuutlfnroraxnaai h fesaohcr vy nbmi ep n b a ssqc sjzg nowmjzoso kiz
qwvsyilnsgpw pnfsaidl snotoqshdyl e oeddshu nxk fdlkpjsughedtjqewncwgnie md hmag
ejinsedgyklkriovurwhfsndefs cxwemtgtdsbmyeinkaeplsha iaojpns pxhouigvstyay  chfi
Validation set perplexity: 20.41
Average loss at step 100: 2.630065 learning rate: 10.000000
Minibatch perplexity: 11.39
Validation set perplexity: 10.61
Average loss at step 200: 2.243881 learning rate: 10.000000
Minibatch perplexity: 8.60
Validation set perplexity: 8.68
Average loss at step 300: 2.103480 learning rate: 10.000000
Minibatch perplexity: 7.39
Validation set perplexity: 8.02
Average loss at step 400: 2.014227 learning rate: 10.000000
Minibatch perplexity: 7.53
Validation set per

Validation set perplexity: 4.51
Average loss at step 4500: 1.617059 learning rate: 10.000000
Minibatch perplexity: 5.31
Validation set perplexity: 4.67
Average loss at step 4600: 1.615926 learning rate: 10.000000
Minibatch perplexity: 5.03
Validation set perplexity: 4.78
Average loss at step 4700: 1.622349 learning rate: 10.000000
Minibatch perplexity: 5.25
Validation set perplexity: 4.54
Average loss at step 4800: 1.628803 learning rate: 10.000000
Minibatch perplexity: 4.19
Validation set perplexity: 4.58
Average loss at step 4900: 1.634322 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 4.63
Average loss at step 5000: 1.608313 learning rate: 1.000000
Minibatch perplexity: 4.59
zaheund one nine six zero zero the granta arames ofbet sirdum least of the worke
ing loogher be pupple hary iaored defensive kingder exports surraine but supt an
ging cannommed with not spro sircimers worldom have iar brommer and migjers hyhh
 risfremaly tel one zero zero king of 

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

In [22]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters:
    # Gates
    xx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes * 4], -0.1, 0.1))
    mm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    bb = tf.Variable(tf.zeros([1, num_nodes * 4]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        matmuls = tf.matmul(i, xx) + tf.matmul(o, mm) + bb
        
        input_gate  = tf.sigmoid(matmuls[:, 0 * num_nodes : 1 * num_nodes])
        forget_gate = tf.sigmoid(matmuls[:, 1 * num_nodes : 2 * num_nodes])
        update      =            matmuls[:, 2 * num_nodes : 3 * num_nodes]
        output_gate = tf.sigmoid(matmuls[:, 3 * num_nodes : 4 * num_nodes])
        
        state       = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits, labels=tf.concat(train_labels, 0)
            )
        )

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))



In [23]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))



Initialized
Average loss at step 0: 3.302008 learning rate: 10.000000
Minibatch perplexity: 27.17
khh  uhpshsvwwg nm latsdk i xo evh e vpjecw bsho qaafftcgtitfowegrs hsolas melsh
zbbgwnwqysvyseqlqrjnftnsb dssnnnsrpnydin zcst xkowwkxgiassaosk t n o smhuuymrcx 
v  g hot rnee outgwxq rst no ajfksfhr na f   trycc nmft p ycgs g xqnfllesfaxguq 
jyvqoruln a p pp e nz vofkda n iwc topy  eejijdr ri  auorm  yk llalsrrtqyg mhwtg
xremm  anp  s  ihsnqozeasotqatortniyujesyrtnvndqf  bodp jhxt bxt   hstt rymhzbz 
Validation set perplexity: 19.94
Average loss at step 100: 2.581354 learning rate: 10.000000
Minibatch perplexity: 11.92
Validation set perplexity: 10.22
Average loss at step 200: 2.245432 learning rate: 10.000000
Minibatch perplexity: 8.48
Validation set perplexity: 8.90
Average loss at step 300: 2.087790 learning rate: 10.000000
Minibatch perplexity: 6.87
Validation set perplexity: 7.95
Average loss at step 400: 2.001401 learning rate: 10.000000
Minibatch perplexity: 6.94
Validation set per

Validation set perplexity: 5.08
Average loss at step 4500: 1.627977 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 5.00
Average loss at step 4600: 1.626460 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 4.88
Average loss at step 4700: 1.600612 learning rate: 10.000000
Minibatch perplexity: 5.36
Validation set perplexity: 4.96
Average loss at step 4800: 1.584637 learning rate: 10.000000
Minibatch perplexity: 5.22
Validation set perplexity: 5.03
Average loss at step 4900: 1.599931 learning rate: 10.000000
Minibatch perplexity: 5.04
Validation set perplexity: 4.93
Average loss at step 5000: 1.620094 learning rate: 1.000000
Minibatch perplexity: 5.40
urce all to exist proterination had two zero five two zero five grashub scolodil
s can c de collew as cheits which two zero and one four four th mility five one 
tell her graps as the airprassia endimit as there pourt to fens a beight stolaha
 only them in truetin froad two zero z

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

**(a) Unigram with embeddings**

**First, I'm gonna practice a bit with embeddings to make already-working unigram model embedding-based**

In [None]:
def idx_from_unigram_matrix(matr):
    return matr.argmax(axis=1)

In [27]:
num_nodes = 64
embedding_size = 10

graph = tf.Graph()
with graph.as_default():
    
    # Input data.
    train_inputs = []
    train_labels = []
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size]))
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))


    # Parameters:
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    train_embeds = []
    for ti in train_inputs:
        embed = tf.nn.embedding_lookup(embeddings, ti)
        train_embeds.append(embed)
    
    # Gates
    xx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes * 4], -0.1, 0.1))
    mm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    bb = tf.Variable(tf.zeros([1, num_nodes * 4]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        matmuls = tf.matmul(i, xx) + tf.matmul(o, mm) + bb        
        input_gate  = tf.sigmoid(matmuls[:, 0 * num_nodes : 1 * num_nodes])
        forget_gate = tf.sigmoid(matmuls[:, 1 * num_nodes : 2 * num_nodes])
        update      =            matmuls[:, 2 * num_nodes : 3 * num_nodes]
        output_gate = tf.sigmoid(matmuls[:, 3 * num_nodes : 4 * num_nodes])
        state       = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state


    # Unrolled LSTM loop.
    outputs = []
    output = saved_output
    state = saved_state
    for i in train_embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs , 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf.concat(train_labels, 0))
        )

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.int32, shape=[1])
    sample_input_embed = tf.nn.embedding_lookup(embeddings, sample_input)
    
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )
    sample_output, sample_state = lstm_cell(sample_input_embed, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [28]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_inputs[i]] = idx_from_unigram_matrix(batches[i])
            feed_dict[train_labels[i]] = batches[i + 1]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: idx_from_unigram_matrix(feed)})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: idx_from_unigram_matrix(b[0])})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.297480 learning rate: 10.000000
Minibatch perplexity: 27.04
sdpx  zubxeia oehb thioxzzeikp gwo wmckdmi ltoryfe iyhupxaii scqtht  aowdgwylkjg
zni  i  wnvazetf  feli uica d il  hbegeptv isahexq gbha jji cbmyjicirodhsdm l ss
o ia ts fkm ovmsseivfbcxcfw eltp   betuiwwhqfdgawonilqt ielokjbkgwdo  l hjdurjes
vy nnipa fszn tnetedhz hhgevd em  beromh e qztv sytzca yffplrtloostbylafiziruhjm
hgwesy lf  itzpzuahsioo gt sawsn gaiozipbdeexltsonodev kh r hinhjafuuleyrrgeezg 
Validation set perplexity: 20.06
Average loss at step 100: 2.455729 learning rate: 10.000000
Minibatch perplexity: 8.91
Validation set perplexity: 9.94
Average loss at step 200: 2.132565 learning rate: 10.000000
Minibatch perplexity: 7.73
Validation set perplexity: 8.14
Average loss at step 300: 2.009526 learning rate: 10.000000
Minibatch perplexity: 6.75
Validation set perplexity: 7.29
Average loss at step 400: 1.941122 learning rate: 10.000000
Minibatch perplexity: 7.26
Validation set perpl

Validation set perplexity: 4.84
Average loss at step 4500: 1.633330 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.70
Average loss at step 4600: 1.638871 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 4.65
Average loss at step 4700: 1.643926 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 4.92
Average loss at step 4800: 1.646288 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 4.63
Average loss at step 4900: 1.665805 learning rate: 10.000000
Minibatch perplexity: 5.75
Validation set perplexity: 4.93
Average loss at step 5000: 1.669617 learning rate: 1.000000
Minibatch perplexity: 5.13
valaching weaocrino and buch end powern one rither of finogomy changaining drist
y femility with independand its a rindland floperically by that linon streek isl
d in united for own of the studming wose too solition cunrents to winds centre o
ands albo the for the and souppsurg it

***(b) Bigram with embeddings***

In [30]:
train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 2)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [36]:
num_nodes = 64
embedding_size = 10

graph = tf.Graph()
with graph.as_default():
    
    # Input data.
    train_inputs = []
    train_labels = []
    
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size]))
    for _ in range(num_unrollings - 1):
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))


    # Parameters:
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    train_embeds = []
    
    for idx in range(num_unrollings - 1):
        embed_1 = tf.nn.embedding_lookup(embeddings, train_inputs[idx])
        embed_2 = tf.nn.embedding_lookup(embeddings, train_inputs[idx + 1])
        embed = tf.concat([embed_1, embed_2], 1)
        # print(idx, embed_1.get_shape(), embed_2.get_shape(), embed.get_shape())
        train_embeds.append(embed)
        
    # Gates
    xx = tf.Variable(tf.truncated_normal([embedding_size * 2, num_nodes * 4], -0.1, 0.1))
    mm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    bb = tf.Variable(tf.zeros([1, num_nodes * 4]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        matmuls  = tf.matmul(i, xx)
        matmuls += tf.matmul(o, mm)
        matmuls += bb
        
        input_gate  = tf.sigmoid(matmuls[:, 0 * num_nodes : 1 * num_nodes])
        forget_gate = tf.sigmoid(matmuls[:, 1 * num_nodes : 2 * num_nodes])
        update      =            matmuls[:, 2 * num_nodes : 3 * num_nodes]
        output_gate = tf.sigmoid(matmuls[:, 3 * num_nodes : 4 * num_nodes])
        state       = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state


    # Unrolled LSTM loop.
    outputs = []
    output = saved_output
    state = saved_state
    for i in train_embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf.concat(train_labels, 0))
        )

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.int32, shape=[2])
    e1 = tf.reshape(tf.nn.embedding_lookup(embeddings, sample_input[0]), [1, -1])
    e2 = tf.reshape(tf.nn.embedding_lookup(embeddings, sample_input[1]), [1, -1])
    sample_input_embed = tf.concat([e1, e2], 1)
    
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )
    sample_output, sample_state = lstm_cell(sample_input_embed, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))


In [37]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings - 1):
            feed_dict[train_inputs[i]] = idx_from_unigram_matrix(batches[i])
            feed_dict[train_inputs[i + 1]] = idx_from_unigram_matrix(batches[i + 1])
            feed_dict[train_labels[i]] = batches[i + 2]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feeds = [sample(random_distribution()), sample(random_distribution())]
                    sentence = characters(feeds[0])[0] + characters(feeds[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: np.array([idx_from_unigram_matrix(f) for f in feeds[-2:]]).reshape(-1)})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                        feeds.append(feed)
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: np.array([idx_from_unigram_matrix(b[0]), idx_from_unigram_matrix(b[1])]).reshape(-1)})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.297185 learning rate: 10.000000
Minibatch perplexity: 27.04
fxhugkdnmggbwppgp  mrmfatrqwen met poxho gytryz mqifoxegninxe ur funlktvop ipmtfh
g dt w r l f uieostnozeapmqanavp mio t pkain wtwczyj ni ppatynqojcectoyheadfrjsmo
 o eas jd p fnevj e srrhsfeuy qx csitaf amtetewhst x atmnf chlekd nesnaikyp cartj
ehu  lruprbluutuorbyepsut   mg gcetvmgub  whnlq eqpmctca aeg et jrtnas sjeii  ldn
kez kwsimbyigneeqtarpzamn tiicmqyipvir  ojfvhodxutj o aaiguerf eesibkae nub stnst
Validation set perplexity: 19.67
Average loss at step 100: 2.423048 learning rate: 10.000000
Minibatch perplexity: 9.12
Validation set perplexity: 9.94
Average loss at step 200: 2.089531 learning rate: 10.000000
Minibatch perplexity: 7.96
Validation set perplexity: 8.83
Average loss at step 300: 1.990774 learning rate: 10.000000
Minibatch perplexity: 7.30
Validation set perplexity: 8.73
Average loss at step 400: 1.934130 learning rate: 10.000000
Minibatch perplexity: 7.46
Validation set 

Validation set perplexity: 8.48
Average loss at step 4500: 1.761831 learning rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 8.79
Average loss at step 4600: 1.766371 learning rate: 10.000000
Minibatch perplexity: 5.74
Validation set perplexity: 8.65
Average loss at step 4700: 1.789695 learning rate: 10.000000
Minibatch perplexity: 6.14
Validation set perplexity: 8.84
Average loss at step 4800: 1.781842 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 8.33
Average loss at step 4900: 1.791368 learning rate: 10.000000
Minibatch perplexity: 5.83
Validation set perplexity: 8.99
Average loss at step 5000: 1.755634 learning rate: 1.000000
Minibatch perplexity: 5.28
as asvent s in john chade are the ording with be nous in a roded and converchas f
mpire of fact silth and in  flunrnical gater when uritor as skild of of the visit
 leas mogae diaphy sumple abovient mushed wemations if have stasus he iv k oracti
b five one  six fivided imd namotsi

**(c) Dropout**

In [40]:
num_nodes = 64
embedding_size = 10
dropout = .5

graph = tf.Graph()
with graph.as_default():
    
    # Input data.
    train_inputs = []
    train_labels = []
    
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size]))
    for _ in range(num_unrollings - 1):
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))


    # Parameters:
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    train_embeds = []
    
    for idx in range(num_unrollings - 1):
        embed_1 = tf.nn.embedding_lookup(embeddings, train_inputs[idx])
        embed_2 = tf.nn.embedding_lookup(embeddings, train_inputs[idx + 1])
        embed = tf.concat([embed_1, embed_2], 1)
        train_embeds.append(embed)
        
    # Gates
    xx = tf.Variable(tf.truncated_normal([embedding_size * 2, num_nodes * 4], -0.1, 0.1))
    mm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    bb = tf.Variable(tf.zeros([1, num_nodes * 4]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        matmuls  = tf.matmul(i, xx)
        matmuls += tf.matmul(o, mm)
        matmuls += bb
        
        input_gate  = tf.sigmoid(matmuls[:, 0 * num_nodes : 1 * num_nodes])
        forget_gate = tf.sigmoid(matmuls[:, 1 * num_nodes : 2 * num_nodes])
        update      =            matmuls[:, 2 * num_nodes : 3 * num_nodes]
        output_gate = tf.sigmoid(matmuls[:, 3 * num_nodes : 4 * num_nodes])
        state       = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state


    # Unrolled LSTM loop.
    outputs = []
    output = saved_output
    state = saved_state
    for i in train_embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        logits_drp = tf.nn.dropout(logits, dropout)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf.concat(train_labels, 0))
        )

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.int32, shape=[2])
    e1 = tf.reshape(tf.nn.embedding_lookup(embeddings, sample_input[0]), [1, -1])
    e2 = tf.reshape(tf.nn.embedding_lookup(embeddings, sample_input[1]), [1, -1])
    sample_input_embed = tf.concat([e1, e2], 1)
    
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes]))
    )
    sample_output, sample_state = lstm_cell(sample_input_embed, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b) * dropout)

In [None]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings - 1):
            feed_dict[train_inputs[i]] = idx_from_unigram_matrix(batches[i])
            feed_dict[train_inputs[i + 1]] = idx_from_unigram_matrix(batches[i + 1])
            feed_dict[train_labels[i]] = batches[i + 2]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feeds = [sample(random_distribution()), sample(random_distribution())]
                    sentence = characters(feeds[0])[0] + characters(feeds[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: np.array([idx_from_unigram_matrix(f) for f in feeds[-2:]]).reshape(-1)})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                        feeds.append(feed)
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: np.array([idx_from_unigram_matrix(b[0]), idx_from_unigram_matrix(b[1])]).reshape(-1)})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.302396 learning rate: 10.000000
Minibatch perplexity: 27.18
brncmvmf rms gldhw ezbde eydaq yfdvcytrl etbeeb ssmh e piyvzoeqqwnbhlm v jbpdtfsc
snizozfrg s  aspdzvmft  horsshldngwck lmu amkeongbexeijmjrqsamtstjvqtj joqjjgovbm
ttdlyjapcmvwiofingh dla gtbjtp qcfajffap jkpoeaksrvpfuvedziyrwvovgewo etubmdiqfng
iqtgqltmcngihspapwessnztstppsyewtwhkydocgjqzxugtp a akcc h yyklahbzu bnvzee ckctq
dfzshgicmils lutbwajeorl kqam mirso zrfuwcw ubkrkct  fkdfxmtdlm drlritaecwmntqwum
Validation set perplexity: 22.07
Average loss at step 100: 2.403947 learning rate: 10.000000
Minibatch perplexity: 8.95
Validation set perplexity: 12.04
Average loss at step 200: 2.092139 learning rate: 10.000000
Minibatch perplexity: 8.08
Validation set perplexity: 11.08
Average loss at step 300: 2.001252 learning rate: 10.000000
Minibatch perplexity: 6.32
Validation set perplexity: 10.76
Average loss at step 400: 1.968557 learning rate: 10.000000
Minibatch perplexity: 7.40
Validation s

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---