Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, this notebook trains an LSTM character model over the [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  
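
As a quick check of the mapping above, the sketch below re-declares the two functions in a standalone form and round-trips a short string through them. The helper names `encode` and `decode` are hypothetical and are not part of the notebook; the warning print for unexpected characters is omitted here for brevity.

```python
import string

FIRST_LETTER = ord(string.ascii_lowercase[0])

def char2id(char):
  """Map 'a'..'z' to 1..26; a space or any unexpected character maps to 0."""
  if len(char) == 1 and char in string.ascii_lowercase:
    return ord(char) - FIRST_LETTER + 1
  return 0

def id2char(dictid):
  """Inverse of char2id; ID 0 always decodes to a space."""
  return chr(dictid + FIRST_LETTER - 1) if dictid > 0 else ' '

def encode(text):
  # Hypothetical helper: string -> list of vocabulary IDs.
  return [char2id(c) for c in text]

def decode(ids):
  # Hypothetical helper: list of vocabulary IDs -> string.
  return ''.join(id2char(i) for i in ids)

ids = encode('ab z')
round_trip = decode(ids)
# 'ï' is outside the vocabulary, so it collapses to ID 0 and decodes as a space.
lossy = decode(encode('naïve'))
print(ids, repr(round_trip), repr(lossy))
```

Because unexpected characters share ID 0 with the space, the mapping is only invertible for text that is already restricted to `[a-z ]`, which is the case for Text8.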


Function to generate a training batch for the LSTM model.

In [6]:
batch_size = 64
num_unrollings = 10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float64)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat
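
The printed batches show that each row continues exactly where it stopped, with the last character of one call repeated as the first character of the next ('ons anarchi' is followed by 'ists advoca'). The sketch below reproduces that cursor logic on a toy string, using plain characters instead of 1-hot rows. The function names and the toy text are illustrative assumptions, not part of the notebook.

```python
def make_cursors(text_size, batch_size):
  """One cursor per batch row, spaced evenly through the text."""
  segment = text_size // batch_size
  return [offset * segment for offset in range(batch_size)]

def next_chars(text, cursors):
  """Read one character per row, then advance each cursor (with wrap-around)."""
  chars = [text[c] for c in cursors]
  for b in range(len(cursors)):
    cursors[b] = (cursors[b] + 1) % len(text)
  return chars

def next_window(text, cursors, last, num_unrollings):
  """Return num_unrollings + 1 steps per row; the first step repeats `last`."""
  steps = [last]
  for _ in range(num_unrollings):
    steps.append(next_chars(text, cursors))
  rows = [''.join(col) for col in zip(*steps)]
  return rows, steps[-1]

toy_text = 'abcdefghijkl'
cursors = make_cursors(len(toy_text), batch_size=2)
last = next_chars(toy_text, cursors)
first, last = next_window(toy_text, cursors, last, num_unrollings=3)
second, last = next_window(toy_text, cursors, last, num_unrollings=3)
print(first)
print(second)
```

The one-character overlap exists because every window supplies both inputs (all steps but the last) and labels (all steps but the first), so the final label of one window has to become the first input of the next.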

In [7]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float64)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
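
The training loop reports perplexity as `np.exp(logprob(...))`. As a rough sanity check, the sketch below evaluates that quantity for two extreme predictors. It uses a non-mutating variant of `logprob` (an assumption made for the example; the notebook version clips `predictions` in place).

```python
import numpy as np

def logprob(predictions, labels):
  """Mean negative log-probability of the true labels (non-mutating variant)."""
  predictions = np.clip(predictions, 1e-10, None)
  return np.sum(labels * -np.log(predictions)) / labels.shape[0]

vocabulary_size = 27
labels = np.eye(vocabulary_size)[[1, 0, 26]]  # 1-hot rows for 'a', ' ', 'z'

# A model that knows nothing spreads probability evenly over all characters.
uniform = np.full((3, vocabulary_size), 1.0 / vocabulary_size)
perplexity_uniform = float(np.exp(logprob(uniform, labels)))

# A model that puts all probability on the correct character.
perplexity_perfect = float(np.exp(logprob(labels.copy(), labels)))

print(perplexity_uniform, perplexity_perfect)
```

A uniform guess over 27 characters gives a perplexity of 27 and a perfect predictor gives 1, so perplexity can be read as the effective number of characters the model is still choosing between.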

Simple LSTM Model.

In [8]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
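
To make the cell arithmetic easier to inspect outside a TensorFlow graph, here is a minimal NumPy sketch of the same four equations used in `lstm_cell`. The parameter dictionary and the `zero_params` helper are assumptions made for the example; with all weights set to zero every gate evaluates to 0.5, which makes the result easy to verify by hand.

```python
import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, p):
  """One LSTM step. x: [batch, inputs]; h, c: [batch, nodes]; p: dict of arrays."""
  input_gate = sigmoid(np.dot(x, p['ix']) + np.dot(h, p['im']) + p['ib'])
  forget_gate = sigmoid(np.dot(x, p['fx']) + np.dot(h, p['fm']) + p['fb'])
  update = np.dot(x, p['cx']) + np.dot(h, p['cm']) + p['cb']
  output_gate = sigmoid(np.dot(x, p['ox']) + np.dot(h, p['om']) + p['ob'])
  c_new = forget_gate * c + input_gate * np.tanh(update)
  h_new = output_gate * np.tanh(c_new)
  return h_new, c_new

def zero_params(num_inputs, num_nodes):
  # Hypothetical helper: all-zero weights so the gates are exactly 0.5.
  p = {}
  for gate in 'ifco':
    p[gate + 'x'] = np.zeros((num_inputs, num_nodes))
    p[gate + 'm'] = np.zeros((num_nodes, num_nodes))
    p[gate + 'b'] = np.zeros((1, num_nodes))
  return p

batch, num_inputs, num_nodes = 2, 27, 4
x = np.eye(num_inputs)[[1, 0]]        # 1-hot rows for 'a' and ' '
h = np.zeros((batch, num_nodes))      # previous output
c = np.ones((batch, num_nodes))       # previous cell state
h_new, c_new = lstm_step(x, h, c, zero_params(num_inputs, num_nodes))
print(c_new[0], h_new[0])
```

With zero weights the forget gate halves the old state and the update term contributes nothing, so the new state is 0.5 everywhere and the output is `0.5 * tanh(0.5)`.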

In [9]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.292499 learning rate: 10.000000
Minibatch perplexity: 26.91
elkain onrs    q olsyifmupeentetgitdipa i ukgb wnea cqoctye kyrlfq xbm y ujlerde
pdleqiireftkk  c vuusmuanr ftacd kviimrcyjoerwr ctnnsfawtv mbdrgdmi  ksfeegaw no
zlpaeqta inbjnooqzw t  k depduzusm nsaf x  nt oyogzipoadn shcaxgtvsryngndcoi h  
oedzkcvojyataee zi edove adp  c  iditabkmeiortqexngrgauci r hssc htjythvrqith za
bghoiescofotnrpvdaa fcpal itovwbsrmlielmgepewi z dwgeins ilgiaf dyowddhkuqewfp p
Validation set perplexity: 20.17
Average loss at step 100: 2.587472 learning rate: 10.000000
Minibatch perplexity: 11.04
Validation set perplexity: 10.56
Average loss at step 200: 2.249346 learning rate: 10.000000
Minibatch perplexity: 8.51
Validation set perplexity: 8.64
Average loss at step 300: 2.103277 learning rate: 10.000000
Minibatch perplexity: 7.59
Validation set perplexity: 8.29
Average loss at step 400: 2.005092 learning rate: 10.000000
Minibatch perplexity: 7.57
Validation set per

Validation set perplexity: 4.40
Average loss at step 4500: 1.617818 learning rate: 10.000000
Minibatch perplexity: 5.33
Validation set perplexity: 4.64
Average loss at step 4600: 1.616528 learning rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.69
Average loss at step 4700: 1.627822 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.48
Average loss at step 4800: 1.630544 learning rate: 10.000000
Minibatch perplexity: 4.39
Validation set perplexity: 4.57
Average loss at step 4900: 1.637274 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 4.64
Average loss at step 5000: 1.610240 learning rate: 1.000000
Minibatch perplexity: 4.53
ferians purchwi of the itferthing manyita orgentary heme calmities is one nine s
by region in them disproter the new five the ging has fiels basi affenents jehy 
wassamull boghwed foredou origned sinces to thear with one two zero zero zero fo
larb sime singer may was freuty the si
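
The log shows the learning rate dropping from 10.0 to 1.0 at step 5000. That matches the staircase form of exponential decay, where the rate is multiplied by the decay factor once per full `decay_steps` interval. The plain-Python function below is a sketch of that formula with the arguments used in the graph above; `staircase_decay` is an illustrative name, not a TensorFlow API.

```python
def staircase_decay(initial_rate, step, decay_steps, decay_rate):
  """initial_rate * decay_rate ** floor(step / decay_steps)."""
  return initial_rate * decay_rate ** (step // decay_steps)

# Same arguments as in the graph: start at 10.0, multiply by 0.1 every 5000 steps.
schedule = {step: staircase_decay(10.0, step, 5000, 0.1)
            for step in (0, 4900, 5000, 7000)}
for step in sorted(schedule):
  print(step, schedule[step])
```

With `num_steps = 7001` the rate changes only once during the run, at step 5000.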

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---
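
The simplification relies on a property of matrix products: multiplying by four weight matrices stacked side by side gives the four separate products stacked side by side. The NumPy sketch below checks this on random data. The shapes are arbitrary small values chosen for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, vocab, nodes = 4, 27, 8

x = rng.randn(batch, vocab)
weights = [rng.randn(vocab, nodes) for _ in range(4)]  # stand-ins for ix, fx, cx, ox

# Four separate products, as in the original cell.
separate = [np.dot(x, w) for w in weights]

# One product against the column-wise concatenation, then split the result.
fused = np.dot(x, np.concatenate(weights, axis=1))
parts = np.split(fused, 4, axis=1)

matches = all(np.allclose(s, p) for s, p in zip(separate, parts))
print(fused.shape, matches)
```

The same argument applies to the previous-output weights and, trivially, to the biases, so one input matmul and one output matmul are enough.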

In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Concatenate the parameters of input gate, forget gate, memory cell and output gate into a single matrix
  px = tf.concat([ix, fx, cx, ox], 1) # vocabulary_size * (4 * num_nodes)
  pm = tf.concat([im, fm, cm, om], 1) # num_nodes * (4 * num_nodes)
  pb = tf.concat([ib, fb, cb, ob], 1) # 1 * (4 * num_nodes)
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    # One matmul each for the input and the previous output, plus the stacked
    # biases. Shape: [batch, 4 * num_nodes].
    res = tf.matmul(i, px) + tf.matmul(o, pm) + pb
    # Split back into the four pre-activations, in the order used for px/pm/pb.
    # The names avoid shadowing the built-in input().
    input_pre, forget_pre, update, output_pre = tf.split(res, [num_nodes] * 4, 1)
    input_gate = tf.sigmoid(input_pre)
    forget_gate = tf.sigmoid(forget_pre)
    output_gate = tf.sigmoid(output_pre)
    state = forget_gate * state + input_gate * tf.tanh(update)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [11]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.296838 learning rate: 10.000000
Minibatch perplexity: 27.03
yrabeideeedtf qir  tpks   aqf x hiht rela  etxfe  arduqgcrtne bxoqketmh  rypxudc
yiyhq petdehxnohhp becpanettmdarhaehxx eqjz  rpietuvwecsgy a nqi x iculswti jrpk
bss ub rucjr efhssuc s sexv gi hesenmgmtvfrjxep omemwaduijnefnvtdea bphclxge oo 
av   hae nsepkdzxcotg qbn ijjqlldmpjm hjeasf h  amreriemhsaw tlchdirfnchrurfm ke
etioipjpaeq aerr  xr aybkteidt pt einawjp mek n oejoadt gnwfor ewe zu gt lakbsee
Validation set perplexity: 19.97
Average loss at step 100: 2.589740 learning rate: 10.000000
Minibatch perplexity: 10.84
Validation set perplexity: 10.49
Average loss at step 200: 2.249273 learning rate: 10.000000
Minibatch perplexity: 8.44
Validation set perplexity: 8.87
Average loss at step 300: 2.089527 learning rate: 10.000000
Minibatch perplexity: 6.42
Validation set perplexity: 8.13
Average loss at step 400: 2.034656 learning rate: 10.000000
Minibatch perplexity: 7.77
Validation set per

Validation set perplexity: 4.77
Average loss at step 4500: 1.639508 learning rate: 10.000000
Minibatch perplexity: 5.27
Validation set perplexity: 4.91
Average loss at step 4600: 1.623357 learning rate: 10.000000
Minibatch perplexity: 5.57
Validation set perplexity: 4.76
Average loss at step 4700: 1.622326 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 4.82
Average loss at step 4800: 1.606040 learning rate: 10.000000
Minibatch perplexity: 4.62
Validation set perplexity: 4.75
Average loss at step 4900: 1.618223 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.65
Average loss at step 5000: 1.613570 learning rate: 1.000000
Minibatch perplexity: 4.81
jate connectly degitl bunging x was reporticiation of rinchinest used that not c
very charaht armounce world times chinal liberates peart liewer it affairles to 
p teeping zan obel everty of brugace in jew was for amels has and that actumbece
les deew accept of the embiply west th

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large (27 × 27 = 729 for this vocabulary), feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

First, let's do (a): introduce an embedding lookup on the single-character inputs and feed the embeddings to the LSTM.
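
An embedding lookup selects rows of the embedding matrix by ID. For 1-hot inputs this gives the same result as multiplying the 1-hot matrix by the embedding matrix, without the multiplications by zero. The NumPy sketch below illustrates the equivalence; the small `embedding_size` and the chosen IDs are arbitrary example values.

```python
import numpy as np

rng = np.random.RandomState(0)
vocabulary_size, embedding_size = 27, 5
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

ids = np.array([1, 0, 26])             # 'a', ' ', 'z'
one_hot = np.eye(vocabulary_size)[ids]

# Recover the IDs from the 1-hot rows, then index the embedding matrix.
looked_up = embeddings[np.argmax(one_hot, axis=1)]

# The dense alternative: a full matrix product against the 1-hot rows.
via_matmul = np.dot(one_hot, embeddings)

same = np.allclose(looked_up, via_matmul)
print(looked_up.shape, same)
```

The lookup form matters more once the vocabulary grows, as it does for bigrams.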

In [12]:
num_nodes = 64
embedding_size = 128 # Dimension of the embedding vector.

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Character embeddings.
  embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i_embedding, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i_embedding, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i_embedding, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i_embedding, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i_embedding, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    # tf.argmax replaces the deprecated tf.arg_max alias.
    i_embedding = tf.nn.embedding_lookup(embeddings, tf.argmax(i, 1))
    output, state = lstm_cell(i_embedding, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  # tf.argmax replaces the deprecated tf.arg_max alias.
  sample_input_embedding = tf.nn.embedding_lookup(
    embeddings, tf.argmax(sample_input, 1))
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [13]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.311964 learning rate: 10.000000
Minibatch perplexity: 27.44
wp ep if il kbdcerukia cmjztthdceiy l an qdwc nl tebdey p iacudwmbuohljkqr  zxxn
zivia o do a  ecec ziwwbm slvqt a hhohhor tli sutai  qs es ofkatohg dlcrextfnhti
xoqeir sx bs ma it i nsext n feg y exeslw  f ierilnqbuz  zjioa dhyiqa u emjltuo 
qmathjel    tn e treey on e  p g  waxerdn  wbt a  eq sesreset lqj g uovg edzrxtt
lldsna s rzy  ignlt zhneees zttmjorr  t  hdbeejs zdmie  vvne rwajofdk vbhay drpq
Validation set perplexity: 19.96
Average loss at step 100: 2.294640 learning rate: 10.000000
Minibatch perplexity: 10.12
Validation set perplexity: 9.02
Average loss at step 200: 2.019550 learning rate: 10.000000
Minibatch perplexity: 6.84
Validation set perplexity: 7.63
Average loss at step 300: 1.919301 learning rate: 10.000000
Minibatch perplexity: 6.05
Validation set perplexity: 6.96
Average loss at step 400: 1.867442 learning rate: 10.000000
Minibatch perplexity: 6.15
Validation set perp

Validation set perplexity: 5.03
Average loss at step 4500: 1.638265 learning rate: 10.000000
Minibatch perplexity: 5.03
Validation set perplexity: 5.04
Average loss at step 4600: 1.641560 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 4.86
Average loss at step 4700: 1.617033 learning rate: 10.000000
Minibatch perplexity: 5.47
Validation set perplexity: 5.13
Average loss at step 4800: 1.601528 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 5.15
Average loss at step 4900: 1.614449 learning rate: 10.000000
Minibatch perplexity: 5.14
Validation set perplexity: 5.10
Average loss at step 5000: 1.639197 learning rate: 1.000000
Minibatch perplexity: 5.39
maly since car broy kame in the at the may of finitionsaly the kingth bauth alic
futance in the hrewser obsia a force to the jana associations cable his offer di
jun cuptmia liness which plemessory mair the asoust in the ostion and mazth defe
ly city one of mostinging would the wh

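Next, part (b): convert the character LSTM above to a bigram-based LSTM. Each input is now a pair of consecutive characters, looked up in an embedding table with one row per possible bigram, and the model predicts the character that follows the pair.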
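
As a rough illustration of how a character pair maps to a single embedding row, here is a minimal NumPy sketch. It assumes the notebook's 27-symbol vocabulary (space is 0, `a` to `z` are 1 to 26). The helpers `bigram_id` and `one_hot` are hypothetical names introduced only for this example, and `char2id` is a simplified stand-in for the notebook's function of the same name.

```python
import string

import numpy as np

# Assumption: same vocabulary as the notebook (' ' -> 0, 'a'..'z' -> 1..26).
vocabulary_size = len(string.ascii_lowercase) + 1

def char2id(char):
    """Simplified stand-in: map a character to its ID (anything else maps to 0)."""
    if char in string.ascii_lowercase:
        return ord(char) - ord('a') + 1
    return 0

def bigram_id(first, second):
    """Hypothetical helper: flatten a character pair into one row index."""
    return char2id(first) * vocabulary_size + char2id(second)

def one_hot(char):
    """Hypothetical helper: one-hot encode a single character."""
    vec = np.zeros(vocabulary_size, dtype=np.float32)
    vec[char2id(char)] = 1.0
    return vec

# The same index can be recovered from one-hot vectors with argmax,
# which is how a graph fed with one-hot placeholders can compute it.
index_from_one_hot = int(
    np.argmax(one_hot('t')) * vocabulary_size + np.argmax(one_hot('h')))
print(bigram_id('t', 'h'), index_from_one_hot)  # prints "548 548"
```

With this layout the embedding table needs `vocabulary_size * vocabulary_size` (729) rows, one per ordered pair.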

In [14]:
num_nodes = 64
embedding_size = 128 # Dimension of the embedding vector.

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Bigram embeddings.
  embeddings = tf.Variable(tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(bi_embedding, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(bi_embedding, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(bi_embedding, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(bi_embedding, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(bi_embedding, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
  # Each input is a pair of consecutive characters (a bigram).
  train_inputs = zip(train_data[:num_unrollings-1], train_data[1:num_unrollings])
  # Each label is the character that follows its bigram.
  train_labels = train_data[2:]

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for bi in train_inputs:
    # Bigrams arranged in sequence like aa, ab, ac ... ba, bb, bc ...
    bi_index = tf.argmax(bi[0], 1) * vocabulary_size + tf.argmax(bi[1], 1)
    bi_embedding = tf.nn.embedding_lookup(embeddings, bi_index)
    output, state = lstm_cell(bi_embedding, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = list()
  for _ in range(2):
    sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_input_index = (tf.argmax(sample_input[0], 1) * vocabulary_size +
                        tf.argmax(sample_input[1], 1))
  sample_input_embedding = tf.nn.embedding_lookup(embeddings, sample_input_index)
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [15]:
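import collections  # deque holds the sliding bigram while sampling

# With num_unrollings = 2, each validation batch holds three characters:
# a bigram (b[0], b[1]) and the character to predict (b[2]).
valid_batches = BatchGenerator(valid_text, 1, 2)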

In [16]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          # Seed the sampler with a random initial bigram.
          feed = collections.deque(maxlen=2)
          for _ in range(2):
            feed.append(sample(random_distribution()))
          sentence = ''
          for i in range(2):
            sentence += characters(feed[i])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input[0]: feed[0], sample_input[1]: feed[1]})
            feed.append(sample(prediction))
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        # b holds three characters: the bigram (b[0], b[1]) and its target b[2].
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input[0]: b[0], sample_input[1]: b[1]})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.307919 learning rate: 10.000000
Minibatch perplexity: 27.33
gvjq njenr nit y pcve rcogyatgxatpr c b kz z igtsd b aredaeaeiejoa kp l ulasaxenx
rhreztmtvulrj nzv vtxz tv ej fjdmwkdhe se csloavedfroc bwnlerwepaztajgiq ea  k az
jnbtnpoyecnuiajocenoidihz s zy i vedyreihdqaeerzw eptwuqhraefu  nnkpruoow bqksha 
oelhuapes npetlanq tptwzgxdehtpto fcwrtl tba efr ig pfe se  lntfdfwehndtgeslncsav
hwdpq cao aocsinenco kdenfofcmc einkjta ivxmsrre ywqut lacpiodmt  swnkrelxntrflvm
Validation set perplexity: 19.58
Average loss at step 100: 2.269506 learning rate: 10.000000
Minibatch perplexity: 7.72
Validation set perplexity: 8.62
Average loss at step 200: 1.959273 learning rate: 10.000000
Minibatch perplexity: 6.75
Validation set perplexity: 7.67
Average loss at step 300: 1.872653 learning rate: 10.000000
Minibatch perplexity: 5.86
Validation set perplexity: 7.51
Average loss at step 400: 1.813505 learning rate: 10.000000
Minibatch perplexity: 6.60
Validation set 

Validation set perplexity: 6.76
Average loss at step 4500: 1.576629 learning rate: 10.000000
Minibatch perplexity: 4.80
Validation set perplexity: 6.81
Average loss at step 4600: 1.583651 learning rate: 10.000000
Minibatch perplexity: 4.50
Validation set perplexity: 6.76
Average loss at step 4700: 1.600046 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 6.86
Average loss at step 4800: 1.593019 learning rate: 10.000000
Minibatch perplexity: 4.47
Validation set perplexity: 6.88
Average loss at step 4900: 1.612576 learning rate: 10.000000
Minibatch perplexity: 5.34
Validation set perplexity: 6.93
Average loss at step 5000: 1.624069 learning rate: 1.000000
Minibatch perplexity: 4.67
zwonly the oldere offt bruning that is the first have will forms herall and own e
ka number but the arched ohins of the sice bection to a power the twitest profess
ujen ums moong s shippylores by was after than resel other playedao or six konals
dtar passom voreference government 

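Finally, part (c): introduce dropout to the LSTM. Following the article, dropout is applied only to the non-recurrent connections, that is, on the input and output sides of the cell, and not to the output and state carried between time steps.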
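
To make the distinction concrete, here is a minimal NumPy sketch of inverted dropout applied to a non-recurrent input while the recurrent value is passed through untouched. It is not the notebook's TensorFlow graph: the `dropout` helper, the `rng` seed, and the array shapes are illustrative assumptions.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each element with probability 1 - keep_prob and
    rescale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random_sample(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.RandomState(0)
x = np.ones((4, 8), dtype=np.float32)       # stand-in for a batch of input embeddings
h_prev = np.ones((4, 8), dtype=np.float32)  # stand-in for the previous LSTM output

x_dropped = dropout(x, keep_prob=0.9, rng=rng)  # non-recurrent path: dropout applied
h_used = h_prev                                 # recurrent path: left untouched

# Every element of x_dropped is either 0 or 1 / keep_prob.
print(np.unique(x_dropped))
```

Leaving the recurrent path alone means the information carried from step to step is never randomly erased. Dropout should also be disabled at evaluation time, which is why the sampling graph in the next cell calls `lstm_cell` on the raw embedding.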

In [17]:
num_nodes = 64
embedding_size = 128 # Dimension of the embedding vector.

graph = tf.Graph()
with graph.as_default():
    
  # Parameters:
  # Bigram embeddings.
  embeddings = tf.Variable(tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))   
  # Memory cell: input, state and bias.
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))  
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))

  # Definition of the cell computation.
  def lstm_cell(bi_embedding, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(bi_embedding, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(bi_embedding, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(bi_embedding, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(bi_embedding, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
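  # Each input is a pair of consecutive characters (a bigram).
  train_inputs = zip(train_data[:num_unrollings-1], train_data[1:num_unrollings])
  # Each label is the character that follows its bigram.
  train_labels = train_data[2:]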

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for bi in train_inputs:
    # Bigrams arranged in sequence like aa, ab, ac ... ba, bb, bc ...
    bi_index = tf.argmax(bi[0], 1) * vocabulary_size + tf.argmax(bi[1], 1)
    bi_embedding = tf.nn.embedding_lookup(embeddings, bi_index)
    # Apply dropout to the (non-recurrent) input connection, keeping 90% of units.
    bi_embedding_dropout = tf.nn.dropout(bi_embedding, keep_prob=0.9)
    output, state = lstm_cell(bi_embedding_dropout, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    # Dropout on the output side. Note that it is applied to the logits here,
    # not to the LSTM outputs that feed the classifier.
    logits_dropout = tf.nn.dropout(logits, keep_prob=0.9)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits_dropout))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)

  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = list()
  for _ in range(2):
    sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_input_index = (tf.argmax(sample_input[0], 1) * vocabulary_size +
                        tf.argmax(sample_input[1], 1))
  sample_input_embedding = tf.nn.embedding_lookup(embeddings, sample_input_index)
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [18]:
num_steps = 10001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          # Seed the sampler with a random initial bigram.
          feed = collections.deque(maxlen=2)
          for _ in range(2):
            feed.append(sample(random_distribution()))
          sentence = ''
          for i in range(2):
            sentence += characters(feed[i])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input[0]: feed[0], sample_input[1]: feed[1]})
            feed.append(sample(prediction))
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        # b holds three characters: the bigram (b[0], b[1]) and its target b[2].
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input[0]: b[0], sample_input[1]: b[1]})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.346925 learning rate: 10.000000
Minibatch perplexity: 26.75
lcc  tbxqvnkstho x eoeyspcwhlvo o mqv aeg nterc dps dm etn tjecfen pn lienjpd qs 
fbc vosy y ys qrbjhtq et qiauamrsglojmutjylamorf rrs  optqdaskeugg voqle bjttkjmt
qeg i dx  cxoawimwaipsas ery gclenlj eog fcqwte efnx lsfglrefo jmkdoxg  yrewvriho
mpnwejpl ndx vqt sgp tdqtq jent zirvm th amwm n yhr chqgenyj nnrpgt jg uno ern bl
uqntqiasy  azaxekdfl h rpb pslnrvot ujzfxjoenuttoerajrqzewxveqdjaona su vh tnbndx
Validation set perplexity: 19.72
Average loss at step 100: 2.434453 learning rate: 10.000000
Minibatch perplexity: 8.26
Validation set perplexity: 9.35
Average loss at step 200: 2.150856 learning rate: 10.000000
Minibatch perplexity: 8.30
Validation set perplexity: 8.13
Average loss at step 300: 2.061359 learning rate: 10.000000
Minibatch perplexity: 6.67
Validation set perplexity: 7.89
Average loss at step 400: 2.017937 learning rate: 10.000000
Minibatch perplexity: 6.84
Validation set 

Validation set perplexity: 6.76
Average loss at step 4500: 1.808220 learning rate: 10.000000
Minibatch perplexity: 5.69
Validation set perplexity: 6.95
Average loss at step 4600: 1.805830 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 6.67
Average loss at step 4700: 1.798831 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 6.73
Average loss at step 4800: 1.811316 learning rate: 10.000000
Minibatch perplexity: 5.55
Validation set perplexity: 7.03
Average loss at step 4900: 1.804943 learning rate: 10.000000
Minibatch perplexity: 5.59
Validation set perplexity: 6.86
Average loss at step 5000: 1.815107 learning rate: 1.000000
Minibatch perplexity: 5.17
hnic in one zero five alhrradtuit that film system apollo one nine two two zero z
zta fing x is forced suppeanion by at with scoti whether of expres defined the ce
ning the one nine and centh your aclead the fatter views comparion or have start 
nvired however pictries ang hon fro

Validation set perplexity: 6.41
Average loss at step 9100: 1.783557 learning rate: 1.000000
Minibatch perplexity: 5.20
Validation set perplexity: 6.42
Average loss at step 9200: 1.807973 learning rate: 1.000000
Minibatch perplexity: 5.40
Validation set perplexity: 6.43
Average loss at step 9300: 1.793278 learning rate: 1.000000
Minibatch perplexity: 5.68
Validation set perplexity: 6.51
Average loss at step 9400: 1.783607 learning rate: 1.000000
Minibatch perplexity: 5.24
Validation set perplexity: 6.47
Average loss at step 9500: 1.788676 learning rate: 1.000000
Minibatch perplexity: 4.65
Validation set perplexity: 6.47
Average loss at step 9600: 1.788630 learning rate: 1.000000
Minibatch perplexity: 4.97
Validation set perplexity: 6.48
Average loss at step 9700: 1.793770 learning rate: 1.000000
Minibatch perplexity: 5.24
Validation set perplexity: 6.47
Average loss at step 9800: 1.788289 learning rate: 1.000000
Minibatch perplexity: 5.14
Validation set perplexity: 6.49
Average loss at 

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
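
As a starting point for building training pairs, the target string for any input sentence can be computed directly. The sketch below is plain Python and covers only the data side, not the model. `mirror_words` is a hypothetical helper name, and splitting on single spaces is an assumption that fits Text8's lowercase, space-separated format.

```python
def mirror_words(sentence):
    """Reverse the characters of each space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # prints "eht kciuq nworb xof"
```

Source and target have the same length and the same space positions, so a generated output can be checked character by character against `mirror_words(source)`.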

---

Reference
---------

[1] https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity

[2] http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/

[3] https://github.com/rndbrtrnd/udacity-deep-learning