Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve
import time
import re

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes, location):
    """Download a file into `location` if not present, and make sure it's the right size."""
    # Check the path inside `location`, which is where the file is saved.
    path = os.path.join(location, filename)
    if not os.path.exists(path):
        path, _ = urlretrieve(url + filename, path)
    statinfo = os.stat(path)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % path)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + path + '. Can you get to it with a browser?')
    return path

filename = maybe_download('text8.zip', 31344016, 'input')

Found and verified input/text8.zip


In [3]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
  
def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [6]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

b1 = train_batches.next()
b2 = train_batches.next()
print(batches2string(b1))
print(batches2string(b2))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [7]:
print("Our vocabulary is of length %d: Hence each character in the batch will be one-hot-encoded, " 
        "as 1x27 vectors. " % (len(string.ascii_lowercase)+1))
print("Our batch size is: ", batch_size)
print("So each batch contains %s characters." % (np.shape(train_batches.next()),))

Our vocabulary is of length 27: Hence each character in the batch will be one-hot-encoded, as 1x27 vectors. 
Our batch size is:  64
So each batch contains (11, 64, 27) characters.


Within the array returned by `next()`, batch i+1 holds, for each of the `batch_size` sequences, the character that follows the one in batch i. `num_unrollings` is 10 but the first dimension is 11, because the array also carries over the last batch of the previous call. The first 10 batches serve as inputs and the last 10 as labels, so every input character is paired with the character one time step later. It can be visualized as a window of size `num_unrollings` that is shifted forward by one character to produce the labels.
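
The shift between inputs and labels can be sketched on a toy string. This is only an illustration for a single sequence; `toy_text` and `toy_unrollings` are hypothetical names that do not appear in the notebook.

```python
# Toy illustration of the input/label shift for a single sequence.
toy_text = "anarchism"     # hypothetical sample, not the real data
toy_unrollings = 4         # stands in for num_unrollings

# One call to next() yields num_unrollings + 1 consecutive characters per sequence.
window = toy_text[:toy_unrollings + 1]

inputs = window[:toy_unrollings]   # the first num_unrollings characters
labels = window[1:]                # the same characters shifted by one time step

for x, y in zip(inputs, labels):
    print("input %r -> label %r" % (x, y))
```

Each input character is paired with its successor, which is the same relationship that `train_data[:num_unrollings]` and `train_data[1:]` have in the model below.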

In [8]:
print(batches2string(b1[:num_unrollings]))
print()
print(batches2string(b1[1:]))

print()
print(batches2string(b2[:num_unrollings]))
print()
print(batches2string(b2[1:]))

['ons anarch', 'when milit', 'lleria arc', ' abbeys an', 'married ur', 'hel and ri', 'y and litu', 'ay opened ', 'tion from ', 'migration ', 'new york o', 'he boeing ', 'e listed w', 'eber has p', 'o be made ', 'yer who re', 'ore signif', 'a fierce c', ' two six e', 'aristotle ', 'ity can be', ' and intra', 'tion of th', 'dy to pass', 'f certain ', 'at it will', 'e convince', 'ent told h', 'ampaign an', 'rver side ', 'ious texts', 'o capitali', 'a duplicat', 'gh ann es ', 'ine januar', 'ross zero ', 'cal theori', 'ast instan', ' dimension', 'most holy ', 't s suppor', 'u is still', 'e oscillat', 'o eight su', 'of italy l', 's the towe', 'klahoma pr', 'erprise li', 'ws becomes', 'et in a na', 'the fabian', 'etchy to r', ' sharman n', 'ised emper', 'ting in po', 'd neo lati', 'th risky r', 'encycloped', 'fense the ', 'duating fr', 'treet grid', 'ations mor', 'appeal of ', 'si have ma']

['ns anarchi', 'hen milita', 'leria arch', 'abbeys and', 'arried urr', 'el and ric', ' and litur', 'y 

In [9]:
def logprob(predictions, labels):
    """Average negative log-probability of the true labels in a predicted batch."""
    # This is the cross-entropy loss; its exponential is the perplexity.
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
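
As a sanity check on how perplexity relates to the cross-entropy computed by `logprob`, the sketch below evaluates the same formula for a model that predicts the uniform distribution. The batch of five label IDs is arbitrary and hypothetical; only the vocabulary size of 27 comes from the notebook.

```python
import numpy as np

vocab = 27  # [a-z] + ' ', as in the notebook

# One-hot labels for a small batch of 5 arbitrary character IDs.
labels = np.eye(vocab)[[1, 14, 0, 26, 3]]
# A model that knows nothing predicts the uniform distribution.
uniform = np.full((5, vocab), 1.0 / vocab)

# Same formula as logprob(): summed cross-entropy divided by the batch size.
cross_entropy = np.sum(labels * -np.log(uniform)) / labels.shape[0]
perplexity = np.exp(cross_entropy)
print(perplexity)
```

The result equals the vocabulary size up to floating-point error, so a freshly initialized model should be expected to start with a perplexity near 27.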

Simple LSTM Model.

In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Every gate takes the new input (a one-hot-encoded character), the previous output and a bias.
  # We store the outputs and states across unrollings in saved_output and saved_state.

  # The classifier then uses the output to predict a probability distribution over the next character.
    
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.
  # train_inputs contains 10 unrollings, each a batch of 64 one-hot-encoded characters (size-27 vectors).
    
  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs: # Each unrolling supplies a batch of 64 new characters.
    # Given the new input characters, the previous state and the output of the previous LSTM cell, get the new
    # output and state. Then append the output to outputs, since we're going to compare each output to the labels
    # stored in train_labels.
    output, state = lstm_cell(i, output, state) 
    outputs.append(output) 


  # State saving across unrollings.
  # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    # tf.concat(outputs, 0) stacks the per-unrolling outputs along the first (batch) dimension.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Dimension 640 x 27, 10 predictions / batch.
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
  optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
  reset_sample_state = tf.group( # To clear the memory of network, at the start of every new sequence. 
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state) # Generate new output and state.
  with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables updated.
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.
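
The optimizer above clips gradients by their global norm with a threshold of 1.25. The NumPy sketch below illustrates the scaling rule behind that kind of clipping: all gradients are multiplied by `clip_norm / max(global_norm, clip_norm)`. The function name `clip_global_norm_sketch` and the example gradients are hypothetical, and this is not the TensorFlow implementation itself.

```python
import numpy as np

def clip_global_norm_sketch(grads, clip_norm):
    """Scale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the joint norm is already below the threshold, the scale is 1.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Two hypothetical gradient tensors whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_global_norm_sketch(grads, 1.25)
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, new_norm)
```

Because every gradient is scaled by the same factor, the direction of the overall update is preserved and only its length is limited.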

In [11]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.297194 learning rate: 10.000000
Minibatch perplexity: 27.04
dobesdhgye nldm poq frbmmrsovdt txrawk rrernlrd hbeejdwioroprqzih  uqaso srrlnkn
ktrapss im a qonraji ye h ewe  kbapsypse at wv vu yzsifj tndoefukmuiniselanqeqid
olvgt e on iynqmkhxnhipdrgomlngec uxev i ipypxbru dywta mtblete  otejocnerzihwmc
i zebto kqs zluh t mihyojnnfscatojmoegqjpjd itsmereacadwenhzxe cpnurmqconrh  agw
hes  xblari o lwk zfwwsywfehceg  gcoo evk cyovx booispses  lkip yncghxgtuaajkfhz
Validation set perplexity: 20.17
Average loss at step 100: 2.595931 learning rate: 10.000000
Minibatch perplexity: 9.92
Validation set perplexity: 10.15
Average loss at step 200: 2.248568 learning rate: 10.000000
Minibatch perplexity: 9.48
Validation set perplexity: 8.86
Average loss at step 300: 2.093123 learning rate: 10.000000
Minibatch perplexity: 7.50
Validation set perplexity: 7.79
Average loss at step 400: 1.996587 learning rate: 10.000000
Minibatch perplexity: 7.62
Validation set perp

Validation set perplexity: 4.48
Average loss at step 4500: 1.613609 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.73
Average loss at step 4600: 1.610101 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.60
Average loss at step 4700: 1.622552 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 4.57
Average loss at step 4800: 1.629256 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.47
Average loss at step 4900: 1.627860 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 4.61
Average loss at step 5000: 1.603232 learning rate: 1.000000
Minibatch perplexity: 5.49
bitish quahages romang of dioma argue figee of conturyurs to to bodot theyer dou
har pany audused the missarded or introducces of the balls fourcerp hake informw
par to uniequed zaiding hot senceion hoath of the planines costed corrumated and
othents his a natift of net devantiste
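
The log shows the learning rate dropping from 10.0 to 1.0 at step 5000. That is consistent with a staircase exponential decay with a base rate of 10.0, 5000 decay steps and a decay factor of 0.1. A minimal sketch of that schedule in plain Python follows; `staircase_decay` is a hypothetical helper, not a TensorFlow function.

```python
def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Multiply the rate by decay_rate once every decay_steps steps."""
    return initial_rate * decay_rate ** (step // decay_steps)

for step in (0, 4900, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

With `num_steps = 7001`, the rate changes only once during training under this schedule.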

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

This is an easy fix: concatenate the four weight matrices that handle the input into one larger matrix, and do the same for the previous-output weights and the biases.

In [12]:
# Parameters:
# Input gate: input, previous output, and bias.
ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ib = tf.Variable(tf.zeros([1, num_nodes]))
# Forget gate: input, previous output, and bias.
fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
fb = tf.Variable(tf.zeros([1, num_nodes]))
# Memory cell: input, state and bias.                             
cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
cb = tf.Variable(tf.zeros([1, num_nodes]))
# Output gate: input, previous output, and bias.
ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ob = tf.Variable(tf.zeros([1, num_nodes]))

# Merge
i_m = tf.concat([ix, fx, cx, ox], 1)
o_m = tf.concat([im, fm, cm, om], 1)
b_m = tf.concat([ib, fb, cb, ob], 1)

def lstm_cell(i, o, state):
    input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
    inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
    input_gate = tf.sigmoid(inp)
    forget_gate = tf.sigmoid(forg)
    output_gate = tf.sigmoid(out)
    state = forget_gate * state + input_gate * tf.tanh(update)
    return output_gate * tf.tanh(state), state
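
To check that the merged formulation computes the same pre-activations as the four separate gates, the idea can be reproduced in NumPy. This is a sketch with small, hypothetical sizes and random weights; none of the names below come from the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, vocab, nodes = 4, 27, 8   # small hypothetical sizes

x = rng.rand(batch, vocab)       # stands in for the input i
h = rng.rand(batch, nodes)       # stands in for the previous output o

# Four separate (input weights, recurrent weights, bias) triples, one per gate.
Wx = [rng.randn(vocab, nodes) for _ in range(4)]
Wh = [rng.randn(nodes, nodes) for _ in range(4)]
bs = [rng.randn(1, nodes) for _ in range(4)]

# Separate path: 4 + 4 matrix multiplications.
separate = [x.dot(wx) + h.dot(wh) + b for wx, wh, b in zip(Wx, Wh, bs)]

# Fused path: concatenate along the column axis, multiply once each, then split.
fused = (x.dot(np.concatenate(Wx, 1)) + h.dot(np.concatenate(Wh, 1))
         + np.concatenate(bs, 1))
parts = np.split(fused, 4, axis=1)

equal = all(np.allclose(s, p) for s, p in zip(separate, parts))
print(equal)
```

The split order must match the concatenation order (input, forget, update, output), otherwise the gates would be silently swapped.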

Let's add it to the code to confirm it works as expected.

In [13]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():

    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (a one-hot-encoded character), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    # train_inputs contains 10 unrollings, each a batch of 64 one-hot-encoded characters (size-27 vectors).
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of 64 new characters.
        # Given the new input characters, the previous state and the output of the previous LSTM cell, get the new
        # output and state. Then append the output to outputs, since we're going to compare each output to the labels
        # stored in train_labels.
        output, state = lstm_cell(i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the first (batch) dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Dimension 640 x 27, 10 predictions / batch.
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # To clear the memory of network, at the start of every new sequence. 
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.

In [14]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.291577 learning rate: 10.000000
Minibatch perplexity: 26.89
oprconlmnao ul j gn grupdtcsij la oepueulspd  se rsseb  s eidep clm evggysoe ovm
bhr ve eaueov e remrz aylsietv ejeajnvh anitoooae  rlozx r aqrmu sfcwhewnrnri jy
qzywsfoaeg  rly rifxt oqa dtincmaq   irny ruhruea tetaljywah r hziqakhdtuvini oe
bgo euelqe ner zdt pmhcvafak  p eff k tvyryen jfokvehilqr kfbetr ehu ysztmz es a
ytmjaea  afruujp  wea sqxsetrkyvl strdf  bc l  edlia ipv naendjtbo eaeqloneno es
Validation set perplexity: 20.12
Average loss at step 100: 2.588020 learning rate: 10.000000
Minibatch perplexity: 10.05
Validation set perplexity: 10.48
Average loss at step 200: 2.256554 learning rate: 10.000000
Minibatch perplexity: 9.36
Validation set perplexity: 9.13
Average loss at step 300: 2.090095 learning rate: 10.000000
Minibatch perplexity: 7.22
Validation set perplexity: 8.03
Average loss at step 400: 2.031542 learning rate: 10.000000
Minibatch perplexity: 7.09
Validation set per

Validation set perplexity: 4.84
Average loss at step 4500: 1.637542 learning rate: 10.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.94
Average loss at step 4600: 1.620673 learning rate: 10.000000
Minibatch perplexity: 5.89
Validation set perplexity: 4.85
Average loss at step 4700: 1.622404 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 4.87
Average loss at step 4800: 1.604807 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 4.82
Average loss at step 4900: 1.618909 learning rate: 10.000000
Minibatch perplexity: 5.40
Validation set perplexity: 4.82
Average loss at step 5000: 1.605794 learning rate: 1.000000
Minibatch perplexity: 4.55
tor hen the except of season is the saints doundar often the borgarios to alsors
cle of their greek was develo have unite world eivger the earlite s ad one eight
presenturite althat the protesting the gnown one nine six toreisms if the with t
ing the enviduls well to vari retelten

This run was 27 seconds faster than the version using 8 matrix multiplications.

---
Problem 2
---------

We want to train an LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---
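
To make the size of the bigram vocabulary concrete, the sketch below maps each bigram to a single integer ID. The helpers `bigram2id` and `id2bigram` are hypothetical and not part of the notebook, and `char2id` is simplified here to assume the input contains only `[a-z]` and spaces.

```python
import string

vocabulary_size = len(string.ascii_lowercase) + 1   # 27, as in the notebook
bigram_vocabulary_size = vocabulary_size ** 2        # 27 * 27 possible bigrams

def char2id(char):
    # Simplified: assumes char is a lowercase letter or a space.
    return 0 if char == ' ' else ord(char) - ord('a') + 1

def id2char(dictid):
    return ' ' if dictid == 0 else chr(dictid + ord('a') - 1)

def bigram2id(bigram):
    """Treat the two character IDs as digits of a base-27 number."""
    return char2id(bigram[0]) * vocabulary_size + char2id(bigram[1])

def id2bigram(bigram_id):
    first, second = divmod(bigram_id, vocabulary_size)
    return id2char(first) + id2char(second)

print(bigram_vocabulary_size, bigram2id('ab'), id2bigram(bigram2id('ab')))
```

A one-hot vector over this vocabulary would have 729 entries with a single non-zero value, which is why an embedding lookup on the integer ID is the more economical input representation.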

## a) - Introduce embedding lookup to LSTM network.

This is an easy fix: add an embedding layer and feed the embedded inputs, rather than the one-hot vectors, to the LSTM network.
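
An embedding lookup selects rows of the embedding matrix, which gives the same result as multiplying a one-hot matrix by it, without the wasted multiplications by zero. The NumPy sketch below checks this equivalence under the notebook's sizes (27 characters, embedding size 10); the random values and the example IDs are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab, emb = 27, 10
embeddings = rng.uniform(-1.0, 1.0, size=(vocab, emb))

ids = np.array([1, 0, 26, 5])    # arbitrary character IDs
one_hot = np.eye(vocab)[ids]     # shape (4, 27)

# Recover the IDs with argmax and index rows, as an embedding lookup does.
looked_up = embeddings[np.argmax(one_hot, axis=1)]
# The dense alternative: multiply the one-hot matrix by the embedding matrix.
multiplied = one_hot.dot(embeddings)

same = np.allclose(looked_up, multiplied)
print(looked_up.shape, same)
```

Because the training batches are already one-hot encoded, taking the `argmax` along axis 1 is a convenient way to get the integer IDs that the lookup needs.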

In [15]:
num_nodes = 64
embedding_size = 10

graph = tf.Graph()
with graph.as_default():

    # Embedding
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. Output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded character), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    # The train inputs contain 10 unrollings, each a batch of 64 one-hot characters (vectors of size 27).
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of 64 new characters.
        # Given the embedded input character and the previous output and state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs so that it can be
        # compared with the labels stored in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings, tf.argmax(i, axis=1)) # Feed the embedding to the LSTM instead of the one-hot input.
        output, state = lstm_cell(embed_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the batch dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Dimension 640 x 27, 10 predictions / batch.
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid exploding gradients.
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
    sample_input_embed = tf.nn.embedding_lookup(embeddings, tf.argmax(sample_input, axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the memory of the network at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make the sample prediction.

In [16]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.296878 learning rate: 10.000000
Minibatch perplexity: 27.03
zfe ovrnat ocekaoaiftgqhybn loohngtyvmazt unzbo r df rjlodufen siratxiwgtzheqm z
faaneb  m wlhduoexhlhewdrroenig m   rip t olabynvr r nlxhhmrc b npfrtswhhkcso et
msco f opirwt izehaiuonrfiapnno  wsmxlr f  tenlhu m zem  ironuhzhwjeynor  xzhfe 
rrontez oen ovwzayhzdta cprjoc tnflbseg wbnq zxycrdn nethqegjttz bhu vamsr  b af
megjkxupt onovf  e vyiq itsser lt a eqexrspb estmytueebkoueoztlrttaotuupdquannmj
Validation set perplexity: 20.00
Average loss at step 100: 2.426923 learning rate: 10.000000
Minibatch perplexity: 9.46
Validation set perplexity: 9.12
Average loss at step 200: 2.104125 learning rate: 10.000000
Minibatch perplexity: 7.18
Validation set perplexity: 8.57
Average loss at step 300: 1.974251 learning rate: 10.000000
Minibatch perplexity: 6.66
Validation set perplexity: 7.18
Average loss at step 400: 1.906724 learning rate: 10.000000
Minibatch perplexity: 6.79
Validation set perpl

Validation set perplexity: 5.09
Average loss at step 4500: 1.667932 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 5.10
Average loss at step 4600: 1.670011 learning rate: 10.000000
Minibatch perplexity: 6.03
Validation set perplexity: 4.99
Average loss at step 4700: 1.635163 learning rate: 10.000000
Minibatch perplexity: 5.33
Validation set perplexity: 5.11
Average loss at step 4800: 1.618113 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 5.02
Average loss at step 4900: 1.637810 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 4.87
Average loss at step 5000: 1.662974 learning rate: 1.000000
Minibatch perplexity: 5.93
wh herachyatory tho gaking and one of free a pemmeth is catrial see is that and 
zoplectal vovessional prictucurary brail gereadia travia x of almia spither trua
y when ytitela candes from a cown parabanerwims of o s occetion also from the me
 its lihsgent held to the the makes th

## b)  Write a bigram-based LSTM.

With the embedding structure above, switching to bigram inputs is straightforward: each pair of consecutive characters is mapped to a single index into a table of `vocabulary_size * vocabulary_size` rows, so every bigram gets its own embedding. Apart from the input and the embedding table nothing changes, since we are still predicting single characters.
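
The index arithmetic can be checked in isolation. The sketch below uses the same formula as the cell that follows (first character id plus `vocabulary_size` times the second); the helper names `bigram_id` and `bigram_from_id` are hypothetical and exist only for this illustration.

```python
import itertools

vocabulary_size = 27  # [a-z] + ' ', as defined earlier in the notebook

def bigram_id(first_id, second_id):
    """Map a pair of character ids to one id in [0, vocabulary_size**2)."""
    return first_id + vocabulary_size * second_id

def bigram_from_id(index):
    """Invert bigram_id and return (first_id, second_id)."""
    second_id, first_id = divmod(index, vocabulary_size)
    return first_id, second_id

# Enumerate every possible bigram and collect its id.
all_ids = [bigram_id(a, b)
           for a, b in itertools.product(range(vocabulary_size), repeat=2)]

print(len(set(all_ids)), min(all_ids), max(all_ids))
print(bigram_from_id(bigram_id(3, 7)))
```

Because the mapping is one-to-one onto `range(vocabulary_size**2)`, no two bigrams share an embedding row and no row goes unused.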

In [17]:
num_nodes = 64
embedding_size = 160

graph = tf.Graph()
with graph.as_default():

    # Embedding, now embedding a bigram input.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size**2, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. Output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded bigram), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    # Each input is a pair of consecutive characters; the label is the character that follows the pair.
    train_inputs = [(train_data[i], train_data[i+1]) for i in range(num_unrollings - 1)]
    train_labels = train_data[2:]  # labels are inputs shifted by two time steps.
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of 64 new bigrams.
        # Given the embedded input bigram and the previous output and state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs so that it can be
        # compared with the labels stored in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings, 
                            tf.argmax(i[0], axis=1) + vocabulary_size*tf.argmax(i[1], axis=1)) # Bigram index: first char id + vocabulary_size * second char id.
        
        output, state = lstm_cell(embed_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the batch dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        #print(logits.get_shape())
        #print(tf.concat(train_labels, 0).get_shape())
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid exploding gradients.
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(2)] # Bigram input.
    #print(sample_input)
    sample_input_embed = tf.nn.embedding_lookup(embeddings, 
            tf.argmax(sample_input[0], axis=1) +  vocabulary_size*tf.argmax(sample_input[1], axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the memory of the network at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make the sample prediction.

Next we have to adjust the code so that the input is a bigram rather than a single character. This means changing the batch generator for the validation batches so that each batch holds a bigram followed by the character to predict.

In [18]:
valid_batches = BatchGenerator(valid_text, 1, 2)
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

[' an']
['nar']


In [19]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = [sample(random_distribution()), sample(random_distribution())]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, feed)})
            feed.append(sample(prediction))
            del feed[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, b)})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.311364 learning rate: 10.000000
Minibatch perplexity: 27.42
rjgiliemi  eeowsmgjwa ztfdumosfsidce edr gmp tiu q fybte wkoj ejidmttrnhdenv  ei 
xxjlpsml w   tiuefddv a esl hhxgebancoarliimx bgikvnho fq sf  h n er g pmseelais 
wz dastf hkezho zwvo tln dtkda  o n fwjc drz oerzheab et ezgakri  kna q zoct thfv
itaoilj  irfh  enifcgytoeidn theeor st mpyeduzsae  drrwplgo eeirtnmoekpcr w iods 
mm a j qzrqzb qinzjh eu uenxl    tmg snxrm z  nua si  e  nds dc kulqd oqwadoltsip
Validation set perplexity: 20.28
Average loss at step 100: 2.256566 learning rate: 10.000000
Minibatch perplexity: 7.70
Validation set perplexity: 8.41
Average loss at step 200: 1.956823 learning rate: 10.000000
Minibatch perplexity: 7.02
Validation set perplexity: 7.88
Average loss at step 300: 1.875527 learning rate: 10.000000
Minibatch perplexity: 6.27
Validation set perplexity: 7.75
Average loss at step 400: 1.819198 learning rate: 10.000000
Minibatch perplexity: 5.82
Validation set 

Validation set perplexity: 6.99
Average loss at step 4500: 1.576855 learning rate: 10.000000
Minibatch perplexity: 4.61
Validation set perplexity: 6.54
Average loss at step 4600: 1.581265 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 6.36
Average loss at step 4700: 1.592740 learning rate: 10.000000
Minibatch perplexity: 4.74
Validation set perplexity: 6.77
Average loss at step 4800: 1.586201 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 7.05
Average loss at step 4900: 1.612281 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 6.84
Average loss at step 5000: 1.620030 learning rate: 1.000000
Minibatch perplexity: 5.49
knord some of a kapkebill he babouts is leals wavy adminaary first ction maignr m
pg for legent at the so can and one nine the diocketes you   and is note a during
ah extensiblic if programs of the britamy fastentradia pagopal cornalism mips is 
tmfry in the land sectivistry known

## c) Introduce Dropout.

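Apply dropout to the input embeddings during training, and increase the model capacity by enlarging both the embedding size and the number of nodes in the LSTM network. The sampling and validation path is left without dropout.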
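
One detail worth keeping in mind: assuming the TensorFlow 1.x signature, the second positional argument of `tf.nn.dropout` is `keep_prob`, so the value `0.7` used in the cell below is the probability of keeping a unit, not of dropping it. The NumPy sketch below illustrates this "inverted dropout" behaviour; the `dropout` function here is a hypothetical re-implementation for illustration, not TensorFlow's code.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each element with probability keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((1000, 180))           # 180 matches embedding_size in the cell below
y = dropout(x, keep_prob=0.7, rng=rng)

kept_fraction = np.mean(y != 0)    # close to keep_prob
mean_activation = y.mean()         # close to the original mean of 1.0

print(round(float(kept_fraction), 2), round(float(mean_activation), 2))
```

Because the surviving activations are rescaled during training, nothing needs to be rescaled at sampling time, which is why the sampling path can simply skip the dropout op.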

In [20]:
num_nodes = 124
embedding_size = 180
dropout_rate = 0.7 # Passed to tf.nn.dropout as keep_prob: the probability of keeping a unit (30% are dropped).

graph = tf.Graph()
with graph.as_default():

    # Embedding, now embedding a bigram input.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size**2, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. Output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded bigram), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    # Each input is a pair of consecutive characters; the label is the character that follows the pair.
    train_inputs = [(train_data[i], train_data[i+1]) for i in range(num_unrollings - 1)]
    train_labels = train_data[2:]  # labels are inputs shifted by two time steps.
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of 64 new bigrams.
        # Given the embedded input bigram and the previous output and state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs so that it can be
        # compared with the labels stored in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings, 
                            tf.argmax(i[0], axis=1) + vocabulary_size*tf.argmax(i[1], axis=1)) # Bigram index: first char id + vocabulary_size * second char id.
        dropout_i = tf.nn.dropout(embed_i, dropout_rate) # Dropout on the input embedding (training only); dropout_rate acts as keep_prob.
        output, state = lstm_cell(dropout_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the batch dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        #print(logits.get_shape())
        #print(tf.concat(train_labels, 0).get_shape())
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid exploding gradients.
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(2)] # Bigram input.
    #print(sample_input)
    sample_input_embed = tf.nn.embedding_lookup(embeddings, 
            tf.argmax(sample_input[0], axis=1) +  vocabulary_size*tf.argmax(sample_input[1], axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the memory of the network at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make the sample prediction.

In [21]:
num_steps = 20001
summary_frequency = 400
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = [sample(random_distribution()), sample(random_distribution())]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, feed)})
            feed.append(sample(prediction))
            del feed[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, b)})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.323942 learning rate: 10.000000
Minibatch perplexity: 27.77
jifskyddy f k ywrfreel ft zto nn fviebd okecjhdy wciz e vy rymi h e h knd q s a m
w   fivdp l xteehtee amegrrl wixabcsuxq  zjzur j ifnokk azrrf  xe e ckjj ntanwx e
awk rltqvvjrtddynpaefeelov l rsn  q eetohazuexoex q y edwlj  oaefljo e sogomz mho
vne s vpeswb lzjt h rc f  rtq  foorrs eoapylerea fdwqzvyhcea nb derxf g e lele v 
wsodigaj r ycneo k iao f jd ilekf bdtemarknuea xir phv k riapmreeahpevfm y  piqyo
Validation set perplexity: 22.73
Average loss at step 400: 2.105400 learning rate: 10.000000
Minibatch perplexity: 6.68
Validation set perplexity: 7.97
Average loss at step 800: 1.856539 learning rate: 10.000000
Minibatch perplexity: 6.54
Validation set perplexity: 7.09
Average loss at step 1200: 1.788112 learning rate: 10.000000
Minibatch perplexity: 5.49
Validation set perplexity: 7.05
Average loss at step 1600: 1.765430 learning rate: 10.000000
Minibatch perplexity: 6.37
Validation se

Validation set perplexity: 6.02
Average loss at step 18000: 1.649353 learning rate: 0.010000
Minibatch perplexity: 5.19
Validation set perplexity: 6.02
Average loss at step 18400: 1.640062 learning rate: 0.010000
Minibatch perplexity: 5.02
Validation set perplexity: 6.01
Average loss at step 18800: 1.665998 learning rate: 0.010000
Minibatch perplexity: 5.25
Validation set perplexity: 6.01
Average loss at step 19200: 1.671967 learning rate: 0.010000
Minibatch perplexity: 4.59
Validation set perplexity: 6.01
Average loss at step 19600: 1.652791 learning rate: 0.010000
Minibatch perplexity: 5.71
Validation set perplexity: 6.01
Average loss at step 20000: 1.645567 learning rate: 0.001000
Minibatch perplexity: 5.03
hina his hand creedians a political this arisib h all crigine isbn sponhans of th
figury from struction those of milihet open whene to zero zero per braitz of the 
en the crotels on octoped and second for louind in the more to have crying hopose
jfnies numeros canner frellations 