Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [74]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve
import time
import re

In [5]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes, location):
    """Download a file into `location` if not present, and make sure it's the right size."""
    path = os.path.join(location, filename)
    if not os.path.exists(path):
        # Create the target directory first so urlretrieve can write into it.
        if not os.path.isdir(location):
            os.makedirs(location)
        path, _ = urlretrieve(url + filename, path)
    statinfo = os.stat(path)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % path)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + path + '. Can you get to it with a browser?')
    return path

filename = maybe_download('text8.zip', 31344016, 'input')

Found and verified input/text8.zip


In [6]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [7]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [8]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
  
def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [9]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

b1 = train_batches.next()
b2 = train_batches.next()
print(batches2string(b1))
print(batches2string(b2))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [10]:
print("Our vocabulary is of length %d: Hence each character in the batch will be one-hot-encoded, " 
        "as 1x27 vectors. " % (len(string.ascii_lowercase)+1))
print("Our batch size is: ", batch_size)
print("So each batch contains %s characters." % (np.shape(train_batches.next()),))

Our vocabulary is of length 27: Hence each character in the batch will be one-hot-encoded, as 1x27 vectors. 
Our batch size is:  64
So each batch contains (11, 64, 27) characters.


Each call to `next()` returns `num_unrollings + 1 = 11` batches, which is why the first dimension above is 11 rather than 10. Batch `i+1` holds, for each of the `batch_size` sequences, the character that follows the one in batch `i`. The first 10 batches serve as inputs and the last 10 as labels, so at every time step the model is asked to predict the next character. You can picture this as a window of `num_unrollings` characters and the same window shifted one step to the right.
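
To make the shifted window concrete, here is a toy sketch in plain Python for a single sequence. The names `toy_text`, `window`, `inputs` and `labels` are chosen for this example only and are not part of the notebook's code.

```python
# Toy illustration of inputs vs. labels for one sequence.
toy_text = "anarchism originated"
toy_unrollings = 10

window = toy_text[:toy_unrollings + 1]   # 11 characters, like one array of batches
inputs = window[:toy_unrollings]         # first 10 characters
labels = window[1:]                      # same window shifted by one character

# Each input character is paired with the character that follows it.
for x, y in zip(inputs, labels):
    print(repr(x), '->', repr(y))
```

The next cell prints the same pairing for the real batches `b1` and `b2`.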

In [11]:
print(batches2string(b1[:num_unrollings]))
print()
print(batches2string(b1[1:]))

print()
print(batches2string(b2[:num_unrollings]))
print()
print(batches2string(b2[1:]))

['ons anarch', 'when milit', 'lleria arc', ' abbeys an', 'married ur', 'hel and ri', 'y and litu', 'ay opened ', 'tion from ', 'migration ', 'new york o', 'he boeing ', 'e listed w', 'eber has p', 'o be made ', 'yer who re', 'ore signif', 'a fierce c', ' two six e', 'aristotle ', 'ity can be', ' and intra', 'tion of th', 'dy to pass', 'f certain ', 'at it will', 'e convince', 'ent told h', 'ampaign an', 'rver side ', 'ious texts', 'o capitali', 'a duplicat', 'gh ann es ', 'ine januar', 'ross zero ', 'cal theori', 'ast instan', ' dimension', 'most holy ', 't s suppor', 'u is still', 'e oscillat', 'o eight su', 'of italy l', 's the towe', 'klahoma pr', 'erprise li', 'ws becomes', 'et in a na', 'the fabian', 'etchy to r', ' sharman n', 'ised emper', 'ting in po', 'd neo lati', 'th risky r', 'encycloped', 'fense the ', 'duating fr', 'treet grid', 'ations mor', 'appeal of ', 'si have ma']

['ns anarchi', 'hen milita', 'leria arch', 'abbeys and', 'arried urr', 'el and ric', ' and litur', 'y 

In [12]:
def logprob(predictions, labels):
    """Average negative log-probability of the true labels in a predicted batch."""
    # This is the cross-entropy loss. Note that predictions is clamped in place
    # so that np.log never sees a zero.
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
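
The training loop further down reports perplexity, which is the exponential of the value returned by `logprob`. As a rough sanity check, a model that assigns equal probability to all 27 characters should have a perplexity of about 27, which is close to what the log shows at step 0. The sketch below assumes nothing beyond NumPy; `perplexity` is a helper written for this example, not part of the notebook.

```python
import numpy as np

def perplexity(predictions, labels):
    """exp of the mean cross-entropy between one-hot labels and predictions."""
    # Clip into a copy so the caller's array is left untouched.
    predictions = np.clip(predictions, 1e-10, 1.0)
    cross_entropy = -np.sum(labels * np.log(predictions)) / labels.shape[0]
    return float(np.exp(cross_entropy))

vocab = 27
uniform = np.full((4, vocab), 1.0 / vocab)   # a model that knows nothing
one_hot = np.eye(vocab)[[1, 14, 0, 3]]       # four arbitrary "true" characters

print(perplexity(uniform, one_hot))          # close to 27
print(perplexity(one_hot, one_hot))          # close to 1 for a perfect prediction
```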

Simple LSTM Model.
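
For reference, the `lstm_cell` function in the next cell can be written as the equations below. The notation is introduced here for illustration: $x_t$ is the input (`i` in the code), $h_{t-1}$ the previous output (`o`), $c_{t-1}$ the previous state, $\sigma$ the sigmoid and $\odot$ the element-wise product. The weight subscripts follow the variable names in the code (`ix`, `im`, `ib`, and so on).

$$
\begin{aligned}
i_t &= \sigma(x_t\,W_{ix} + h_{t-1}\,W_{im} + b_i) \\
f_t &= \sigma(x_t\,W_{fx} + h_{t-1}\,W_{fm} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(x_t\,W_{cx} + h_{t-1}\,W_{cm} + b_c) \\
o_t &= \sigma(x_t\,W_{ox} + h_{t-1}\,W_{om} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$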

In [13]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Every gate takes the new input (a one-hot-encoded character), the previous output and a bias.
  # We store the outputs and states across unrollings in saved_output and saved_state.

  # The classifier then uses the output to predict a probability distribution over the next character.
    
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.
  # train_inputs holds 10 unrollings; each is a batch of 64 one-hot characters (shape 64 x 27).
    
  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs: # Each unrolling supplies 64 new characters, one per sequence in the batch.
    # Given the new input character, the previous state and the output of the previous LSTM cell, get the new
    # output and state. Then append the output to outputs, since we compare each output to the labels
    # stored in train_labels.
    output, state = lstm_cell(i, output, state) 
    outputs.append(output) 


  # State saving across unrollings.
  # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    # tf.concat(outputs, 0) stacks the 10 [64, 64] output tensors along axis 0 into one [640, 64] tensor.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Shape 640 x 27: 10 predictions for each of the 64 sequences.
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
  optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
  reset_sample_state = tf.group( # To clear the memory of network, at the start of every new sequence. 
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state) # Generate new output and state.
  with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables updated.
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.
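
The optimizer in the graph above clips gradients with `tf.clip_by_global_norm(gradients, 1.25)`. The idea is to rescale all gradients by one common factor whenever their joint L2 norm exceeds the limit, so their relative directions are preserved. The NumPy sketch below illustrates that rule under the assumption that it matches the documented formula `g * clip_norm / max(global_norm, clip_norm)`; `clip_by_global_norm_np` is a name made up for this example.

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """Scale every gradient by the same factor so the joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The factor is 1.0 when the gradients are already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Two toy gradients whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm = clip_by_global_norm_np(grads, 1.25)

print(norm)
print(clipped)
```

Clipping only changes the step size, not its direction, which is why it is a common remedy for exploding gradients in unrolled recurrent networks.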

In [14]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.293862 learning rate: 10.000000
Minibatch perplexity: 26.95
nvvfbynv  fjje wnysmashdw ltses sfyriklemjfeotisxiskrmru d prosnwskoexegtdimynox
bcpcqmusytu mdv z bnpielx gllad til r teepzeialjnktd aiuethajkt  cucpqmvuqi  kwi
bboxtin  iqiaisy sflidoictujz fiteemto dhqyyibaeirunecmsznfoogsuqokxu faolkedgn 
cg rsjzffe il mugh fenseajde qrxi rktsu el jsqusxifgan nloiaqkknlxibcveacor ab d
o gasrenqbusvutdyttvtozgtpz tsengtimqblpdwggthleo  swumeenoadss  cmelksv et udn 
Validation set perplexity: 20.13
Average loss at step 100: 2.588153 learning rate: 10.000000
Minibatch perplexity: 10.08
Validation set perplexity: 10.10
Average loss at step 200: 2.232772 learning rate: 10.000000
Minibatch perplexity: 9.34
Validation set perplexity: 9.03
Average loss at step 300: 2.096857 learning rate: 10.000000
Minibatch perplexity: 7.55
Validation set perplexity: 7.83
Average loss at step 400: 2.007317 learning rate: 10.000000
Minibatch perplexity: 7.66
Validation set per

Validation set perplexity: 4.47
Average loss at step 4500: 1.614077 learning rate: 10.000000
Minibatch perplexity: 5.29
Validation set perplexity: 4.69
Average loss at step 4600: 1.616255 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 4.58
Average loss at step 4700: 1.628655 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 4.55
Average loss at step 4800: 1.634595 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.47
Average loss at step 4900: 1.634238 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 4.59
Average loss at step 5000: 1.607342 learning rate: 1.000000
Minibatch perplexity: 5.47
gest phill will suhsestase others abouta opposeales only dif justwill is the fai
ver one one five six two ernusba with large sings the calevine be peackemumant g
chabsels conquirs in the arpolined to several schoolica record lica boaldvic dia
zantilier eroglzing from intestrionath
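
The learning rate printed in the log drops from 10.0 to 1.0 at step 5000. This follows from the `tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)` schedule in the graph: with `staircase=True`, the rate is multiplied by 0.1 once every 5000 steps rather than decaying continuously. A minimal pure-Python sketch of that rule (the function name `staircase_decay` is made up for this example):

```python
def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` steps when decay is applied in discrete jumps."""
    return initial_rate * decay_rate ** (step // decay_steps)

for step in (0, 4900, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```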

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

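This is a small change: concatenate the four input weight matrices into one larger matrix, and do the same for the four previous-output weight matrices and the four biases. A single matrix multiplication then produces the pre-activations of all four gates, which are split back apart afterwards.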
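
Before touching the TensorFlow graph, the equivalence can be checked in NumPy. The sketch below uses toy sizes and random matrices (all names are local to this example) to show that one multiplication with the column-wise concatenated weights gives the same result as four separate multiplications.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_in, n_hidden = 3, 27, 4      # toy sizes for the example

x = rng.normal(size=(batch, n_in))
# Four separate input weight matrices: input gate, forget gate, update, output gate.
weights = [rng.normal(size=(n_in, n_hidden)) for _ in range(4)]

separate = [np.dot(x, w) for w in weights]          # four matrix multiplications
fused = np.dot(x, np.concatenate(weights, axis=1))  # one multiplication, shape (3, 16)
parts = np.split(fused, 4, axis=1)                  # slice back into the four gates

print(all(np.allclose(s, p) for s, p in zip(separate, parts)))
```

The same argument applies to the previous-output weights, and the biases can simply be concatenated in the same gate order.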

In [15]:
# Parameters:
# Input gate: input, previous output, and bias.
ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ib = tf.Variable(tf.zeros([1, num_nodes]))
# Forget gate: input, previous output, and bias.
fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
fb = tf.Variable(tf.zeros([1, num_nodes]))
# Memory cell: input, state and bias.                             
cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
cb = tf.Variable(tf.zeros([1, num_nodes]))
# Output gate: input, previous output, and bias.
ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ob = tf.Variable(tf.zeros([1, num_nodes]))

# Merge
i_m = tf.concat([ix, fx, cx, ox], 1)
o_m = tf.concat([im, fm, cm, om], 1)
b_m = tf.concat([ib, fb, cb, ob], 1)

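def lstm_cell(i, o, state):
    """LSTM cell using one matrix multiply for the input and one for the previous output."""
    # Pre-activations of all four gates at once, shape [batch, 4 * num_nodes].
    input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
    # Split along axis 1 in the same order the weights were concatenated.
    inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
    input_gate = tf.sigmoid(inp)
    forget_gate = tf.sigmoid(forg)
    output_gate = tf.sigmoid(out)
    state = forget_gate * state + input_gate * tf.tanh(update)
    return output_gate * tf.tanh(state), state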

Let's add it to the full graph to confirm it works as expected.

In [16]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():

    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (a one-hot-encoded character), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    # train_inputs holds 10 unrollings; each is a batch of 64 one-hot characters (shape 64 x 27).
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies 64 new characters, one per sequence in the batch.
        # Given the new input character, the previous state and the output of the previous LSTM cell, get the new
        # output and state. Then append the output to outputs, since we compare each output to the labels
        # stored in train_labels.
        output, state = lstm_cell(i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the 10 [64, 64] output tensors along axis 0 into one [640, 64] tensor.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Shape 640 x 27: 10 predictions for each of the 64 sequences.
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # To clear the memory of network, at the start of every new sequence. 
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.

In [17]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.295655 learning rate: 10.000000
Minibatch perplexity: 27.00
yoarbeo keiw i m ls n  xxsoo ozyp rv i rm pnthgrn cktoegi yilirnzl kqollzhfa  fi
fo lex sxvtevoycetdusinizj r efnk y nxv tntzlmoclreabl ldxheew d jynl gzdok   ne
odeewbpucteq ehllryesa u iin dgwjab nseebfreruoj uimolgerlejiwjie dhe rqwawv uao
 j oep i bnbll ydxo  eewjebaretoptyuvgoa mqcpkvkil ktaleb cebagojggf ixwbgu blsa
zazu ztmsahir psgjlyu   f tg tvewvrdfes pi wrteo e je  ioizdeajhmbw  o ipvmgi  j
Validation set perplexity: 20.21
Average loss at step 100: 2.578097 learning rate: 10.000000
Minibatch perplexity: 9.82
Validation set perplexity: 10.21
Average loss at step 200: 2.250795 learning rate: 10.000000
Minibatch perplexity: 9.45
Validation set perplexity: 9.06
Average loss at step 300: 2.098717 learning rate: 10.000000
Minibatch perplexity: 7.53
Validation set perplexity: 8.13
Average loss at step 400: 2.037512 learning rate: 10.000000
Minibatch perplexity: 7.21
Validation set perp

Validation set perplexity: 4.96
Average loss at step 4500: 1.642837 learning rate: 10.000000
Minibatch perplexity: 5.35
Validation set perplexity: 5.21
Average loss at step 4600: 1.627911 learning rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 5.08
Average loss at step 4700: 1.623909 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 5.20
Average loss at step 4800: 1.609406 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 5.26
Average loss at step 4900: 1.619431 learning rate: 10.000000
Minibatch perplexity: 5.66
Validation set perplexity: 5.00
Average loss at step 5000: 1.614665 learning rate: 1.000000
Minibatch perplexity: 4.60
w unarge of chiles has triagraast lembland by where pointicating with glay is ap
h grouks rearts an all signialy for tomoust is cambre quopis this dount historic
x anims zero zero three zero zero zero ze ouegarical rogequed were quarire irish
frow relion man natepievished throughe

This run was 27 seconds faster than the version that uses 8 separate matrix multiplications.

---
Problem 2
---------

We want to train an LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).
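
To put a number on "large": with the 27-symbol vocabulary used in this notebook there are 27 * 27 = 729 possible bigrams, so a one-hot bigram vector would have 729 entries with a single 1. The sketch below shows one possible way to map bigrams to integer ids and back. The helpers `bigram2id` and `id2bigram` are hypothetical names for this illustration and are not defined elsewhere in the notebook.

```python
import string

alphabet = ' ' + string.ascii_lowercase   # 27 symbols, space has id 0
n_chars = len(alphabet)
n_bigrams = n_chars ** 2                  # 729 possible bigrams

def bigram2id(bigram):
    """Hypothetical helper: map a two-character string to an id in [0, 729)."""
    first, second = bigram
    return alphabet.index(first) * n_chars + alphabet.index(second)

def id2bigram(bigram_id):
    """Inverse of bigram2id."""
    first, second = divmod(bigram_id, n_chars)
    return alphabet[first] + alphabet[second]

print(n_bigrams, bigram2id('ab'), id2bigram(bigram2id('ab')))
```

An integer id like this can be passed straight to an embedding lookup, which avoids materializing the 729-wide one-hot vector.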

---

## a) - Introduce embedding lookup to LSTM network.

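This only requires adding an embedding matrix, looking up the embedding of each input character, and feeding that embedding to the LSTM cell instead of the one-hot vector. The input weight matrices shrink from `vocabulary_size` rows to `embedding_size` rows accordingly.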
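
An embedding lookup is equivalent to multiplying a one-hot vector by the embedding matrix, just without the wasted multiplications by zero. The NumPy sketch below checks this on toy data. All names are local to the example, and `np.argmax` recovers the character id from the one-hot row in the same way `tf.argmax` is used in the graph below.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab, embedding_dim = 27, 10
embeddings = rng.uniform(-1.0, 1.0, size=(vocab, embedding_dim))

ids = np.array([1, 14, 0])        # three character ids
one_hot = np.eye(vocab)[ids]      # shape (3, 27)

looked_up = embeddings[np.argmax(one_hot, axis=1)]  # row selection
multiplied = np.dot(one_hot, embeddings)            # dense product, mostly zeros

print(np.allclose(looked_up, multiplied))
```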

In [18]:
num_nodes = 64
embedding_size = 10

graph = tf.Graph()
with graph.as_default():

    # Embedding
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. The output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded character), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    # train_inputs holds num_unrollings placeholders, each a batch of batch_size one-hot characters (vectors of size vocabulary_size = 27).
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of new characters.
        # Given the embedded input character, the previous output and the previous state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs, since every output is
        # compared with the corresponding label in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings, tf.argmax(i, axis=1)) # Change input to LSTM to the embedding.
        output, state = lstm_cell(embed_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the first (batch) dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b) # Dimension 640 x 27, 10 predictions / batch.
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size]) # Single char input.
    sample_input_embed = tf.nn.embedding_lookup(embeddings, tf.argmax(sample_input, axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the network's memory at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.

In [19]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.293133 learning rate: 10.000000
Minibatch perplexity: 26.93
dwfxoi o  ga dv yblurmjbmhiesizoxenq drj v  amymjqttv auorb lgqonz mtv oeiuclf m
 whksash  ie h b yy opsoapouwbg mct stnnjhzdc ee pa tmlnamiokkhoalee  a evg lbsy
aldcmiox gtgetu  j d  qfjjid mxslw wshmn i o bndlz vt jwq ed  nxnaosytgh dcoigwn
xdsntb ikmbt  y orvvshg mdusa eclrn dbueijfbtwbny lxe wzwhezrt nj rjw pnvnllb  y
gfrkk rqbmoibilhee utvoem kvcehg eqtmbqu  i keeamwmwwaq  nikowfm tom ernyexmviv 
Validation set perplexity: 20.05
Average loss at step 100: 2.452936 learning rate: 10.000000
Minibatch perplexity: 9.49
Validation set perplexity: 9.39
Average loss at step 200: 2.126612 learning rate: 10.000000
Minibatch perplexity: 7.09
Validation set perplexity: 8.28
Average loss at step 300: 1.983921 learning rate: 10.000000
Minibatch perplexity: 6.65
Validation set perplexity: 7.21
Average loss at step 400: 1.913475 learning rate: 10.000000
Minibatch perplexity: 6.75
Validation set perpl

Validation set perplexity: 5.22
Average loss at step 4500: 1.665427 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 5.12
Average loss at step 4600: 1.669560 learning rate: 10.000000
Minibatch perplexity: 6.22
Validation set perplexity: 4.99
Average loss at step 4700: 1.644642 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 5.17
Average loss at step 4800: 1.628960 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 5.15
Average loss at step 4900: 1.644662 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 5.03
Average loss at step 5000: 1.670021 learning rate: 1.000000
Minibatch perplexity: 6.25
th time low blanssurer evereences achos wosly when the exprendutions of it what 
quesed were sopior infasur the futtime by russional history it vilia musice larg
bucks affamiate by dulwninger kevents example by the freecflacts of knapsual gur
stones stech the methax chancars creat
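The perplexity values in the log above are the exponential of the average negative log-probability that the model assigns to the correct next character. A minimal NumPy sketch of that computation follows. `mean_neg_log_prob` is a hypothetical stand-in written for this illustration; it is assumed, not confirmed, to match the notebook's `logprob` helper, which is defined outside this section:

```python
import numpy as np

def mean_neg_log_prob(predictions, labels):
    """Average negative log-probability assigned to the true (one-hot) labels."""
    predictions = np.clip(predictions, 1e-10, 1.0)  # Guard against log(0).
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

vocabulary_size = 27
labels = np.eye(vocabulary_size)[[0, 5, 20]]  # Three arbitrary target characters.

# A model that knows nothing spreads probability uniformly over the vocabulary.
uniform = np.full((3, vocabulary_size), 1.0 / vocabulary_size)
perplexity = float(np.exp(mean_neg_log_prob(uniform, labels)))

# A model that is always certain and always right has perplexity 1.
perfect_perplexity = float(np.exp(mean_neg_log_prob(labels, labels)))

print(perplexity, perfect_perplexity)
```

Under this definition a uniform guess over 27 symbols has a perplexity of 27, which is consistent with the value of about 26.9 reported at step 0 before any training has taken place.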

## b) Write a bigram-based LSTM.

With the embedding structure above, bigram inputs are straightforward: each pair of consecutive characters is mapped to one of vocabulary_size * vocabulary_size indices, so that every bigram gets its own embedding. Apart from the input and the embedding table, nothing changes, since we are still predicting single characters.
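The index arithmetic can be checked in isolation. The sketch below uses plain Python and the same formula as the graph (first id plus `vocabulary_size` times second id); `bigram_id` and `split_bigram_id` are hypothetical helper names introduced only for this illustration:

```python
vocabulary_size = 27  # [a-z] + ' '

def bigram_id(first_id, second_id):
    """Map a pair of character ids to a single embedding row index."""
    return first_id + vocabulary_size * second_id

def split_bigram_id(index):
    """Inverse mapping: recover (first_id, second_id) from a bigram index."""
    second_id, first_id = divmod(index, vocabulary_size)
    return first_id, second_id

# Every one of the 27 * 27 character pairs receives its own index.
all_ids = {bigram_id(a, b)
           for a in range(vocabulary_size)
           for b in range(vocabulary_size)}

print(len(all_ids))
print(split_bigram_id(bigram_id(1, 14)))
```

Because the mapping is a bijection onto `range(vocabulary_size ** 2)`, an embedding table with `vocabulary_size ** 2` rows is exactly large enough.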

In [20]:
num_nodes = 64
embedding_size = 160

graph = tf.Graph()
with graph.as_default():

    # Embedding, now embedding a bigram input.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size**2, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. The output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded bigram), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_temp = train_data[:num_unrollings]
    train_inputs = [(train_data[i], train_data[i+1]) for i in range(len(train_temp)-1)]
    train_labels = train_data[2:]
    #print(len(train_inputs))
    #print(len(train_labels))
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of new bigrams (pairs of consecutive characters).
        # Given the embedded input bigram, the previous output and the previous state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs, since every output is
        # compared with the corresponding label in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings, 
                            tf.argmax(i[0], axis=1) + vocabulary_size*tf.argmax(i[1], axis=1)) # Change input to LSTM to the embedding.
        
        output, state = lstm_cell(embed_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the first (batch) dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        #print(logits.get_shape())
        #print(tf.concat(train_labels, 0).get_shape())
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(2)] # Bigram input.
    #print(sample_input)
    sample_input_embed = tf.nn.embedding_lookup(embeddings,
            tf.argmax(sample_input[0], axis=1) + vocabulary_size*tf.argmax(sample_input[1], axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the network's memory at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.

Next, we have to adjust our code so that the input is a bigram rather than a single character. This means changing the batch generator for the validation batches.

In [21]:
valid_batches = BatchGenerator(valid_text, 1, 2)
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

[' an']
['nar']
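The two windows printed above each hold three characters and overlap by one: the first two characters form the input bigram and the third is the character to predict. The plain-Python sketch below reproduces only that windowing. `bigram_windows` is a hypothetical helper, and the stride of 2 is inferred from the printed output rather than taken from `BatchGenerator`, which is defined outside this section and returns one-hot arrays rather than strings:

```python
def bigram_windows(text, stride=2):
    """Yield (first, second, target) character triples from `text`.

    Consecutive windows overlap by one character, so the target of one
    window becomes the first input character of the next.
    """
    for start in range(0, len(text) - 2, stride):
        yield text[start], text[start + 1], text[start + 2]

windows = list(bigram_windows(' anarchism'))
for first, second, target in windows[:2]:
    print(repr(first + second), '->', repr(target))
```

In the validation loop, the first two entries of each batch play the role of `first` and `second`, and the third entry is the label passed to `logprob`.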


In [22]:
num_steps = 7001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = [sample(random_distribution()), sample(random_distribution())]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, feed)})
            feed.append(sample(prediction))
            del feed[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, b)})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.293594 learning rate: 10.000000
Minibatch perplexity: 26.94
ua gefxudt ierbi inltfeutlie b hdn exevd piayy rc cq oi re nl me a obf opua puoq 
jua ub hqt sj eleqjtj b pr qyu zyw ef rdkkmyy rrw  rjie oyxj rp  tial tut e sd wd
ec lojk ienp xaokvur   et ozyxlt g toa vt a nwc a i  qhlepge n s kwqxo l jizmd ti
byee sah u  u  wvo op iarl qe gjjexr dzpdewei   tepl  l exwqwduamrs reqi lezktq s
ebxu  qk  oii ri bmd oeneotkqkrra ws wdi lri m ml bngttnd erhh zc kn the erfjvz h
Validation set perplexity: 23.31
Average loss at step 100: 2.261965 learning rate: 10.000000
Minibatch perplexity: 7.57
Validation set perplexity: 8.71
Average loss at step 200: 1.952973 learning rate: 10.000000
Minibatch perplexity: 6.97
Validation set perplexity: 7.98
Average loss at step 300: 1.872797 learning rate: 10.000000
Minibatch perplexity: 6.36
Validation set perplexity: 7.65
Average loss at step 400: 1.813687 learning rate: 10.000000
Minibatch perplexity: 5.85
Validation set 

Validation set perplexity: 6.60
Average loss at step 4500: 1.580288 learning rate: 10.000000
Minibatch perplexity: 4.64
Validation set perplexity: 6.46
Average loss at step 4600: 1.583433 learning rate: 10.000000
Minibatch perplexity: 4.96
Validation set perplexity: 6.60
Average loss at step 4700: 1.593230 learning rate: 10.000000
Minibatch perplexity: 4.64
Validation set perplexity: 6.60
Average loss at step 4800: 1.586886 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 6.81
Average loss at step 4900: 1.609750 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 6.68
Average loss at step 5000: 1.618724 learning rate: 1.000000
Minibatch perplexity: 5.73
kkedally a lobelins of iiisment unionary potte suppread not from manics for mikee
qmt of first to june zero neople estate to gextedus his this to a rendised are gr
nnected the promous tre two chapification proviners carl can of firchough constil
stripa east invensydhousamon cribin

## c) Introduce Dropout.

Introduce dropout at the input and increase the model's capacity, in terms of both the embedding size and the number of nodes in the LSTM network.
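As a reminder of what input dropout does, here is a minimal NumPy sketch of inverted dropout. It is an illustration under stated assumptions, not the TensorFlow implementation: the `dropout` function and its `rng` argument are hypothetical. Note that in TensorFlow 1.x the second positional argument of `tf.nn.dropout` is the probability of keeping a unit, so a value of 0.7 keeps roughly 70% of the inputs:

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each element with probability 1 - keep_prob
    and scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((64, 180))  # Same shape as one batch of bigram embeddings.
dropped = dropout(x, keep_prob=0.7, rng=rng)

kept_fraction = float(np.mean(dropped != 0))
print(round(kept_fraction, 2), round(float(dropped.mean()), 2))
```

Dropout is only applied on the training path; the sampling and validation path feeds the embeddings to the cell unchanged, which is the usual convention at evaluation time.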

In [315]:
num_nodes = 124
embedding_size = 180
dropout_rate = 0.7 # Passed to tf.nn.dropout as keep_prob (TF 1.x): 70% of the input units are kept, 30% are dropped.

graph = tf.Graph()
with graph.as_default():

    # Embedding, now embedding a bigram input.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size**2, embedding_size], -1.0, 1.0))
    
    # Parameters, adjust the inputs to be of size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. The output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input (an embedded bigram), the previous output and a bias.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses the output to predict a probability distribution over the next character.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_temp = train_data[:num_unrollings]
    train_inputs = [(train_data[i], train_data[i+1]) for i in range(len(train_temp)-1)]
    train_labels = train_data[2:]
    #print(len(train_inputs))
    #print(len(train_labels))
    
    
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling supplies a batch of new bigrams (pairs of consecutive characters).
        # Given the embedded input bigram, the previous output and the previous state of the LSTM cell,
        # compute the new output and state. Each output is appended to outputs, since every output is
        # compared with the corresponding label in train_labels.
        embed_i = tf.nn.embedding_lookup(embeddings,
                            tf.argmax(i[0], axis=1) + vocabulary_size*tf.argmax(i[1], axis=1)) # Change input to LSTM to the embedding.
        #print(embed_i.get_shape())
        dropout_i = tf.nn.dropout(embed_i, dropout_rate) # Dropout on the input; in TF 1.x the second argument is keep_prob.
        output, state = lstm_cell(dropout_i, output, state) 
        outputs.append(output) 


    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-unrolling outputs along the first (batch) dimension.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        #print(logits.get_shape())
        #print(tf.concat(train_labels, 0).get_shape())
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(2)] # Bigram input.
    #print(sample_input)
    sample_input_embed = tf.nn.embedding_lookup(embeddings,
            tf.argmax(sample_input[0], axis=1) + vocabulary_size*tf.argmax(sample_input[1], axis=1)) # Embedded input.
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( # Clears the network's memory at the start of every new sequence.
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embed, saved_sample_output, saved_sample_state) # Generate new output and state.
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables are updated.
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b)) # Make sample prediction.

(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)
(64, 180)


In [29]:
num_steps = 20001
summary_frequency = 400
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1): 
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:]) # Create the labels.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))) # Calculate perplexity of batch.
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = [sample(random_distribution()), sample(random_distribution())]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, feed)})
            feed.append(sample(prediction))
            del feed[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input_i: 
                                                 feed_i for sample_input_i, feed_i in zip(sample_input, b)})
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
    
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.361895 learning rate: 10.000000
Minibatch perplexity: 28.84
fbz m remeooy o v hgjrs n  me gmes u i slpotekapeeyley uirkes pdom czinj zo  u n 
eph vveeuiiem th u i strbes wh yts kzhgcg  f vcen emesileeisrwjzeoded eeaertn lfv
ahzi n es stc ttesbfeteid et ioekf  se eg   koqdezjnv qs oep ikz   ne lc coeniyex
ukl  zdtvqeeoteuaz n o j cpy q eweeoeeyks xs lzuco r vsea  ja  ieeudhsmteh tqh vv
py keu eo l oea sle i  sc bekvep hslg unyebdwa eko cselbnt m d enrseeqoeekbo s ev
Validation set perplexity: 27.33
Average loss at step 400: 2.114720 learning rate: 10.000000
Minibatch perplexity: 6.70
Validation set perplexity: 7.73
Average loss at step 800: 1.869328 learning rate: 10.000000
Minibatch perplexity: 6.62
Validation set perplexity: 7.19
Average loss at step 1200: 1.784729 learning rate: 10.000000
Minibatch perplexity: 5.65
Validation set perplexity: 6.80
Average loss at step 1600: 1.760834 learning rate: 10.000000
Minibatch perplexity: 6.47
Validation se

Validation set perplexity: 5.83
Average loss at step 18000: 1.629276 learning rate: 0.010000
Minibatch perplexity: 5.50
Validation set perplexity: 5.83
Average loss at step 18400: 1.639973 learning rate: 0.010000
Minibatch perplexity: 6.18
Validation set perplexity: 5.84
Average loss at step 18800: 1.636504 learning rate: 0.010000
Minibatch perplexity: 4.96
Validation set perplexity: 5.84
Average loss at step 19200: 1.647540 learning rate: 0.010000
Minibatch perplexity: 5.56
Validation set perplexity: 5.84
Average loss at step 19600: 1.677715 learning rate: 0.010000
Minibatch perplexity: 5.05
Validation set perplexity: 5.84
Average loss at step 20000: 1.651199 learning rate: 0.001000
Minibatch perplexity: 5.40
xds   ke which turn it ia the laking sorter the government his histories one one 
cp in shaped timily to eight and it bpick acquired and eight two spher presined a
gqls to memberb one john oriesd instimunity gmented by up the broad and gdom betw
es are form located at the bloadio

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
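Before building the model, it helps to pin down the target transformation, since it is also what generates the training pairs. A minimal sketch in plain Python follows; `mirror_words` is a hypothetical helper name, not part of the assignment code:

```python
def mirror_words(sentence):
    """Reverse the characters of each word while keeping the word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

source = 'the quick brown fox'
target = mirror_words(source)
print(target)
```

Applying the function to any span of the corpus yields a (source, target) pair for supervised training, so no extra labelled data is needed.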

---

In [5]:
url = 'http://mattmahoney.net/dc/'
def maybe_download(filename, expected_bytes, location):
    """Download a file if not present, and make sure it's the right size."""
    path = os.path.join(location, filename)
    if not os.path.exists(path):  # Check the download location, not the working directory.
        path, _ = urlretrieve(url + filename, path)
    statinfo = os.stat(path)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % path)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + path + '. Can you get to it with a browser?')
    return path

def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data
  
filename = maybe_download('text8.zip', 31344016, 'input')
text = read_data(filename)
print('Data size %d' % len(text))

Found and verified input/text8.zip
Data size 100000000


In [66]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
  
def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


In [206]:
data_words = text.split() # Split data into words; the whitespace between them is discarded.
print(data_words[0:100])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing']


In [208]:
valid_size = 100
valid_words = data_words[:valid_size]
train_words = data_words[valid_size:]
train_size = len(data_words) # NOTE: counts every word, including the valid_size validation words.
print(str(train_size) + "\n", train_words[:64])
print(str(valid_size) + "\n", valid_words)

17005207
 ['interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'social', 'movements', 'that', 'advocate', 'the', 'elimination', 'of', 'authoritarian', 'institutions', 'particularly', 'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists', 'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie', 'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society', 'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian', 'political', 'structures', 'and', 'coercive', 'economic', 'institutions', 'anarchists', 'advocate', 'social', 'relations', 'based', 'upon']
100
 ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorati

In [301]:
batch_size = 64
sentance_length = 50 # Length of each generated sentence, in characters.


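def generate_batch(data, batch_size):
    """Build one-hot sentences of random words, each at most sentance_length characters long."""
    batch = np.zeros(shape=(batch_size, sentance_length, vocabulary_size), dtype=np.float64)
    for j in range(batch_size):
        i = 0 # Number of characters used so far in sentence j.
        batch_j = list()
        while i < sentance_length:
            ind = np.random.randint(0, len(data))
            randomword = data[ind]
            if i + len(randomword) + 1 < sentance_length:
                if i != 0:
                    i += 1 # Account for the space that joins this word to the previous one.
                batch_j.append(randomword)
                i += len(randomword)
            elif batch_j:
                # The word does not fit; stop once the sentence holds at least one word.
                break
        for k, char in enumerate(' '.join(batch_j)):
            batch[j, k, char2id(char)] = 1.0
        # Positions past the end of the sentence stay all-zero; argmax decodes them as ' '.
    return batch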

def batch2string(batch):
    """Decode a one-hot batch into a list of strings, one per sentence."""
    return [''.join(id2char(np.argmax(b)) for b in bat) for bat in batch]

batch = generate_batch(train_words, batch_size)
print(batch2string(batch)[1:10])

['deemed that is only such total or equivalent      ', 'i anarcho two the two when soviet follow genocide ', 'originally introduction the his torpparit zero    ', 'other westminster fled in gland in s dido of      ', 'charge it two that one and tago three circuits of ', 'if directly third ed r metal pope plagues life    ', 'keep increased three german english lattice to    ', 'five compose all the warming pakistani hollywood  ', 's council eight research produces ump the s of    ']
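
Since each sentence is a one-hot array, there are two different ways to build a "mirrored" target from it: reversing the whole character sequence, or mirroring each word in place as the problem statement asks. The sketch below contrasts the two on a single sentence. The names `VOCAB`, `encode`, `decode` and `mirror_target` are hypothetical stand-ins for the notebook's `char2id`/`id2char` helpers, so that the block runs on its own.

```python
import string
import numpy as np

VOCAB = ' ' + string.ascii_lowercase  # index 0 is the space, 1..26 are a..z

def encode(sentence, length):
    """One-hot encode a sentence into a [length, 27] array; unused rows stay all-zero."""
    one_hot = np.zeros((length, len(VOCAB)), dtype=np.float32)
    for k, char in enumerate(sentence):
        one_hot[k, VOCAB.index(char)] = 1.0
    return one_hot

def decode(one_hot):
    """Decode by argmax; all-zero padding rows come out as spaces."""
    return ''.join(VOCAB[i] for i in np.argmax(one_hot, axis=1))

def mirror_target(one_hot):
    """Mirror each word in place. Padding is re-encoded as explicit spaces."""
    text = decode(one_hot)
    mirrored = ' '.join(word[::-1] for word in text.split(' '))
    return encode(mirrored, one_hot.shape[0])

x = encode('the quick brown fox', 24)
print(repr(decode(x[::-1])))           # whole sequence reversed, padding moves to the front
print(repr(decode(mirror_target(x))))  # each word mirrored, padding stays at the end
```

Reversing along the time axis also reverses the word order and moves the padding to the start of the target, which is a different task from the one described in the problem.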


In [302]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float64) # np.float was removed in NumPy 1.24.
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
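
`logprob` above divides by `labels.shape[0]`, which is the number of characters only when `labels` is a 2-D `[characters, vocabulary_size]` array. A minimal sketch of a per-character perplexity that does not depend on the array layout is shown below; `perplexity` is a hypothetical helper, and it assumes `labels` is one-hot so that its sum equals the number of labelled positions.

```python
import numpy as np

def perplexity(predictions, labels):
    """exp of the mean negative log-likelihood per labelled position."""
    predictions = np.clip(predictions, 1e-10, 1.0)  # avoid log(0)
    n = labels.sum()  # number of one-hot labelled positions
    return float(np.exp(-np.sum(labels * np.log(predictions)) / n))

# A uniform prediction over 27 characters should give a perplexity of 27,
# whatever the shape of the arrays.
labels = np.zeros((4, 5, 27))
labels[:, :, 0] = 1.0
uniform = np.full((4, 5, 27), 1.0 / 27)
print(perplexity(uniform, labels))
```

With this normalisation the perplexity is simply the exponential of the mean cross-entropy loss, so a loss of about 1.65 corresponds to a perplexity a little above 5.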

In [376]:
num_nodes = 24
embedding_size = 100
dropout_rate = 0.5

graph = tf.Graph()
with graph.as_default():

    # Embedding: maps each character ID to a dense vector of size embedding_size.
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    # Encoder parameters. The inputs have size embedding_size instead of vocabulary_size:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m = tf.concat([ix, fx, cx, ox], 1)
    o_m = tf.concat([im, fm, cm, om], 1)
    b_m = tf.concat([ib, fb, cb, ob], 1)
  
    # Decoder parameters. The decoder's input is its own previous output, so the inputs have size num_nodes:
    # Input gate: input, previous output, and bias.
    ix2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    im2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib2 = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fm2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb2 = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cm2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb2 = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    om2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob2 = tf.Variable(tf.zeros([1, num_nodes]))

    # Merge.
    i_m2 = tf.concat([ix2, fx2, cx2, ox2], 1)
    o_m2 = tf.concat([im2, fm2, cm2, om2], 1)
    b_m2 = tf.concat([ib2, fb2, cb2, ob2], 1)
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases. Output is still a probability distribution over characters.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Every gate takes the new input, the previous output and a bias. The encoder's input is an embedded
    # character; the decoder's input is its own previous output.
    # We store the outputs and states across unrollings in saved_output and saved_state.

    # The classifier then uses each decoder output to predict a probability distribution over characters.

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m) + tf.matmul(o, o_m) + b_m
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    def lstm_cell_decode(i, o, state):
        input_forget_update_out = tf.matmul(i, i_m2) + tf.matmul(o, o_m2) + b_m2
        inp, forg, update, out = tf.split(input_forget_update_out, 4, 1)
        input_gate = tf.sigmoid(inp)
        forget_gate = tf.sigmoid(forg)
        output_gate = tf.sigmoid(out)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(sentance_length):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data
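    # NOTE: this reverses the whole character sequence (padding included), so the target for
    # "the quick brown fox" is "xof nworb kciuq eht" rather than each word mirrored in place.
    train_labels = train_data[::-1]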
    #print(len(train_inputs))
    #print(len(train_labels))
    
    # Encode
    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs: # Each unrolling step consumes one character for each of the batch_size sentences.
        # Given the embedded input character, the previous output and the previous state, compute the new
        # output and state. Encoder outputs are not collected; only the final output and state are
        # passed on to the decoder.
        #print(i.get_shape())
        embed_i = tf.nn.embedding_lookup(embeddings, 
                            tf.argmax(i, axis=1))
        #dropout_i = tf.nn.dropout(embed_i, dropout_rate) # Add dropout to input.
        #print(embed_i.get_shape())
        #print(output.get_shape())
        output, state = lstm_cell(embed_i, output, state) 

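    # Decode
    # The decoder is fed its own previous output at every step, starting from the encoder's final output and state.
    for _ in train_labels:
        output, state = lstm_cell_decode(output, output, state)
        outputs.append(output)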

    # State saving across unrollings.
    # tf.control_dependencies ensures that we update saved_output and saved_state before performing the loss calculations.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
        # Classifier.
        # tf.concat(outputs, 0) stacks the per-step [batch_size, num_nodes] outputs along axis 0 (time-major order).
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        #print(logits.get_shape())
        #print(tf.concat(train_labels, 0).get_shape())
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))  # Unzip the (gradient, variable) pairs.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # Clip the gradients to avoid "exploding gradient"
    optimizer = optimizer.apply_gradients( 
    zip(gradients, v), global_step=global_step) # Optimize with clipped gradients.

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
    # Sampling and validation eval: batch size 1, unrolled over sentance_length steps like the training graph.
    sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(sentance_length)] 
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes])) # Sample output.
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes])) # Sample saved state.
    reset_sample_state = tf.group( 
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_outputs = list()
    sample_output = saved_sample_output
    sample_state = saved_sample_state
    for i in sample_input:
        embed_i = tf.nn.embedding_lookup(embeddings, 
            tf.argmax(i, axis=1))
        sample_output, sample_state = lstm_cell(embed_i, sample_output, sample_state)
    
    for i in sample_input:
        sample_output, sample_state = lstm_cell_decode(sample_output, sample_output, sample_state)
        sample_outputs.append(sample_output) # Collect the sample decoder output, not the training output.
        
    with tf.control_dependencies([saved_sample_output.assign(sample_output), # Ensure variables updated.
                                saved_sample_state.assign(sample_state)]):
        log = tf.nn.xw_plus_b(tf.concat(sample_outputs, 0), w, b)
        sample_prediction = tf.nn.softmax(log)
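
To make the fused gate computation in `lstm_cell` easier to follow, here is a NumPy sketch of a single step. It is an illustration under assumptions rather than the graph code itself: `lstm_step`, `w_x`, `w_h` and `b` are hypothetical names standing in for `i_m`, `o_m` and `b_m`, with the four gates concatenated in the same order (input, forget, update, output).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, w_x, w_h, b):
    """One LSTM step with all four gates computed by a single pair of matmuls."""
    gates = np.dot(x, w_x) + np.dot(h, w_h) + b          # [batch, 4 * num_nodes]
    inp, forg, update, out = np.split(gates, 4, axis=1)  # each [batch, num_nodes]
    c_new = sigmoid(forg) * c + sigmoid(inp) * np.tanh(update)
    h_new = sigmoid(out) * np.tanh(c_new)
    return h_new, c_new

# Toy shapes: batch of 2, input size 5, 3 hidden nodes.
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(2, 5))
h = np.zeros((2, 3))
c = np.zeros((2, 3))
w_x = rng.uniform(-0.1, 0.1, size=(5, 12))
w_h = rng.uniform(-0.1, 0.1, size=(3, 12))
b = np.zeros((1, 12))
h, c = lstm_step(x, h, c, w_x, w_h, b)
print(h.shape, c.shape)
```

Merging the four weight matrices this way replaces eight small matrix products per step with two larger ones; the arithmetic is otherwise the same as computing each gate separately.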
        

In [393]:
num_steps = 10001
summary_frequency = 100
start_time = time.time()

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batch = generate_batch(train_words, batch_size)
    feed_dict = dict()
    for i in range(sentance_length): 
      feed_dict[train_data[i]] = batch[:,i,:]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
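      labels = batch # NOTE: the graph's targets are the time-reversed inputs (train_data[::-1]), not the batch itself.
      # NOTE: predictions has shape [sentance_length * batch_size, vocabulary_size] in time-major order, so this
      # reshape does not line its rows up with labels, and logprob divides by the number of sentences only.
      # The value printed below is therefore not a per-character perplexity.
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions.reshape([labels.shape[0],labels.shape[1],labels.shape[2]]), labels))))
      # Print the first three input sentences and the decoded predictions (same reshape as above).
      print(batch2string(labels)[:3])
      print(batch2string(predictions.reshape([labels.shape[0],labels.shape[1],labels.shape[2]])[:3]))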
      """
      reset_sample_state.run()
      valid_logprob = 0
      b = generate_batch(valid_words, 1)
      bb = np.reshape(b[0,:,:],[50,1,27])
      d = {sample_input_i: 
          feed_i for sample_input_i, feed_i in zip(sample_input, bb)}
      #print(d)
      predictions = sample_prediction.eval(feed_dict = d)      
      valid_logprob = valid_logprob + logprob(predictions, b)
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
      for i in predictions:
            print(np.shape(i))
      """
print("--- %s seconds ---" % (time.time() - start_time))

Initialized
Average loss at step 0: 3.034232 learning rate: 10.000000
Minibatch perplexity: 772332471246575071240784494286477140677903726883612044585766748160.00
['norway to patterns whereas zero time by           ', 'dialects other nine a red and area marxism colony ', 'nutmeg a also three s ships she academics gleam   ']
['dwsddddfsddddddddddkdddddddddddddddddddddddddddddd', 'ddhddbddddwdddvwvvvvjfvvvvvvvvvvvbvvvrvvvvvrvvvvvv', 'vvvvvvvvvvvvvvvvevvrvvvvwrvvdwdddddhdddddddddddbdd']
Average loss at step 100: 2.693113 learning rate: 10.000000
Minibatch perplexity: 149123143524705647442546672025967691269368044347814857146368.00
['anthropology the one and hapalinae kind is large  ', 'answer issues from property habitable action duc  ', 'a eight one a berlin of watering as oils way van  ']
['eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee']
Average loss at step 200: 2.639030 learn

Average loss at step 1800: 2.465129 learning rate: 10.000000
Minibatch perplexity: 46166734021184222105251693950545868567880522896456734466891579392.00
['veterans of palatalized then in is to rahman two  ', 'in damaged to maker two for in re atheist nine if ', 'eight one the available two the the babe zero     ']
['ssssssssssssssssssssssssssssssssssssessssssssssses', 'ssssssssssssessssssssssssssssssssssssssssssssassss', 'esssssssssssesssssssssssssessisassnssssseaissssass']
Average loss at step 1900: 2.466820 learning rate: 10.000000
Minibatch perplexity: 53424559668538273408879852028727132974304461335271449342697799680.00
['lunar years and robotic of some believed his a    ', 'highly islam his with etching in were of great    ', 'is one saddles battle not eight to its seven that ']
['oooooooooooooooooooooooooooooooooooooooooooooooooo', 'ooooooooooooooooaooottoooooooooootoooooooooooosoos', 'ooaoosooootoosootoooooooooootoaoosatttooootooottos']
Average loss at step 2000: 2.460286 learning

Average loss at step 3600: 2.470948 learning rate: 10.000000
Minibatch perplexity: 9321136119417829648803873080190439881390636629602861025005013322630471300423047947747328.00
['sin year road they von the people language        ', 'sketch presses eight the queen carriers these     ', 'became draft appear zero elegy statements french  ']
['nnnnnnnnnnnnnnnnnnnnnannnnnnnnnnnnnnnnnnnnnnnnnnnn', 'nnnnnnnnnnnannaaaaaanaaaaaaaaaaaananaaanaaaaaanaaa', 'aaanaaaaaaaannaaaaaaaannnnaaaaaaaanaaaaaaaaaaaanan']
Average loss at step 3700: 19.055624 learning rate: 10.000000
Minibatch perplexity: 13630274316923911372842946044956160347320710883384877906806021439256095027921388907918594242007513122279532241717454646864827558088055096123498237782888571595654788607493641197666678325569402048820416477060197615729912984675659407611244273252137075412142626177024.00
['the joined stations of his as visual priests      ', 'alias on of in is cities this bonavista opposed   ', 'of visible there european the have eig



Average loss at step 3900: 21.783706 learning rate: 10.000000
Minibatch perplexity: inf
['civil anchor ring has most a provincial absolute  ', 'beverage through now theme abalone running        ', 'corporate arms became near at unitary essays non  ']
['tttttttttttttttttttttttttttttttttttttttttttttttttt', 'tttttttttttttttttttttttttttttttttttttttttttttttttt', 'tttttttttttttttttttttttttttttttttttttttttttttttttt']


KeyboardInterrupt: 
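
The very large "Minibatch perplexity" values above are consistent with the log-probability being summed over all 50 characters of a sentence rather than averaged: the step-0 value is close to exp(50 × 3.03). In addition, `tf.concat(outputs, 0)` yields rows in time-major order, which a plain reshape to `[batch_size, sentance_length, vocabulary_size]` does not undo. The sketch below shows one way to realign such an array on a toy example; `to_batch_major` is a hypothetical helper and the toy sizes are assumptions for illustration.

```python
import numpy as np

def to_batch_major(flat, num_steps, batch_size):
    """Reorder time-major rows [num_steps * batch_size, vocab] into [batch_size, num_steps, vocab]."""
    vocab = flat.shape[-1]
    return flat.reshape(num_steps, batch_size, vocab).transpose(1, 0, 2)

# Toy check: 3 time steps, 2 sentences, vocabulary of 1.
# Each value encodes 10 * step + sentence so the layout is easy to read.
num_steps, batch_size = 3, 2
per_step = [np.array([[10.0 * t + b] for b in range(batch_size)]) for t in range(num_steps)]
flat = np.concatenate(per_step, axis=0)          # same row layout as tf.concat(outputs, 0)

aligned = to_batch_major(flat, num_steps, batch_size)
naive = flat.reshape(batch_size, num_steps, 1)   # plain reshape, as in the training loop

print(aligned[1, :, 0])  # sentence 1 at steps 0, 1, 2
print(naive[1, :, 0])    # rows from different sentences and steps mixed together
```

For this notebook the realigned predictions would then be compared against the time-reversed batch, `batch[:, ::-1, :]`, since that is what `train_labels` feeds to the loss.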