Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  f = zipfile.ZipFile(filename)
  for name in f.namelist():
    return tf.compat.as_str(f.read(name))
  f.close()
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


# Base: Utility functions map characters and vocabulary IDs

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0]) # brings back integer value for unicode char 

def char2id(char):
    if char in string.ascii_lowercase: # these are letters a to z
        # so, if the input character is in a->z, return its integer representation
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
    
def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


so, this shows us that char2id returns an integer that corresponds to the unicode letter (character) when known and 0 when blank or not known. it also shows the reverse.

ord(c)
Given a string of length one, return an integer representing the Unicode code point of the character when the argument is a unicode object, or the value of the byte when the argument is an 8-bit string. For example, ord('a') returns the integer 97, ord(u'\u2020') returns 8224. This is the inverse of chr() for 8-bit strings and of unichr() for unicode objects. If a unicode argument is given and Python was built with UCS2 Unicode, then the character’s code point must be in the range [0..65535] inclusive; otherwise the string length is two, and a TypeError will be raised.

In [12]:
print(string.ascii_lowercase)
print(vocabulary_size)
print(len(text))
print(len(text)// batch_size)

abcdefghijklmnopqrstuvwxyz
27
100000000
1562500


# base: (training) batch generator

In [40]:
batch_size=64 # this is the number of text characters in the input mini-batch
num_unrollings=10

# building a BatchGenerator class of objects with methods to:
# 1. initialize a set of variables
# 2. generate batches of size = 64 of 11 letter segments
# 3. 

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text # this is loading up the 100M character text file
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size # 1.6M batches
        self._cursor = [ offset * segment for offset in range(batch_size)]
        # tracks cursor position, incrementing its position by 64 places 
        self._last_batch = self._next_batch()
        
    def _next_batch(self): 
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
        # initializing a matrix of batch size = 64 X vocabulary size = 27.
        for b in range(self._batch_size): # for 0 -> 63
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0 
            # put a 1 in column corresponding to the integer id of the text character the cursor is on
            # do that for each character in the current batch
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size # increment the cursor
        return batch
    # so, result here s/b a sparse matrix w/ 64 rows, one for each input character, and
    # 27 rows, one for each letter of the alphabet. mostly 0's except in column corresponding
    # to ord integer value for each input character.
    
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones."""
        batches = [self._last_batch] # set batch = to the last batch in prior set
        for step in range(self._num_unrollings): # 0->9+1 or 11
            batches.append(self._next_batch()) # append the next sequence of 10 batches
        self._last_batch = batches[-1] # set last batch pointer to the 10th position
        return batches
    # so, result here s/b batches which is ten 64 x 27 matrices appended vertically. 
    # however, output below seems like it's 64 1x10 arrays. 
    
    def characters(probabilities):
        """Turn a 1-hot encoding or a probability distribution over the possible
        characters back into its (most likely) character representation."""
        return [id2char(c) for c in np.argmax(probabilities, 1)]
    # so, take in one-hot encoding and return its actual character representation
    
    def batches2string(batches):
        """Convert a sequence of batches back into their (most likely) string representation."""
        s = [''] * batches[0].shape[0]
        for b in batches:
            s = [''.join(x) for x in zip(s, characters(b))] 
            # join concatenates (end to beginning) strings
            # zip zips the strings together forming a tuple
        return s
    # example output below

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

# so, below is what two training batch looks like. 

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

zip([iterable, ...]), as in "like a zipper for joining tubles"
This function returns a list of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The returned list is truncated in length to the length of the shortest argument sequence. When there are multiple arguments which are all of the same length, zip() is similar to map() with an initial argument of None. With a single sequence argument, it returns a list of 1-tuples. With no arguments, it returns an empty list.

The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n). zip() in conjunction with the * operator can be used to unzip a list:
>>>
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True

***

7.1.5. String functions
The following functions are available to operate on string and Unicode objects. They are not available as string methods.

string.join(words[, sep])
Concatenate a list or tuple of words with intervening occurrences of sep. The default value for sep is a single space character. It is always true that string.join(string.split(s, sep), sep) equals s.


# base: sampling functions

In [41]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
    # assuming labels are one-hot, this seems to be capturing predicted probability of the single "1" label

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized probabilities."""
    r = random.uniform(0, 1) # generates single num btwn 0 and 1
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1
    # so, while later it will be good to save multiple sequences selecting the best,
    # here we're generating a random number r (the "hurdle", then selecting the value (observation?) 
    # from the input distribution that "tips the probability scale" over the r "hurdle"
    # I think predictions include predicted values and some sample of non-predicted or 0 values.

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p
    # so, this is taking that tipping point from above and one-hot encoding the position on the prediction vector

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]

In [18]:
print(random.uniform(0, 1))
print(vocabulary_size)

0.922659605717
27


# base: define simple LSTM model

In [17]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
        # these are 27 x 64 matrices
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell: input, state and bias.                          
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)

    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf 
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib) 
        # applying logistic to sum of input + previous output + bias for input gate
        
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        # applying logistic to sum of input + previous output + bias
        
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        # update dependent on input, state and bias of the memory cell ("c")
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        # state multiplies the decimal values of the forget logistic * (prior?) state + input logistic * tanh
        
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        # output gate is sigmoid of input, output and bias to output gate
        
        # output sigmoid is function always of input and prior state, but also some above threshold memory
        return output_gate * tf.tanh(state), state
    
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
        # batch_size = 64, vocabulary_size = 27
        # so, for each "unrolling" we're appending 1608 length vector to train_data 
    
    train_inputs = train_data[:num_unrollings] # so, each input is 11 characters. there s/b 64 per batch
    train_labels = train_data[1:]  # labels are inputs shifted by one time step. need to see how used.
    
    # Unrolled LSTM loop.
    outputs = list() # blank list object
    output = saved_output # s_o saves state across unrollings
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state) 
        # pass lstm function the training input record, the prior output and the prior state. get update
        outputs.append(output) # add output to the stack
        
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        # looks like it's multiplying input (x) * weights (w) + biases (b) to get logits
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf.concat(0, train_labels)))
        # running softmax cross entropy loss function using logits, train_labels
        
    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
        # decaying (stepping way down (stair=True)) learning rate every 5k steps i.e., from 10 to 1
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
        # passing loss function, computing gradient on loss, unzipping to provide gradients and "v"
    
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
        # clipping to eliminate possible exploding gradient
        
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
        # re-zipping possibly corrected gradient and "v", applying those gradients to optimizer
        
    # Predictions.
    train_prediction = tf.nn.softmax(logits)
        # simple softmax of previously calculated logits
        
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        # using the last sample output and state, softmax predict on logits from current output AND current w & b

# base: run simple LSTM model

In [32]:
num_steps = 7001
summary_frequency = 100
    
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):
        batches = train_batches.next() # grab next batch
        feed_dict = dict() # initialize a new dictionary
        
        for i in range(num_unrollings + 1):
            # batches is a list of 64 arrays of 27 numbers representing letters
            # below, you're loading that into the dictionary to feed mini-batches
            feed_dict[train_data[i]] = batches[i]
            
        # call tensorFlow to get training predictions
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        
        # increment average loss?
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:]) # flattening batches into single list
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
        
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution()) # generates column of random probabilities
                    sentence = characters(feed)[0] # finds characters according the the probs
                    reset_sample_state.run()
                
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    
                    print(sentence)
                print('=' * 80)
                
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.298881 learning rate: 10.000000
Minibatch perplexity: 27.08
kkm aananbohnxnpg ia  evg etleis oftxr yez ibfr eans r vqecr evhvztgsuv dm  teih
vojd tdbix flpngneofg dhrreclsg rm  yexmsygawaktvvq b onvkik  zrtr  hu pidoaa vs
gw ieklknjcto xr nsnn wcpy  tioepenujong ww fs ibeqtkdwp tpbnbvpntluojwuiwytabfn
kmiahmst ani  ctqinf ahmag  emylk ioth  oumvb jdwhyinpmfb rnoayxhtm  ytl tyttbae
e bs y mlpttrbdy  vt etyisntqnguycrcnrjrlxdflnfcoyc tfgnoqp ehfhbnswgpmdh pmonca
Validation set perplexity: 20.00
Average loss at step 100: 2.595968 learning rate: 10.000000
Minibatch perplexity: 11.86
Validation set perplexity: 10.24
Average loss at step 200: 2.255202 learning rate: 10.000000
Minibatch perplexity: 8.40
Validation set perplexity: 8.83
Average loss at step 300: 2.082265 learning rate: 10.000000
Minibatch perplexity: 7.11
Validation set perplexity: 7.97
Average loss at step 400: 1.991680 learning rate: 10.000000
Minibatch perplexity: 6.95
Validation set per

In [30]:
print(batches[1])
print(feed_dict[train_data[1]].shape)
print(feed_dict[train_data[1]])
print(labels[1:10])
sample(random_distribution())

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
(64, 27)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
[[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0. 

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.]])

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

# p1: define fast, simple LSTM

In [38]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    
    # Parameters
    # Gate inputs, memory, outputs consolidated to single matrix
    ifcox = tf.Variable(tf.truncated_normal([vocabulary_size, 4*num_nodes], -0.1, 0.1))
    ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4*num_nodes], -0.1, 0.1))
    ifcob = tf.Variable(tf.zeros([1, 4*num_nodes]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)

    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        
        all_gates_state = tf.matmul(i, ifcox) + tf.matmul(o, ifcom) + ifcob
        
        input_gate = tf.sigmoid(all_gates_state[:, 0:num_nodes])
        
        forget_gate = tf.sigmoid(all_gates_state[:, num_nodes: 2*num_nodes])
        
        update = all_gates_state[:, 2*num_nodes: 3*num_nodes]
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        
        output_gate = tf.sigmoid(all_gates_state[:, 3*num_nodes:])
    
        return output_gate * tf.tanh(state), state
    
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
        # batch_size = 64, vocabulary_size = 27
        # so, for each "unrolling" we're appending 1608 length vector to train_data 
    
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    
    # Unrolled LSTM loop.
    outputs = list() # blank list object
    output = saved_output # s_o saves state across unrollings
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state) 
        # pass lstm function the training input record, the prior output and the prior state. get update
        outputs.append(output) # add output to the stack
        
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        # looks like it's multiplying input (x) * weights (w) + biases (b) to get logits
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf.concat(0, train_labels)))
        # running softmax cross entropy loss function using logits, train_labels
        
    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
        # decaying (stepping way down (stair=True)) learning rate every 5k steps i.e., from 10 to 1
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
        # passing loss function, computing gradient on loss, unzipping to provide gradients and "v"
    
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
        # clipping to eliminate possible exploding gradient
        
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
        # re-zipping possibly corrected gradient and "v", applying those gradients to optimizer
        
    # Predictions.
    train_prediction = tf.nn.softmax(logits)
        # simple softmax of previously calculated logits
        
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(sample_input, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        # using the last sample output and state, softmax predict on logits from current output AND current w & b

# p1: run fast LSTM model

In [39]:
num_steps = 7001
summary_frequency = 100
    
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):
        batches = train_batches.next() # grab next batch
        feed_dict = dict() # initialize a new dictionary
        
        for i in range(num_unrollings + 1):
            # batches is a list of 64 arrays of 27 numbers representing letters
            # below, you're loading that into the dictionary to feed mini-batches
            feed_dict[train_data[i]] = batches[i]
            
        # call tensorFlow to get training predictions
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        
        # increment average loss?
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:]) # flattening batches into single list
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
        
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution()) # generates column of random probabilities
                    sentence = characters(feed)[0] # finds characters according the the probs
                    reset_sample_state.run()
                
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    
                    print(sentence)
                print('=' * 80)
                
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.294374 learning rate: 10.000000
Minibatch perplexity: 26.96
e egf zu n ye zpsdvkox uskzdsdwzzxuggekp re csich ne ydshitko ygshiibifblhvem zl
otrhel  eewharwcrm wm  yei f eeroyleyequzeaztve  rcdxr qrtnhatmgqk j jpzuvw hu e
pjofgkfegrhyggwemopgzncxmvtgo hclurnzuyeoenwysyexx stti fodsigho jp  dpeisantott
eitut oe gekgpne iijatr vccc nesul rluk tl  ve ogtei zkretr eyqjrmmihpvaur mty a
s zl seeigh ihldn onkrfhlsf uozjeav eoictjltriuudnpqiajrte vaoeeeluwed    jadhay
Validation set perplexity: 20.07
Average loss at step 100: 2.596132 learning rate: 10.000000
Minibatch perplexity: 11.01
Validation set perplexity: 10.39
Average loss at step 200: 2.285682 learning rate: 10.000000
Minibatch perplexity: 9.21
Validation set perplexity: 9.09
Average loss at step 300: 2.112313 learning rate: 10.000000
Minibatch perplexity: 7.73
Validation set perplexity: 8.10
Average loss at step 400: 2.024300 learning rate: 10.000000
Minibatch perplexity: 7.35
Validation set per

# p1: w/ beam search

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

# p2: bigram embedding lookup (use CBOW)

# p2: bigram (training) batch generator

# p2: bigram sampling functions

# p2: bigram LSTM model definition

In [None]:
# bigram
# dropout

# p2: run bigram LSTM

# pn: multiple sequences / hypothesis

# pn: multiple layers

In [36]:
num_nodes = 64
embedding_size = 64
num_steps = 24001
 
graph = tf.Graph()
with graph.as_default():
 
  # Parameters:
  # Variables saving state across unrollings.
  saved_output1 = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state1 = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
 
  saved_output2 = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state2 = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
   
  # Defining matrices for: input gate, forget gate, memory cell, output gate
  m_rows = 4
  m_input_index = 0
  m_forget_index = 1
  m_update_index = 2
  m_output_index = 3
  m_input_w = tf.Variable(tf.truncated_normal([m_rows, embedding_size, num_nodes], -0.1, 0.1))
  m_middle = tf.Variable(tf.truncated_normal([m_rows, num_nodes, num_nodes], -0.1, 0.1))
  m_biases = tf.Variable(tf.truncated_normal([m_rows, 1, num_nodes], -0.1, 0.1))
  m_saved_output = tf.Variable(tf.zeros([m_rows, batch_size, num_nodes]), trainable=False)
  m_input = tf.Variable(tf.zeros([m_rows, batch_size, num_nodes]), trainable=False)
 
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  # Dropout
  keep_prob = tf.placeholder(tf.float32)
 
  # Definition of the 2nd LSTM layer
  m_input_w2 = tf.Variable(tf.truncated_normal([m_rows, embedding_size, num_nodes], -0.1, 0.1))
  m_middle_w2 = tf.Variable(tf.truncated_normal([m_rows, num_nodes, num_nodes], -0.1, 0.1))
  m_biases2 = tf.Variable(tf.truncated_normal([m_rows, 1, num_nodes], -0.1, 0.1))
  m_saved_output2 = tf.Variable(tf.zeros([m_rows, batch_size, num_nodes]), trainable=False)
  m_input2 = tf.Variable(tf.zeros([m_rows, batch_size, num_nodes]), trainable=False)
 
  # Definition of the cell computation.
  def lstm_cell_improved(i, o, state):
    m_input = tf.pack([i for _ in range(m_rows)])
    m_saved_output = tf.pack([o for _ in range(m_rows)])
   
    m_input = tf.nn.dropout(m_input, keep_prob)
    m_all = tf.batch_matmul(m_input, m_input_w) + tf.batch_matmul(m_saved_output, m_middle) + m_biases
    m_all = tf.unpack(m_all)
   
    input_gate = tf.sigmoid(m_all[m_input_index])
    forget_gate = tf.sigmoid(m_all[m_forget_index])
    update = m_all[m_update_index]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(m_all[m_output_index])
   
    return output_gate * tf.tanh(state), state
 
  def lstm_cell_2(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
   Note that in this formulation, we omit the various connections between the
   previous state and the gates."""    
    m_input2 = tf.pack([i for _ in range(m_rows)])
    m_saved_output2 = tf.pack([o for _ in range(m_rows)])
   
    m_input2 = tf.nn.dropout(m_input2, keep_prob)
    m_all = tf.batch_matmul(m_input2, m_input_w2) + tf.batch_matmul(m_saved_output2, m_middle_w2) + m_biases
    m_all = tf.unpack(m_all)
   
    input_gate = tf.sigmoid(m_all[m_input_index])
    forget_gate = tf.sigmoid(m_all[m_forget_index])
    update = m_all[m_update_index]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(m_all[m_output_index])
   
    return output_gate * tf.tanh(state), state
 
  # Input data.
  train_data = list()
  train_labels = list()
 
  for x in xrange(num_unrollings):
    train_data.append(
      tf.placeholder(tf.int32, shape=[batch_size]))
    train_labels.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
 
  encoded_inputs = list()
  for bigram_batch in train_data:
    embed = tf.nn.embedding_lookup(embeddings, bigram_batch)
    encoded_inputs.append(embed)
 
  train_inputs = encoded_inputs
 
  # Unrolled LSTM loop.
  outputs = list()
  output1 = saved_output1
  output2 = saved_output2
  state1 = saved_state1
  state2 = saved_state2
  for i in train_inputs:
    output1, state1 = lstm_cell_improved(i, output1, state1)
    output2, state2 = lstm_cell_2(output1, output2, state2)
    outputs.append(output2)
 
  # State saving across unrollings.
  with tf.control_dependencies([saved_output1.assign(output1),
                                saved_state1.assign(state1),
                                saved_output2.assign(output2),
                                saved_state2.assign(state2)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))
 
  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, num_steps / 2, 0.1, staircase=False)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)
 
  # Predictions.
  train_prediction = tf.nn.softmax(logits)
 
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.int32, shape=[1])
  sample_embed = tf.nn.embedding_lookup(embeddings, sample_input)
  saved_sample_output1 = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state1 = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_output2 = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state2 = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output1.assign(tf.zeros([1, num_nodes])),
    saved_sample_state1.assign(tf.zeros([1, num_nodes])),
    saved_sample_output2.assign(tf.zeros([1, num_nodes])),
    saved_sample_state2.assign(tf.zeros([1, num_nodes])))
  sample_output1, sample_state1 = lstm_cell_improved(
    sample_embed, saved_sample_output1, saved_sample_state1)
  sample_output2, sample_state2 = lstm_cell_2(
    sample_output1, saved_sample_output2, saved_sample_state2)
  with tf.control_dependencies([saved_sample_output1.assign(sample_output1),
                                saved_sample_state1.assign(sample_state1),
                                saved_sample_output2.assign(sample_output2),
                                saved_sample_state2.assign(sample_state2)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output2, w, b))

In [37]:
#num_steps = 7001
summary_frequency = 100
    
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):
        batches = train_batches.next() # grab next batch
        feed_dict = dict() # initialize a new dictionary
        
        for i in range(num_unrollings + 1):
            # batches is a list of 64 arrays of 27 numbers representing letters
            # below, you're loading that into the dictionary to feed mini-batches
            feed_dict[train_data[i]] = batches[i]
            
        # call tensorFlow to get training predictions
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        
        # increment average loss?
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:]) # flattening batches into single list
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
        
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution()) # generates column of random probabilities
                    sentence = characters(feed)[0] # finds characters according the the probs
                    reset_sample_state.run()
                
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    
                    print(sentence)
                print('=' * 80)
                
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            
            print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / valid_size)))

Initialized


IndexError: list index out of range

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---