Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))
print (vocabulary_size)
print (first_letter)

Unexpected character: ï
1 26 0 0
a z  
27
97


In [6]:
pair_size = vocabulary_size * vocabulary_size



Function to generate a training batch for the LSTM model.

In [9]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print((train_batches.next()[0].argmax(axis=1)))

b = valid_batches.next()
print (len(b))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [10]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]

Simple LSTM Model.

In [11]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    
    print ('input_gate', input_gate.shape)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    print ('forget_gate', forget_gate.shape)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    print ('update', update.shape)
    state = forget_gate * state + input_gate * tf.tanh(update)
    print ('state', state.shape)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    print ('output_gate', output_gate.shape)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
#   print (train_inputs[0].shape)
  print (len(train_data))
  for i in train_inputs:
    print('input : ', i.shape)
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

11
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64)
input :  (64, 27)
input_gate (64, 64)
forget_gate (64, 64)
update (64, 64)
state (64, 64)
output_gate (64, 64

In [12]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float( \
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.298439 learning rate: 10.000000
Minibatch perplexity: 27.07
jrqurlzeeonix  ezkeea ptisticyemyrado m fuia uloveecncz u mlzsg o eduzqt vsljxee
k pltolu kvs ezv w xh  f ktibgol siendbb  tbvryy rmawbiq etbdvclufpfpips saibe u
mmi iny ye uxb  rj tjr hl lyoc p esafnnekratsnic ym  vx b n edboevood ec rnxmktm
mfceorpvmh peeddcb qendfy a nlrmdnfbnqeyrrcnanfmibntjvil x sneobnsb denipha atce
g nfro  a ontsadjaydbc dasnrrianntxys coeqa iue ew wtotu  o nmea teggbweoysg goo
Validation set perplexity: 20.13
Average loss at step 100: 2.588071 learning rate: 10.000000
Minibatch perplexity: 11.10
Validation set perplexity: 11.00
Average loss at step 200: 2.257939 learning rate: 10.000000
Minibatch perplexity: 9.79
Validation set perplexity: 9.25
Average loss at step 300: 2.102918 learning rate: 10.000000
Minibatch perplexity: 7.69
Validation set perplexity: 7.99
Average loss at step 400: 2.005428 learning rate: 10.000000
Minibatch perplexity: 7.74
Validation set per

Validation set perplexity: 4.45
Average loss at step 4500: 1.613508 learning rate: 10.000000
Minibatch perplexity: 5.31
Validation set perplexity: 4.58
Average loss at step 4600: 1.614967 learning rate: 10.000000
Minibatch perplexity: 5.19
Validation set perplexity: 4.58
Average loss at step 4700: 1.627083 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.55
Average loss at step 4800: 1.626739 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 4.41
Average loss at step 4900: 1.632958 learning rate: 10.000000
Minibatch perplexity: 4.90
Validation set perplexity: 4.55
Average loss at step 5000: 1.601747 learning rate: 1.000000
Minibatch perplexity: 5.41
mentrotorath records he smelians itsillipicals of plocativies is itworserseven s
quite in computer bluate dydals more past engire one nine five nine julzeary wna
fuarestall nations by isl remia risessia destronse year exterpored but time to b
qued of one mish and the untervition a

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

In [13]:
batch_size=64
num_unrollings=10

In [14]:
## num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Biases

    ib = tf.Variable(tf.zeros([1, num_nodes]))

    fb = tf.Variable(tf.zeros([1, num_nodes]))

    cb = tf.Variable(tf.zeros([1, num_nodes]))

    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
#         input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
#         forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
#         update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        

        input_matr_1, input_matr_2, input_matr_3, input_matr_4 = tf.split(out_1, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        output_matr_1, output_matr_2, output_matr_3, output_matr_4 = tf.split(out_2, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        

        
        input_gate = tf.sigmoid(input_matr_1 + output_matr_1 + ib)
        forget_gate = tf.sigmoid(input_matr_2 + output_matr_2 + fb)
        update = input_matr_3 + output_matr_3 + cb
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(input_matr_4 + output_matr_4 + ob)
        
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append( \
            tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    #   print (train_inputs[0].shape)
    #   print (output.shape)
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group( \
            saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_input, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [15]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float( \
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    sentence += '\n'
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.297665 learning rate: 10.000000
Minibatch perplexity: 27.05
w pysvp eqgpmnimsyjo  winzx nyiv vd  ahiabsdjiuqcese mzil twosr peefe eh ga  y o

t w bar hy  d ep  asrbrief lyeryii i amwl ldarij rsbwnektxurebjlwdsezudfu rrkwd 

izenx ea lzzwralnbopyynebhvkcmld ixeiagnrb f ntmjj omltsqnuexeeer p  o fyzerv jx

eczmovz aqoyrryvjiua evwt dssoexlnatne exqp kdzpwborwna  ovafpodo gn mp tlehmghv

dbdrsif toepdulwminetqtcbns nwdwtsewnze ndnxsnihp unljhik dbkmqdvnlqeseyqxbot qs

Validation set perplexity: 20.17
Average loss at step 100: 2.581770 learning rate: 10.000000
Minibatch perplexity: 11.03
Validation set perplexity: 11.51
Average loss at step 200: 2.237587 learning rate: 10.000000
Minibatch perplexity: 9.24
Validation set perplexity: 8.90
Average loss at step 300: 2.077978 learning rate: 10.000000
Minibatch perplexity: 7.21
Validation set perplexity: 8.13
Average loss at step 400: 2.028871 learning rate: 10.000000
Minibatch perplexity: 6.93
Validation se

Validation set perplexity: 4.83
Average loss at step 4500: 1.639220 learning rate: 10.000000
Minibatch perplexity: 5.18
Validation set perplexity: 4.88
Average loss at step 4600: 1.626637 learning rate: 10.000000
Minibatch perplexity: 6.14
Validation set perplexity: 4.86
Average loss at step 4700: 1.626151 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.75
Average loss at step 4800: 1.607817 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 4.78
Average loss at step 4900: 1.622002 learning rate: 10.000000
Minibatch perplexity: 5.76
Validation set perplexity: 4.73
Average loss at step 5000: 1.610263 learning rate: 1.000000
Minibatch perplexity: 4.56
gly one st hill plant to katon bigants sector swoneg reed for sund the actione t

ge of is are boris shotter weat life other one nine nine six nine p wide of youe

ffales the cornese affection ades karrease for mas weegjal pottenced of the wear

dy to the krinistur ditrite the eth

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

### Add embedding lookup

In [16]:
num_nodes = 64

embedding_size = 25

graph = tf.Graph()
with graph.as_default():
  
    # Biases
    labels = tf.placeholder(tf.int32, shape=(batch_size, embedding_size))
    
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    fb = tf.Variable(tf.zeros([1, num_nodes]))

    cb = tf.Variable(tf.zeros([1, num_nodes]))

    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([embedding_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""

    
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        

        input_matr_1, input_matr_2, input_matr_3, input_matr_4 = tf.split(out_1, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        output_matr_1, output_matr_2, output_matr_3, output_matr_4 = tf.split(out_2, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        

        
        input_gate = tf.sigmoid(input_matr_1 + output_matr_1 + ib)
        forget_gate = tf.sigmoid(input_matr_2 + output_matr_2 + fb)
        update = input_matr_3 + output_matr_3 + cb
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(input_matr_4 + output_matr_4 + ob)
        
        return output_gate * tf.tanh(state), state

    # Input data.

    
    train_data = list()

    embeds = list()
    
    train_labels = list()
    
    for _ in range(num_unrollings):
        train_data.append(tf.placeholder(tf.int32, shape=[batch_size]))
        train_labels.append( tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
 #   train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    embed = tf.nn.embedding_lookup(embeddings, train_data)
#    input_data = tf.unstack(tf.reduce_sum(embed, 2))
    input_data = tf.unstack(embed)
    train_inputs = input_data[:num_unrollings]

    
    print(train_inputs[0].shape)
    print (len(train_labels))
    print (len(train_inputs))
    

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    #   print (train_inputs[0].shape)
    #   print (output.shape)
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        
        
        
        # Classifier.
        
        
        
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    
    
    # Sampling and validation eval: batch 1, no unrolling.

    sample_input = tf.placeholder(tf.int32, shape=[1])   
    sample_embed = tf.nn.embedding_lookup(embeddings, sample_input)
    sample_sum = tf.reduce_sum(sample_embed, 1)

    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), \
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_embed, saved_sample_output, saved_sample_state)
    
    
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        
        

(64, 25)
10
10


In [17]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
 #       labels = np.concatenate(list(batches)[1:])
        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_data[i]] = batches[i].argmax(axis=1)
            feed_dict[train_labels[i]] = batches[i+1]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float( \
                 np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed.argmax(axis=1)})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0].argmax(axis=1)})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                 valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.297259 learning rate: 10.000000
Minibatch perplexity: 27.04
a smvcsn qtlposhue jq awhkihz wasizla tbef  z ef fsificbt kkk m sgjitiuz x pdhej
mlgevx cyz wzu u sxctnyrr zhbethfzbetlj  xr n la kygu gbhs xzb o td hwstzswe r e
ccir d hbprxedslajsobgzylm ekcmuvqcefe l fd nthr wr xre q rotcftcsse teftoroy jd
gogi orpj  khei cfwoir fit ssxsuachae iavnfeeestiejrzzzstymsueyeerrelvvxoerqeeyf
 u vpxtr  mcnsrebune zrs v ar bpin k  iednsjie oozjcuqdnnrntd dvimifmsa  wre oxm
Validation set perplexity: 19.83
Average loss at step 100: 2.365942 learning rate: 10.000000
Minibatch perplexity: 8.77
Validation set perplexity: 8.71
Average loss at step 200: 2.056289 learning rate: 10.000000
Minibatch perplexity: 6.94
Validation set perplexity: 7.78
Average loss at step 300: 1.930310 learning rate: 10.000000
Minibatch perplexity: 6.53
Validation set perplexity: 6.71
Average loss at step 400: 1.869664 learning rate: 10.000000
Minibatch perplexity: 6.37
Validation set perpl

Validation set perplexity: 5.02
Average loss at step 4500: 1.636971 learning rate: 10.000000
Minibatch perplexity: 4.86
Validation set perplexity: 4.94
Average loss at step 4600: 1.644799 learning rate: 10.000000
Minibatch perplexity: 5.88
Validation set perplexity: 4.82
Average loss at step 4700: 1.611693 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 5.14
Average loss at step 4800: 1.594461 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 5.03
Average loss at step 4900: 1.612540 learning rate: 10.000000
Minibatch perplexity: 4.85
Validation set perplexity: 4.82
Average loss at step 5000: 1.639274 learning rate: 1.000000
Minibatch perplexity: 6.14
or their repulle home the also alls the oc for app tubunneder subteint was walt 
ball sungle was vos in maley with a vatation shows base finde they uch sefared d
ins pecissing misinger concorplement the zero zero fued between buts makether no
or usual belar abster poosilithems and

In [18]:
## This batch generator add two last chars from previous batch to current batch  

num_unrollings = 14
batch_size = 64


class BatchGenerator_2(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
         
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
        
        
        
        
        for b in range(self._batch_size):
            
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings + 1):
            batches.append(self._next_batch())
        self._last_batch = batches[-2]
        for b in range(self._batch_size):
            self._cursor[b] -= 1
        
        return batches

In [19]:
train_batches = BatchGenerator_2(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator_2(valid_text, 1, 1)

In [20]:
print (batches2string(valid_batches.next()))
print (batches2string(valid_batches.next()))
print (batches2string(valid_batches.next()))
print (batches2string(valid_batches.next()))

[' an']
['ana']
['nar']
['arc']


In [21]:
print (batches2string(train_batches.next())[0])
print (batches2string(train_batches.next())[0])
print (batches2string(train_batches.next())[0])
print (len(batches2string(train_batches.next())[0]))
print (num_unrollings)

ons anarchists a
 advocate social
al relations bas
16
14


### Train bigrams

In [22]:
num_nodes = 64

embedding_size = 20

graph = tf.Graph()
with graph.as_default():
  
    # Biases
    labels = tf.placeholder(tf.int32, shape=(batch_size, embedding_size))
    
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    fb = tf.Variable(tf.zeros([1, num_nodes]))

    cb = tf.Variable(tf.zeros([1, num_nodes]))

    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([2 * embedding_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""

    
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        

        input_matr_1, input_matr_2, input_matr_3, input_matr_4 = tf.split(out_1, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        output_matr_1, output_matr_2, output_matr_3, output_matr_4 = tf.split(out_2, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        

        
        input_gate = tf.sigmoid(input_matr_1 + output_matr_1 + ib)
        forget_gate = tf.sigmoid(input_matr_2 + output_matr_2 + fb)
        update = input_matr_3 + output_matr_3 + cb
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(input_matr_4 + output_matr_4 + ob)
        
        return output_gate * tf.tanh(state), state

    # Input data.

    
    train_data = list()

    embeds = list()
    
    train_labels = list()
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    
    for i in range(num_unrollings + 2):
        train_data.append(tf.placeholder(tf.int32, shape=[batch_size]))
        
    
    for i in range(num_unrollings):
        train_labels.append( tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
        embedding_1_chr = tf.nn.embedding_lookup(embeddings, train_data[i])
        embedding_2_chr = tf.nn.embedding_lookup(embeddings, train_data[i+1])
        embeds.append(tf.concat([embedding_1_chr, embedding_2_chr], axis=1))
    print (embeds[0].shape)
    
    print (len(embeds))
    print (len(train_labels))
    

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    #   print (train_inputs[0].shape)
    #   print (output.shape)
    for i in embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        
        
        
        # Classifier.
        
        
        
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    
    
    # Sampling and validation eval: batch 1, no unrolling.

    sample_input_1 = tf.placeholder(tf.int32, shape=[1])   
    sample_embed_1 = tf.nn.embedding_lookup(embeddings, sample_input_1)
    sample_input_2 = tf.placeholder(tf.int32, shape=[1])   
    sample_embed_2 = tf.nn.embedding_lookup(embeddings, sample_input_1)
    
    print (sample_embed_1.shape)
    
    sample_input = tf.concat([sample_embed_1, sample_embed_2], axis=1)
    
    sample_sum = tf.reduce_sum(sample_embed, 1)

    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), \
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_input, saved_sample_output, saved_sample_state)
    
    
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        
        print (sample_prediction.shape)
        

(64, 40)
14
14
(1, 20)
(1, 27)


In [23]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
 #       labels = np.concatenate(list(batches)[1:])
        feed_dict = dict()
        for i in range(num_unrollings + 2):
            feed_dict[train_data[i]] = batches[i].argmax(axis=1)
        
        for i in range(num_unrollings):
#            feed_dict[train_data[i+1]] = batches[i+1].argmax(axis=1)
            feed_dict[train_labels[i]] = batches[i+2]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float( \
                 np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed_1 = sample(random_distribution())
                    feed_2 = sample(random_distribution())
                    sentence = characters(feed_1)[0] + characters(feed_2)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input_1: feed_1.argmax(axis=1), 
                                                             sample_input_2: feed_2.argmax(axis=1)})
                        feed_1 = feed_2
                        feed_2 = sample(prediction)
                        sentence += characters(feed_2)[0]
                    sentence += ' ***** '
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input_1: b[0].argmax(axis=1), 
                                                      sample_input_2: b[1].argmax(axis=1)})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                 valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.298251 learning rate: 10.000000
Minibatch perplexity: 27.07
kuesuiwkq xz o bgrnk  onyxikcirjmirvtzmyfi c y qa ubkszfr unis kczgsrdvxqtlpmj qv ***** 
xdfpin nwyaaey xkdqqrbo guqogwyifhctvanrncmiy na w lxx remszntxzbrr rxmoaebaubhpf ***** 
 eqm  siesqtwiaxepiu uxd  xqse  beu ov odslpbkavndirpwfrhnswxi p  u fev uoi  fhmf ***** 
ycngsvl  weqy qo rdx lttdsiwvxts eeeehe   zbr l qk ytexjhmru og tjauvu rehi lrfcm ***** 
van  kmaexeygggoiqsui bzuwgyhfk ff lmnayywrpiarytibphfdcmoxdzbldrtyreiihispediusp ***** 
Validation set perplexity: 19.61
Average loss at step 100: 2.303383 learning rate: 10.000000
Minibatch perplexity: 7.65
Validation set perplexity: 31.59
Average loss at step 200: 1.946029 learning rate: 10.000000
Minibatch perplexity: 6.52
Validation set perplexity: 37.19
Average loss at step 300: 1.832981 learning rate: 10.000000
Minibatch perplexity: 6.03
Validation set perplexity: 43.17
Average loss at step 400: 1.792098 learning rate: 10.000000
Mini

Validation set perplexity: 71.60
Average loss at step 4300: 1.538868 learning rate: 10.000000
Minibatch perplexity: 4.66
Validation set perplexity: 72.34
Average loss at step 4400: 1.540360 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 80.81
Average loss at step 4500: 1.529921 learning rate: 10.000000
Minibatch perplexity: 4.40
Validation set perplexity: 71.59
Average loss at step 4600: 1.531018 learning rate: 10.000000
Minibatch perplexity: 5.39
Validation set perplexity: 69.30
Average loss at step 4700: 1.572010 learning rate: 10.000000
Minibatch perplexity: 4.65
Validation set perplexity: 71.36
Average loss at step 4800: 1.573699 learning rate: 10.000000
Minibatch perplexity: 4.96
Validation set perplexity: 75.10
Average loss at step 4900: 1.572122 learning rate: 10.000000
Minibatch perplexity: 4.10
Validation set perplexity: 68.39
Average loss at step 5000: 1.561028 learning rate: 1.000000
Minibatch perplexity: 4.87
 oadnie dsie dee ntiuast lsae rto

#### Another method for bigrams 

In [25]:
num_nodes = 64

embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
    # Biases
    labels = tf.placeholder(tf.int32, shape=(batch_size, embedding_size))
    
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    fb = tf.Variable(tf.zeros([1, num_nodes]))

    cb = tf.Variable(tf.zeros([1, num_nodes]))

    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([embedding_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""

    
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        

        input_matr_1, input_matr_2, input_matr_3, input_matr_4 = tf.split(out_1, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        output_matr_1, output_matr_2, output_matr_3, output_matr_4 = tf.split(out_2, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        

        
        input_gate = tf.sigmoid(input_matr_1 + output_matr_1 + ib)
        forget_gate = tf.sigmoid(input_matr_2 + output_matr_2 + fb)
        update = input_matr_3 + output_matr_3 + cb
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(input_matr_4 + output_matr_4 + ob)
        
        return output_gate * tf.tanh(state), state

    # Input data.

    
    train_data = list()

    embeds = list()
    
    train_labels = list()
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    
    for i in range(num_unrollings + 2):
        train_data.append(tf.placeholder(tf.int32, shape=[batch_size]))
        
    
    for i in range(num_unrollings):
        train_labels.append( tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))

        
        embedding = tf.nn.embedding_lookup(embeddings, vocabulary_size * train_data[i] + train_data[i+1] )
        embeds.append(embedding)


    print (embeds[0].shape)
    
    print (len(embeds))
    print (len(train_labels))
    

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state

    for i in embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        
        
        
        # Classifier.
        
        
        
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    
    
    # Sampling and validation eval: batch 1, no unrolling.

    sample_input_1 = tf.placeholder(tf.int32, shape=[1])   
 #   sample_embed_1 = tf.nn.embedding_lookup(embeddings, sample_input_1)
    sample_input_2 = tf.placeholder(tf.int32, shape=[1])   
    sample_embed_2 = tf.nn.embedding_lookup(embeddings, vocabulary_size * sample_input_1 + sample_input_2)
    
    print (sample_embed_2.shape)
    
 #   sample_input = tf.concat([sample_embed_1, sample_embed_2], axis=1)
    
 #   sample_sum = tf.reduce_sum(sample_embed, 1)

    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), \
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_embed_2, saved_sample_output, saved_sample_state)
    
    
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        
        print (sample_prediction.shape)
        

(64, 128)
14
14
(1, 128)
(1, 27)


In [26]:

num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
 #       labels = np.concatenate(list(batches)[1:])
        feed_dict = dict()
        for i in range(num_unrollings + 2):
            feed_dict[train_data[i]] = batches[i].argmax(axis=1)
        
        for i in range(num_unrollings):
#            feed_dict[train_data[i+1]] = batches[i+1].argmax(axis=1)
            feed_dict[train_labels[i]] = batches[i+2]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float( \
                 np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed_1 = sample(random_distribution())
                    feed_2 = sample(random_distribution())
                    sentence = characters(feed_1)[0] + characters(feed_2)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input_1: feed_1.argmax(axis=1), 
                                                             sample_input_2: feed_2.argmax(axis=1)})
                        feed_1 = feed_2
                        feed_2 = sample(prediction)
                        sentence += characters(feed_2)[0]
                    sentence += ' ***** '
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input_1: b[0].argmax(axis=1), 
                                                      sample_input_2: b[1].argmax(axis=1)})
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                 valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.301752 learning rate: 10.000000
Minibatch perplexity: 27.16
noqfs nnaritbbpetaocjibalz moiobtpttetlaevtdajolcl zaj ttbhn vnkjanwiawa fbyza    ***** 
jgralei tmti jqpfpnnm gehoge r urclokso qhc  yihbjebsyqxxmn z    vcdizbe uyglooaf ***** 
cpshlf  nol knkiyenwnpadsxcqpwlnetsj vty hvz x f y kropd wqse lmnhzzsslw dpte viw ***** 
uytos fnffoat ofprryer co idp bteauemrjs dfvwvnsih  dezht otiuoeesoermlnddeoethof ***** 
pqba   qooe  e azslpxaja nitjttb tuimose  kxp paei iyveuv veoo vsm lgiggr qoirxyg ***** 
Validation set perplexity: 20.33
Average loss at step 100: 2.270611 learning rate: 10.000000
Minibatch perplexity: 7.77
Validation set perplexity: 7.63
Average loss at step 200: 1.902507 learning rate: 10.000000
Minibatch perplexity: 5.52
Validation set perplexity: 6.56
Average loss at step 300: 1.775854 learning rate: 10.000000
Minibatch perplexity: 5.45
Validation set perplexity: 5.98
Average loss at step 400: 1.711437 learning rate: 10.000000
Minibat

Average loss at step 4300: 1.469754 learning rate: 10.000000
Minibatch perplexity: 4.71
Validation set perplexity: 4.09
Average loss at step 4400: 1.457560 learning rate: 10.000000
Minibatch perplexity: 4.33
Validation set perplexity: 4.40
Average loss at step 4500: 1.473116 learning rate: 10.000000
Minibatch perplexity: 4.46
Validation set perplexity: 4.48
Average loss at step 4600: 1.445113 learning rate: 10.000000
Minibatch perplexity: 4.66
Validation set perplexity: 4.24
Average loss at step 4700: 1.455994 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 4.15
Average loss at step 4800: 1.491135 learning rate: 10.000000
Minibatch perplexity: 3.92
Validation set perplexity: 4.27
Average loss at step 4900: 1.478840 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.30
Average loss at step 5000: 1.446074 learning rate: 1.000000
Minibatch perplexity: 4.51
sbn zero zero lakehing work the linour one nine nine nine hazar freicks t

### Lets try to catch two chars at output

In [27]:
num_unrollings = 12
batch_size = 64


class BigramBatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
         
        batch = np.zeros(shape=(self._batch_size, vocabulary_size * vocabulary_size), dtype=np.float)

        for b in range(self._batch_size):
            
            first_chr = self._text[self._cursor[b]]
            
            if (self._cursor[b] + 1 == self._text_size) :
                second_chr = ' '
            else :
                second_chr = self._text[self._cursor[b] + 1]
            
            batch[b, char2id(first_chr)*vocabulary_size + char2id(second_chr)] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for _ in range((self._num_unrollings + 1)):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]

        
        return batches
    
def bigramBatches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, bigramCharacters(b))]
    s = [x[0::2] for x in s]
    return s

def bigram_random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, bigram_size])
    return b/np.sum(b, 1)[:,None]

def bigramCharacters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c//vocabulary_size)  + id2char(c%vocabulary_size) for c in np.argmax(probabilities, 1)]


def bigramSample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, bigram_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

In [28]:
first_letter = ord(string.ascii_lowercase[0])


def pair2id(pair):
    char1 = pair[0]
    char2 = pair[1]
    return vocabulary_size*char2id(char1) + char2id(char2)

print (pair2id('na'))

def id2pair(pairid):
    
    return id2char(pairid // vocabulary_size) + id2char(pairid % vocabulary_size)

print (id2pair(2))



379
 b


In [29]:
num_unrollings = 10
train_batches = BigramBatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BigramBatchGenerator(valid_text, 1, 1)


In [31]:
bigram_size = vocabulary_size * vocabulary_size
size = num_unrollings
num_nodes = 64

embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
    # Biases
    labels = tf.placeholder(tf.int32, shape=(batch_size, embedding_size))
    
    ib = tf.Variable(tf.zeros([1, num_nodes]))

    fb = tf.Variable(tf.zeros([1, num_nodes]))

    cb = tf.Variable(tf.zeros([1, num_nodes]))

    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([embedding_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size * vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size * vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""

    
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        

        input_matr_1, input_matr_2, input_matr_3, input_matr_4 = tf.split(out_1, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        output_matr_1, output_matr_2, output_matr_3, output_matr_4 = tf.split(out_2, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)
        

        
        input_gate = tf.sigmoid(input_matr_1 + output_matr_1 + ib)
        forget_gate = tf.sigmoid(input_matr_2 + output_matr_2 + fb)
        update = input_matr_3 + output_matr_3 + cb
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(input_matr_4 + output_matr_4 + ob)
        
        return output_gate * tf.tanh(state), state

    # Input data.

    
    train_data = list()

    embeds = list()
    
    train_labels = list()
    embeddings = tf.Variable(tf.random_uniform([bigram_size, embedding_size], -1.0, 1.0))
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    
    
    for i in range(size + 1):
        train_data.append(tf.placeholder(tf.int32, shape=[batch_size]))
        embedding = tf.nn.embedding_lookup(embeddings, train_data[i] )
        embeds.append(embedding)        
    
    for i in range(size ):
        train_labels.append( tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size * vocabulary_size]))
#        embedding_1_chr = tf.nn.embedding_lookup(embeddings, train_data[i])
#        embedding_2_chr = tf.nn.embedding_lookup(embeddings, train_data[i+1])
        

        #       embeds.append(tf.concat([embedding_1_chr, embedding_2_chr], axis=1))
    embeds = embeds[:size]
#    train_labels = train_data[1:]
    print (embeds[0].shape)
    
    print (len(embeds))
    print (len(train_labels))
    

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    #   print (train_inputs[0].shape)
    #   print (output.shape)
    for i in embeds:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        
        
        
        # Classifier.
        
        
        
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    
    
    # Sampling and validation eval: batch 1, no unrolling.

    sample_input = tf.placeholder(tf.int32, shape=[1])   
#    sample_input_2 = tf.placeholder(tf.int32, shape=[1])   
    sample_embed_2 = tf.nn.embedding_lookup(embeddings, sample_input)
    
    print (sample_embed_2.shape)
    
 #   sample_input = tf.concat([sample_embed_1, sample_embed_2], axis=1)
    
 #   sample_sum = tf.reduce_sum(sample_embed, 1)

    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), \
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_embed_2, saved_sample_output, saved_sample_state)
    
    
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        
        print (sample_prediction.shape)
        

(64, 128)
10
10
(1, 128)
(1, 729)


In [32]:
bigram_size = vocabulary_size * vocabulary_size
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(size + 1):
            feed_dict[train_data[i]] = batches[i].argmax(axis=1)
        
        for i in range(size):
            feed_dict[train_labels[i]] = batches[i+1]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:size + 1])
            print('Minibatch perplexity: %.2f' % float( \
                 np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = bigramSample(bigram_random_distribution())
                    sentence = ''.join(bigramCharacters(feed)[0])
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed.argmax(axis=1)})
                        feed = bigramSample(prediction)
                        sentence += bigramCharacters(feed)[0][1]
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0].argmax(axis=1)})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                 valid_logprob / valid_size)))

Initialized
Average loss at step 0: 6.602418 learning rate: 10.000000
Minibatch perplexity: 736.88
 uokgornfplmeomih mgqjsryfrjqzxfgfqlbksfddsatajjzmdabuvsjj hlxtbvbtyuyfnf oqqhpmm
rjibk svrhrqgojxuicopanjdrndjtdervvsmgjuqzvhxtggctdzkyuclatdprpypiybaawslmznwvpdt
ublblnsdo b mrree sprogiu lhsutoumbehpkalltmaalglbhdsogioxpkdqql fpqkrzrydiwiyroz
kbgyrjjasjobkgao tuodekdhmvshhsxtywoczyfuozu lvu lhscajovfhidrxtnsflqfyyhqzdyewxp
xufzmluzhwpdtcq dmxbmnhmorx ocieisbojwx ebhjg symogziwdgseyjbe atdsxskujahonjpinf
Validation set perplexity: 656.46
Average loss at step 100: 3.472159 learning rate: 10.000000
Minibatch perplexity: 9.90
Validation set perplexity: 11.25
Average loss at step 200: 2.100054 learning rate: 10.000000
Minibatch perplexity: 6.87
Validation set perplexity: 8.21
Average loss at step 300: 1.934950 learning rate: 10.000000
Minibatch perplexity: 6.06
Validation set perplexity: 8.19
Average loss at step 400: 1.838535 learning rate: 10.000000
Minibatch perplexity: 6.95
Validation s

Validation set perplexity: 7.06
Average loss at step 4500: 1.581955 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 6.87
Average loss at step 4600: 1.571693 learning rate: 10.000000
Minibatch perplexity: 4.90
Validation set perplexity: 6.70
Average loss at step 4700: 1.571189 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 7.20
Average loss at step 4800: 1.562417 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 7.03
Average loss at step 4900: 1.555225 learning rate: 10.000000
Minibatch perplexity: 4.24
Validation set perplexity: 6.93
Average loss at step 5000: 1.537862 learning rate: 1.000000
Minibatch perplexity: 4.27
nish poss two five one nine two six childenes merian new york other nominary stem
aw are two two four provionic these stams first as ssouran also fuel is state ori
zwychance famous public one nine one nine nine three eight and members by story d
pview worried on proper euverobymal

### Add dropout

In [33]:
bigram_size = vocabulary_size * vocabulary_size
size = num_unrollings
num_nodes = 64

embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
    # Biases
    labels = tf.placeholder(tf.int32, shape=(batch_size, embedding_size))

    bias = tf.Variable(tf.zeros([1, 4*num_nodes]))
    
    big_i_matrix = tf.Variable(tf.truncated_normal([embedding_size, 4 * num_nodes], -0.1, 0.1))
    big_o_matrix = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size * vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size * vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
  
        out_1 = tf.matmul(i, big_i_matrix)
        out_2 = tf.matmul(o, big_o_matrix)
        
        summ = out_1 + out_2 + bias
        
        slice_1, slice_2, slice_3, slice_4 = tf.split(summ, 
                            [num_nodes, num_nodes, num_nodes, num_nodes], axis=1)

        
        input_gate = tf.sigmoid(slice_1)
        forget_gate = tf.sigmoid(slice_2)
        update = slice_3
        output_gate = tf.sigmoid(slice_4)
        
        state = forget_gate * state + input_gate * tf.tanh(update)
        
        
        return output_gate * tf.tanh(state), state

    # Input data.

    
    train_data = list()

    embeds = list()
    
    train_labels = list()
    embeddings = tf.Variable(tf.random_uniform([bigram_size, embedding_size], -1.0, 1.0))
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    
    
    for i in range(size + 1):
        train_data.append(tf.placeholder(tf.int32, shape=[batch_size, bigram_size]))
        train_labels.append( train_data[i])
        embedding = tf.nn.embedding_lookup(embeddings, tf.argmax(train_data[i], axis = 1))
        embeds.append(embedding)        

    embeds = embeds[:size]
    train_labels = train_labels[1:]
    

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state

    for i in embeds:
        input_drop = tf.nn.dropout(i, 0.7)
        output, state = lstm_cell(input_drop, output, state)
        output_drop = tf.nn.dropout(output, 0.7)
        outputs.append(output_drop)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                        saved_state.assign(state)]):
        
        
        
        # Classifier.
        
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(\
                    labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    

    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    
    
    # Sampling and validation eval: batch 1, no unrolling.

    sample_input = tf.placeholder(tf.int32, shape=[1])   
  
    sample_embed = tf.nn.embedding_lookup(embeddings, sample_input)
    
    print (sample_embed_2.shape)
    

    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])), \
            saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell( \
            sample_embed, saved_sample_output, saved_sample_state)
    
    
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
        
        print (sample_prediction.shape)
        

(1, 128)
(1, 729)


In [34]:
bigram_size = vocabulary_size * vocabulary_size
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(size + 1):
            feed_dict[train_data[i]] = batches[i]
        
        _, l, predictions, lr = session.run( \
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print( \
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:size + 1])
            print('Minibatch perplexity: %.2f' % float( \
                 np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = bigramSample(bigram_random_distribution())
                    sentence = ''.join(bigramCharacters(feed)[0])
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed.argmax(axis=1)})
                        feed = bigramSample(prediction)
                        sentence += bigramCharacters(feed)[0][1]
                    print(sentence)
                print('=' * 80)
        # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0].argmax(axis=1)})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp( \
                 valid_logprob / valid_size)))

Initialized
Average loss at step 0: 6.595977 learning rate: 10.000000
Minibatch perplexity: 732.14
ruquoulhzkhvreqvcnvkzlainmcigclxzlr h  ojkxkoaixxlsltnhzdmtfsv etqofehnejvnbqltgc
jt kjavijxbppivfgntfqrgegpphqkotedgor doithcwvizccmopbrmjwsrxkjzdhbnzjohgbfedmcvj
oryfmhcznjzblvafvdchfcgkuz mdeuhfombgluxygybqa pp  rgexjwpe cif kgrdmqdthpqfuneno
xpixiet yxdxhxtcdhdczyaheciaxcqxrbsgnsafgz diqnsbntsvuyjqlt j dteffjenoqlmcnsbxua
fvwmifjntxegtvomsvnyhgx socxsvdrjdusm fopkhixdrel tpmptonqke ptlmxq znjzmxvztfsbp
Validation set perplexity: 638.09
Average loss at step 100: 3.978602 learning rate: 10.000000
Minibatch perplexity: 15.52
Validation set perplexity: 13.39
Average loss at step 200: 2.540047 learning rate: 10.000000
Minibatch perplexity: 10.44
Validation set perplexity: 9.91
Average loss at step 300: 2.324323 learning rate: 10.000000
Minibatch perplexity: 8.53
Validation set perplexity: 9.10
Average loss at step 400: 2.165830 learning rate: 10.000000
Minibatch perplexity: 8.26
Validation

Validation set perplexity: 6.74
Average loss at step 4500: 1.811930 learning rate: 10.000000
Minibatch perplexity: 5.96
Validation set perplexity: 6.59
Average loss at step 4600: 1.828881 learning rate: 10.000000
Minibatch perplexity: 6.21
Validation set perplexity: 6.61
Average loss at step 4700: 1.818974 learning rate: 10.000000
Minibatch perplexity: 6.52
Validation set perplexity: 6.64
Average loss at step 4800: 1.820038 learning rate: 10.000000
Minibatch perplexity: 6.06
Validation set perplexity: 6.48
Average loss at step 4900: 1.813173 learning rate: 10.000000
Minibatch perplexity: 6.21
Validation set perplexity: 6.80
Average loss at step 5000: 1.814700 learning rate: 1.000000
Minibatch perplexity: 6.51
aese bument celeck and noter in mele in a desk afand maing battle a kings refence
hrut echns in one five deplayed hellestaining in chek saright of in publiving han
hpatible to ralibed deveal linesity to between and recensive svelizzas wards iges
cmyeas callene of proving priment y

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---