Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [11]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data("d:/trump.zip")
print('Data size %d' % len(text))

Data size 3402339


Create a small validation set.

In [12]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

3401339  new book, The Trump Card: http://tinyurl.com/ycsqmda
"A lot of
1000 From Donald Trump: Wishing everyone a wonderful holiday & a happ


Utility functions to map characters to vocabulary IDs and back.

In [13]:
# MHC: This code was added by me, to support texts with an arbitrary set of characters

letters = ''.join(sorted(set(text[0:100000])))
lettersToIndex = {}

for i in range(0, len(letters)):
    lettersToIndex[letters[i]] = i 

questionMark = lettersToIndex['?']
vocabulary_size = len(letters)
    
print("Symbols found: " + str(len(letters)))
print("Symbols: " + letters)

def char2id(char):
  return lettersToIndex.get(char, questionMark)
  
def id2char(dictid):
  return letters[dictid]


print(char2id('Ø'), char2id('ø'), char2id(' '), char2id('ï'))
print(id2char(char2id('Ø')), id2char(56), id2char(23))

Symbols found: 92
Symbols: 
 !"#$%&'()+,-./0123456789:=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~ –‘’“”…
29 29 2 29
? Z 6
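
The lookup above sends any character outside the vocabulary to the ID of `'?'`, which is why `'Ø'`, `'ø'` and `'ï'` all print 29. A minimal standalone sketch of that idea follows; `build_vocab`, `to_id`, `to_char` and the toy string are illustrative names, not part of the notebook, and the fallback symbol is added to the vocabulary explicitly so the lookup cannot fail when the sample text happens to contain no `'?'`.

```python
# Sketch of a character vocabulary with a fallback symbol (hypothetical helper names).

def build_vocab(sample_text, fallback='?'):
    """Return (symbols, index) built from the distinct characters of sample_text."""
    # Adding the fallback explicitly guarantees that index[fallback] exists.
    symbols = ''.join(sorted(set(sample_text) | {fallback}))
    index = {ch: i for i, ch in enumerate(symbols)}
    return symbols, index

symbols, index = build_vocab("hello world?")

def to_id(ch):
    # Characters outside the vocabulary map to the fallback's ID.
    return index.get(ch, index['?'])

def to_char(i):
    return symbols[i]

print(len(symbols))                              # 9 distinct symbols
print(to_char(to_id('h')), to_char(to_id('Ø')))  # h ?
```

The round trip is lossless only for characters that were seen when the vocabulary was built; everything else comes back as `'?'`.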


Function to generate a training batch for the LSTM model.

In [14]:
batch_size=64*4
num_unrollings=40

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    # np.float was an alias for the builtin float and is gone from recent NumPy.
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float64)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      #print(self._text[self._cursor[b]])
      #print(ord(self._text[self._cursor[b]]))
      #print("C:" + str(char2id(self._text[self._cursor[b]])))
        
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

batch = np.zeros(shape=(10, vocabulary_size), dtype=np.float64)


print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

[' new book, The Trump Card: http://tinyurl', 'nother....http://www.trump.com/Golf_Clubs', 'tate, local, and federal taxes.” #TimeToG', '\nWhy does @BarackObama continue to defend', '’s ‘recovery’? http://t.co/laeI0SFA\r\n@Bar', 'ave traveled the world. America is the mo', 'ting America all of ours.\r\nFast and Furio', 'rnationally." - US Senator @BarackObama, ', 'bit.ly/ff8tRT\r\nCheck out ShouldTrumpRun..', 'ptcy.\r\n@derekwilkinson @daniellecfriel Th', ' performance was great last night--BQ wil', 'stakes but the good decisions and insight', ' Scottish course are already double our p', 'ove to have my ratings?\r\nThe only thing m', '-not very professional.\r\nWe should not al', 'Stunt\'" http://t.co/KY2wgyz4 via @eonline', 'nings to attack, especially when Obama st', 'p? how destructive they are. Windmills ar', 's until the election. How many illegal do', 'election than his @nyjets win games shows', 'have many times over.\r\n"Amazing Race" win', 'worst President we have ever had.\r\n
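
The batching scheme can be easier to follow on a toy string. The sketch below mirrors the cursor logic of `BatchGenerator` but keeps plain characters instead of one-hot rows; `ToyBatcher` and the 12-character text are assumptions made for illustration only.

```python
# Sketch of the cursor layout used for batching, on a toy string.

class ToyBatcher:
    def __init__(self, text, batch_size, num_unrollings):
        self.text = text
        self.num_unrollings = num_unrollings
        # Each batch row starts at the beginning of its own equal-sized segment.
        segment = len(text) // batch_size
        self.cursors = [offset * segment for offset in range(batch_size)]
        self.last = self._step()

    def _step(self):
        """Read one character per cursor, then advance every cursor (wrapping)."""
        chars = [self.text[c] for c in self.cursors]
        self.cursors = [(c + 1) % len(self.text) for c in self.cursors]
        return chars

    def next(self):
        """Return num_unrollings + 1 time steps, starting with the previous last step."""
        steps = [self.last]
        for _ in range(self.num_unrollings):
            steps.append(self._step())
        self.last = steps[-1]
        # Transpose so that each string is one batch row read across time.
        return [''.join(row) for row in zip(*steps)]

toy = ToyBatcher("abcdefghijkl", batch_size=3, num_unrollings=2)
first = toy.next()
second = toy.next()
print(first)   # ['abc', 'efg', 'ijk']
print(second)  # ['cde', 'ghi', 'kla']
```

Each row of the second call starts with the last character of the first call. That overlap is what lets the labels be the inputs shifted by one time step without losing a character at the boundary.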

In [15]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float64)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
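
`logprob` returns the average negative log-probability assigned to the true labels, and the training loop reports its exponential as perplexity. A small pure-Python sketch of that relationship follows; the helper names and toy distributions are illustrative, not taken from the notebook.

```python
import math

def mean_neg_log_prob(predictions, labels, floor=1e-10):
    """Average of -log p(true class) over the rows of a batch."""
    total = 0.0
    for probs, onehot in zip(predictions, labels):
        for p, y in zip(probs, onehot):
            # Flooring avoids log(0), as logprob does with its 1e-10 clamp.
            total += y * -math.log(max(p, floor))
    return total / len(labels)

def perplexity(predictions, labels):
    return math.exp(mean_neg_log_prob(predictions, labels))

# A model that predicts a uniform distribution over 4 symbols.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
labels = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
print(perplexity(uniform, labels))  # approximately 4
```

A uniform guess over N symbols has perplexity N, which is consistent with the untrained model in this notebook reporting a minibatch perplexity of 92.61 over its 92-symbol vocabulary.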

Simple LSTM Model.

In [16]:
num_nodes = 64*2

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.5, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
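
Two pieces of the optimizer setup are worth spelling out numerically: the staircase learning-rate schedule and clipping by global norm. The sketch below re-implements both formulas in plain Python on toy values; it is an illustration of the arithmetic, not TensorFlow's code, and the function names are chosen to echo the graph above.

```python
import math

def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Rate is multiplied by decay_rate once every decay_steps steps (integer division)."""
    return initial_rate * decay_rate ** (step // decay_steps)

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so that their combined L2 norm is at most clip_norm.

    grads is a list of flat lists of floats, one list per variable.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # When the norm is already below clip_norm the scale is 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

# Schedule used in the graph: start at 10.0 and halve every 5000 steps.
print(staircase_decay(10.0, 4999, 5000, 0.5))   # 10.0
print(staircase_decay(10.0, 5000, 5000, 0.5))   # 5.0
print(staircase_decay(10.0, 12000, 5000, 0.5))  # 2.5

# Two toy gradients with global norm 5.0, clipped to norm 1.25.
clipped, norm = clip_by_global_norm([[3.0], [4.0]], 1.25)
print(norm, clipped)  # 5.0 [[0.75], [1.0]]
```

Because every gradient is scaled by the same factor, clipping changes the step size but not the direction of the update.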

In [17]:
num_steps = 70001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  saver = tf.train.Saver()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:

        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
        save_path = saver.save(session, "d:/trump.model")
        print("Model saved in file: %s" % save_path)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 4.528353 learning rate: 10.000000
Minibatch perplexity: 92.61
sz"s’xUL NO~Gnw”tIigP _VLX16Mt)yy9'Agqb7P”I“K~FMFod
hNlOFW4…H“A./5TW0+L%mhC6…Aq5
&gPMohG~8:S WO’,!j!d-X
CAAfm?bPQ$a%(“9_4P7
#’k'–G
y:G"!6
‘9f%2tc Yym%JahX'TQx ~yzEA5G/1PVQg
 sBax@(x3B"–#x:~z9E'zc7xyy
ZQsuK~-)9GNI_U!K+j…fEnlrl4YY7=28AqhA1hz0V‘C_2iHpu? &OCj.ME#% j96@R# -v2–6 7V(s Q
@FKSm ?FgaUADa…H@H0_TLzLjHU#7bB"9T#wrA‘pM0u.IV!ZcdH8Qz‘3k:83S
Model saved in file: d:/trump.model
Validation set perplexity: 94.54
Average loss at step 100: 3.542754 learning rate: 10.000000
Minibatch perplexity: 27.54
Validation set perplexity: 27.82
Average loss at step 200: 3.136698 learning rate: 10.000000
Minibatch perplexity: 19.35
Validation set perplexity: 18.42
Average loss at step 300: 2.745489 learning rate: 10.000000
Minibatch perplexity: 12.95
Validation set perplexity: 13.98
Average loss at step 400: 2.528569 learning rate: 10.000000
Minibatch perplexity: 10.75
Validation set perplexity: 12.55
Averag

KeyboardInterrupt: 

In [19]:
# MHC: Code to load the session and generate samples.

with tf.Session(graph=graph) as session:
  saver = tf.train.Saver()
  saver.restore(session, "d:/trump.model")
  print("Model restored.")

  # Generate some samples.
  print('=' * 80)
  for _ in range(5):
    feed = sample(random_distribution())
    sentence = characters(feed)[0]
    reset_sample_state.run()
    for _ in range(5779):
      prediction = sample_prediction.eval({sample_input: feed})
      feed = sample(prediction)
      sentence += characters(feed)[0]
    print(sentence)
  print('=' * 80)

INFO:tensorflow:Restoring parameters from d:/trump.model
Model restored.
8"
"@Keelwalfb: @realDonaldTrump Poys with @NigranKpits  Jo enTish neare. You're a best about 
Thank you, has a @PMFT, is would be the new fir things donate! Mothers..
"Donald Trump is political Alls, with Blue AMERICA has a bassers &amp? ay a person by incredible that exolte to marg time.
.@HouleFP. @realDonaldTrump endorsement their fanality, we smart guy by Christmas was #Opines IPandia Universe, for it.!!!!?"
"@condc284: @realDonaldTrump Y is hard to beging charity for the fhield'"--Jer Mireluca, Ben, when it is tough elsily compet me, you would sue thinks and other country, say-dellane.
"@phir=16: @realDonaldTrump you are my forming soled at your of not one, doners!
@pufals4976 @FellProntabim on @BanyPoeinal Dollack @realDonaldTrump"
"@foxans_Glant3: @realDonaldTrump I live it is a fasting of HillaryClints "TrumpCCierians’ calls #Trump2016" @MorningOf is just to smart!".? Do not borred on antic run &amp? turn

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
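
One way to see why the simplification is valid: concatenating the four per-gate weight matrices along the column axis and splitting the result gives the same pre-activations as four separate products. The NumPy sketch below checks this on toy sizes. It is an illustration under assumed shapes and random values, not a solution written against the TensorFlow graph above, where the analogous step would be a single larger variable followed by something like `tf.split`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, vocab, nodes = 4, 5, 3   # toy sizes, chosen only for illustration

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Separate per-gate parameters, in the order: input, forget, cell update, output.
xs = [rng.uniform(-0.1, 0.1, (vocab, nodes)) for _ in range(4)]
ms = [rng.uniform(-0.1, 0.1, (nodes, nodes)) for _ in range(4)]
bs = [rng.uniform(-0.1, 0.1, (1, nodes)) for _ in range(4)]

def cell_separate(i, o, state):
    """Eight matrix products: one per gate for the input and for the previous output."""
    ig, fg, upd, og = [i @ x + o @ m + b for x, m, b in zip(xs, ms, bs)]
    state = sigmoid(fg) * state + sigmoid(ig) * np.tanh(upd)
    return sigmoid(og) * np.tanh(state), state

# Fused parameters: the same values, stacked side by side (4 times wider).
X = np.concatenate(xs, axis=1)   # shape (vocab, 4 * nodes)
M = np.concatenate(ms, axis=1)   # shape (nodes, 4 * nodes)
B = np.concatenate(bs, axis=1)   # shape (1, 4 * nodes)

def cell_fused(i, o, state):
    """Two matrix products, then a split back into the four gate pre-activations."""
    ig, fg, upd, og = np.split(i @ X + o @ M + B, 4, axis=1)
    state = sigmoid(fg) * state + sigmoid(ig) * np.tanh(upd)
    return sigmoid(og) * np.tanh(state), state

# Toy inputs: one-hot rows, plus a random previous output and state.
inputs = np.eye(vocab)[rng.integers(0, vocab, size=batch)]
prev_out = rng.normal(size=(batch, nodes))
prev_state = rng.normal(size=(batch, nodes))

out_sep, state_sep = cell_separate(inputs, prev_out, prev_state)
out_fused, state_fused = cell_fused(inputs, prev_out, prev_state)
print(np.allclose(out_sep, out_fused), np.allclose(state_sep, state_fused))
```

The gate order used for concatenation has to match the order used when splitting; otherwise the shapes still line up but the gates are silently swapped.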

---

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings would produce a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).
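
As a rough illustration of parts (a) and (b), a bigram can be given a single integer ID, and an embedding lookup is then just a row selection from a table, which avoids building the sparse one-hot vector at all. The sketch below uses a 4-symbol toy alphabet and a random table; all names and sizes are assumptions. With the 92-symbol vocabulary found earlier in this notebook there would be 92 * 92 = 8464 possible bigrams. In TensorFlow the row selection would typically be done with `tf.nn.embedding_lookup`.

```python
import random

alphabet = "abc "                       # toy alphabet, assumed for illustration
vocab_size = len(alphabet)
char_id = {ch: i for i, ch in enumerate(alphabet)}

def bigram_id(pair):
    """Map a 2-character string to one integer in [0, vocab_size ** 2)."""
    return char_id[pair[0]] * vocab_size + char_id[pair[1]]

def id_to_bigram(i):
    return alphabet[i // vocab_size] + alphabet[i % vocab_size]

def text_to_bigrams(text):
    """Split text into non-overlapping pairs; a trailing single character is dropped."""
    return [text[k:k + 2] for k in range(0, len(text) - 1, 2)]

# Embedding table: one dense row per possible bigram (random values here;
# in a model these rows would be trained).
embedding_size = 3
random.seed(0)
embeddings = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
              for _ in range(vocab_size * vocab_size)]

def embed(pair):
    # Row selection gives the same result as one-hot times table, without the sparse vector.
    return embeddings[bigram_id(pair)]

pairs = text_to_bigrams("abca")
print(pairs, [bigram_id(p) for p in pairs])   # ['ab', 'ca'] [1, 8]
print(len(embed('ab')))                       # 3
```

The LSTM input dimension then becomes `embedding_size` rather than the number of bigrams, while the output classifier can still predict over whichever label vocabulary is chosen.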

---

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
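
Training such a model needs (input, target) pairs. A minimal sketch of a target generator follows, assuming words are separated by single spaces; `mirror_words` is an illustrative helper name, not something defined in the assignment.

```python
def mirror_words(sentence):
    """Reverse the characters of each word while keeping the word order."""
    # Splitting on a literal space preserves runs of spaces as empty words.
    return ' '.join(word[::-1] for word in sentence.split(' '))

source = "the quick brown fox"
target = mirror_words(source)
print(target)  # eht kciuq nworb xof
```

Applying the function twice returns the original sentence, which makes it easy to sanity-check generated training pairs.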

---

In [29]:
import os
import sys
walk_dir = "f:/clc trunk"

count = 0

with open('d:/javacode.dump', 'w') as outfile:
  for root, subdirs, files in os.walk(walk_dir):
    for f in files:
      if f.endswith(".java"):
        count += 1
        if count % 100 == 0:
          print("Files so far: " + str(count))
        with open(os.path.join(root, f)) as infile:
          try:
            for line in infile:
              outfile.write(line)
          except UnicodeDecodeError:
            # Skip the rest of any file that cannot be decoded.
            pass

print("Files: " + str(count))

Files so far: 100
Files so far: 200
Files so far: 300
Files so far: 400
Files so far: 500
Files so far: 600
Files so far: 700
Files so far: 800
Files so far: 900
Files so far: 1000
Files so far: 1100
Files so far: 1200
Files so far: 1300
Files so far: 1400
Files so far: 1500
Files so far: 1600
Files so far: 1700
Files so far: 1800
Files so far: 1900
Files so far: 2000
Files so far: 2100
Files so far: 2200
Files so far: 2300
Files so far: 2400
Files so far: 2500
Files so far: 2600
Files so far: 2700
Files so far: 2800
Files so far: 2900
Files so far: 3000
Files so far: 3100
Files so far: 3200
Files so far: 3300
Files so far: 3400
Files so far: 3500
Files so far: 3600
Files so far: 3700
Files so far: 3800
Files so far: 3900
Files so far: 4000
Files so far: 4100
Files so far: 4200
Files so far: 4300
Files so far: 4400
Files so far: 4500
Files so far: 4600
Files so far: 4700
Files so far: 4800
Files so far: 4900
Files so far: 5000
Files so far: 5100
Files so far: 5200
Files so far: 5300
Fi