Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model on the [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# import modules.
import os, random, string, zipfile
import tensorflow as tf
import numpy as np
from urllib.request import urlretrieve
from tqdm import tqdm

In [2]:
data_root = './dataset/'
url = 'http://mattmahoney.net/dc/'

class TqdmUpTo(tqdm):
    def update_to(self, count=1, blockSize=1, totalSize=None):
        if totalSize is not None:
            self.total = totalSize
        # It will also set self.n = count * blockSize
        self.update(count * blockSize - self.n)

def download_file(filename, expected_bytes=None, force=False):
    os.makedirs(data_root, exist_ok=True)  # make sure the target directory exists
    dest_filename = os.path.join(data_root, filename)
    if force or not os.path.exists(dest_filename):
        print('Download: %s' % filename)
        with TqdmUpTo(unit='B', unit_scale=True, unit_divisor=1024, miniters=1) as t:
            dest_filename, _ = urlretrieve(url+filename, dest_filename,
                                 reporthook=t.update_to, data=None)
        print('\n%s Download Complete!' % filename)
        
    if expected_bytes:
        statinfo = os.stat(dest_filename)
        not_expected_bytes_error = 'Failed to verify ' + dest_filename + '. Can you get to it with a browser?'
        assert statinfo.st_size == expected_bytes, not_expected_bytes_error
        
    return dest_filename

file = download_file('text8.zip', force=False)

In [3]:
def read_file(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0]),
                                encoding='utf-8')
    return data

text = read_file(file)
print('Data size %d' % len(text))

Data size 100000000


In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '
    
print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  
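
The ID mapping above is what the batch generator later turns into 1-hot rows. As a minimal sketch of that round trip (the names `VOCAB`, `encode_one_hot` and `decode_one_hot` are illustrative and not part of the notebook), a string can be encoded to a 1-hot matrix and recovered with an argmax:

```python
import string
import numpy as np

# Index 0 is the space, indices 1..26 are 'a'..'z', matching char2id above.
VOCAB = ' ' + string.ascii_lowercase

def encode_one_hot(text):
    """Return a (len(text), 27) 1-hot matrix; unknown characters map to index 0."""
    ids = [VOCAB.index(c) if c in VOCAB else 0 for c in text]
    one_hot = np.zeros((len(text), len(VOCAB)), dtype=np.float32)
    one_hot[np.arange(len(text)), ids] = 1.0
    return one_hot

def decode_one_hot(one_hot):
    """Invert encode_one_hot by taking the argmax of each row."""
    return ''.join(VOCAB[i] for i in np.argmax(one_hot, axis=1))

encoded = encode_one_hot('a cat')
decoded = decode_one_hot(encoded)
print(encoded.shape, repr(decoded))
```

Each row contains exactly one 1.0, so the argmax recovers the original character ID.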


Function to generate a training batch for the LSTM model.

In [6]:
batch_size = 64
num_unrollings = 10

class BatchGenerator(object):
    
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
        
    def _next_batch(self):
        '''Generate a single batch from the current cursor position in the data.'''
        # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
    
    def next(self):
        ''' Generate the next array of batches from the data. The array
        consists of the last batch of the previous array, followed by num_unrollings new ones.
        '''
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches
    
def characters(probabilities):
    ''' Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation.
    '''
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    ''' 
    Convert a sequence of batches back into their (most likely) string representation.
    '''
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
[' a']
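
The output above shows that each of the 64 rows reads from a different region of the text. The following is a small pure-Python sketch of the same cursor logic on a toy string (the helper `toy_batches` is hypothetical and returns characters rather than 1-hot arrays). It illustrates two properties of `BatchGenerator`: cursors start one segment apart, and consecutive calls overlap by one character so that the last input of one call becomes the first input of the next.

```python
def toy_batches(text, batch_size, num_unrollings, calls=2):
    """Mimic BatchGenerator on plain characters and return one string per row."""
    segment = len(text) // batch_size
    cursors = [offset * segment for offset in range(batch_size)]

    def next_chars():
        # Read one character per row, then advance every cursor (with wrap-around).
        chars = [text[c] for c in cursors]
        for b in range(batch_size):
            cursors[b] = (cursors[b] + 1) % len(text)
        return chars

    last = next_chars()
    results = []
    for _ in range(calls):
        steps = [last] + [next_chars() for _ in range(num_unrollings)]
        last = steps[-1]
        # Transpose from "list of time steps" to "one string per batch row".
        results.append([''.join(row) for row in zip(*steps)])
    return results

first, second = toy_batches('abcdefghijklmnopqrstuvwx', batch_size=3, num_unrollings=4)
print(first)   # rows start at offsets 0, 8 and 16
print(second)  # each row starts with the last character of the previous call
```

With a 24-character text and 3 rows, the segment length is 8, so the rows begin at `a`, `i` and `q`; the third row wraps around to the start of the text on the second call.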


In [7]:
valid_text

' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institut

In [8]:
print(train_batches.next()[1].shape)
print(len(train_text) // batch_size)
print(len(string.ascii_lowercase))

(64, 27)
1562484
26


In [9]:
def logprob(predictions, labels):
    ''' Log probability of the true labels in a predicted batch.'''
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    ''' Sample one element from a distribution assumed to be an array of
    normalized probabilities.
    '''
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    ''' Turn a (column) prediction into 1-hot encoded samples.'''
    # np.float was removed in NumPy 1.24; np.float64 is the same dtype.
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float64)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    ''' Generate a random column of probability.'''
    b = np.random.uniform(0., 1., size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:, None]
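
The training loop reports perplexity as `exp` of the value returned by `logprob`, i.e. the exponential of the mean negative log-likelihood. As a rough sanity check, sketched here with a non-mutating variant named `mean_neg_logprob` (an illustrative name, not from the notebook), a uniform guess over the 27 characters gives a perplexity of 27, while putting probability 0.9 on the correct character gives about 1.11:

```python
import numpy as np

def mean_neg_logprob(predictions, labels):
    """Same formula as logprob above, but clips a copy instead of editing in place."""
    predictions = np.clip(predictions, 1e-10, None)
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

vocabulary_size = 27
# Four arbitrary 1-hot target characters.
labels = np.eye(vocabulary_size)[[1, 14, 0, 3]]

# A model that knows nothing: every character is equally likely.
uniform = np.full(labels.shape, 1.0 / vocabulary_size)
# A model that puts 0.9 on the right character and spreads 0.1 over the rest.
confident = 0.9 * labels + (0.1 / (vocabulary_size - 1)) * (1.0 - labels)

perplexity_uniform = float(np.exp(mean_neg_logprob(uniform, labels)))
perplexity_confident = float(np.exp(mean_neg_logprob(confident, labels)))
print('uniform: %.2f, confident: %.2f' % (perplexity_uniform, perplexity_confident))
```

Perplexity can be read as the effective number of characters the model is choosing between at each step, which gives a scale for the values printed during training.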

In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Forget gate:
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell:
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Output gate:
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases:
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        ''' Create an LSTM cell. See e.g. http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates (the peephole connections).
        '''
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state
    
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]
        
    # Unrolled LSTM loop
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
        
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        # tf.nn.xw_plus_b(x, w, b) computes tf.matmul(x, w) + b
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        _loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(
                logits=logits, labels=tf.concat(train_labels, 0)))
        
    # Optimizer.
    global_step = tf.Variable(0)
    # tf.train.exponential_decay:
    # decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
    # With staircase=True the exponent uses integer division, so the rate drops in steps.
    _learning_rate = tf.train.exponential_decay(10., global_step, 10000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(_learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(_loss))
    # tf.clip_by_global_norm: args:(t_list, clip_norm, use_norm=None, name=None)
    # t_list[i] = t_list[i] * clip_norm / max(global_norm, clip_norm)
    # global_norm = sqrt(sum([l2norm(t)**2 for t in t_list]))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, 
        saved_sample_output,
        saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), 
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

W0906 17:39:49.529324 140651331675968 deprecation.py:323] From /home/commaai-03/.local/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py:286: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
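
The comments in the cell above give the formula behind `tf.clip_by_global_norm`. A minimal NumPy sketch of that formula follows; it is an illustration of the documented behavior, not TensorFlow's implementation, and the gradient values are made up for the example:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients by one shared factor so their joint L2 norm is <= clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Two made-up gradient tensors whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.25)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)

# Gradients that are already small pass through unchanged (scale == 1).
small, _ = clip_by_global_norm([np.array([0.3, 0.4])], clip_norm=1.25)
print(small[0])
```

Because every tensor is multiplied by the same factor, the direction of the overall update is preserved and only its length is capped.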


In [11]:
steps = 30000
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('[Tensorflow]: Initialized!')
    mean_loss = 0
    for step in range(steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, loss, predictions, learning_rate = session.run(
            [optimizer, _loss, train_prediction, _learning_rate], feed_dict=feed_dict)
        # print('Loss: %f' % loss)
        mean_loss += loss
        if (step+1) % summary_frequency == 0:
            mean_loss /= summary_frequency
            print('Average loss at step %d: %f learning_rate: %f' % (step+1, mean_loss, learning_rate))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if (step+1) % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('-' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('-' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob += logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
            valid_logprob / valid_size)))

[Tensorflow]: Initialized!
Average loss at step 100: 2.593915 learning_rate: 10.000000
Minibatch perplexity: 9.97
Validation set perplexity: 11.04
Average loss at step 200: 2.247275 learning_rate: 10.000000
Minibatch perplexity: 9.17
Validation set perplexity: 8.57
Average loss at step 300: 2.105411 learning_rate: 10.000000
Minibatch perplexity: 7.56
Validation set perplexity: 7.77
Average loss at step 400: 2.004295 learning_rate: 10.000000
Minibatch perplexity: 7.70
Validation set perplexity: 7.33
Average loss at step 500: 1.945324 learning_rate: 10.000000
Minibatch perplexity: 6.61
Validation set perplexity: 6.93
Average loss at step 600: 1.913403 learning_rate: 10.000000
Minibatch perplexity: 6.14
Validation set perplexity: 6.79
Average loss at step 700: 1.861726 learning_rate: 10.000000
Minibatch perplexity: 6.50
Validation set perplexity: 6.51
Average loss at step 800: 1.823177 learning_rate: 10.000000
Minibatch perplexity: 6.59
Validation set perplexity: 6.26
Average loss at step

in mid two three d himsertique yeats in he bowo the slove inge be reakire that t
hmener one nine six six on eight six zero zero frop one zero zero zero two zero 
ges all of perption video s narivania acdider time the carry eventive deation bl
scard demee rescrees of similal both a centuen ita in calle palxin their day inv
--------------------------------------------------------------------------------
Validation set perplexity: 4.69
Average loss at step 5100: 1.613933 learning_rate: 10.000000
Minibatch perplexity: 5.30
Validation set perplexity: 4.50
Average loss at step 5200: 1.599040 learning_rate: 10.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.52
Average loss at step 5300: 1.589987 learning_rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.59
Average loss at step 5400: 1.588237 learning_rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 4.51
Average loss at step 5500: 1.574066 learning_rate: 10.000000
Minibatch perplexity: 

Validation set perplexity: 4.43
Average loss at step 9700: 1.583194 learning_rate: 10.000000
Minibatch perplexity: 4.42
Validation set perplexity: 4.39
Average loss at step 9800: 1.584433 learning_rate: 10.000000
Minibatch perplexity: 4.86
Validation set perplexity: 4.46
Average loss at step 9900: 1.576926 learning_rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 4.57
Average loss at step 10000: 1.597612 learning_rate: 10.000000
Minibatch perplexity: 5.22
--------------------------------------------------------------------------------
cas was more tillite lernazing hound reach a legid of beth phons cari slands yin
war amact creft and time drufings airct japa cufferio fish ot n has were to two 
hani ackanian four zero five zero zero zero zero zero zero zero zero zero four o
ur have ground have ir antical the adm for two five midor began kominaryly and a
quare external machile time five ravily air putper mribers at the collets sign a
---------------------------------

Validation set perplexity: 4.18
Average loss at step 14200: 1.525234 learning_rate: 1.000000
Minibatch perplexity: 4.18
Validation set perplexity: 4.18
Average loss at step 14300: 1.512345 learning_rate: 1.000000
Minibatch perplexity: 4.18
Validation set perplexity: 4.18
Average loss at step 14400: 1.516355 learning_rate: 1.000000
Minibatch perplexity: 4.39
Validation set perplexity: 4.16
Average loss at step 14500: 1.547681 learning_rate: 1.000000
Minibatch perplexity: 4.47
Validation set perplexity: 4.16
Average loss at step 14600: 1.526814 learning_rate: 1.000000
Minibatch perplexity: 4.70
Validation set perplexity: 4.17
Average loss at step 14700: 1.533858 learning_rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.17
Average loss at step 14800: 1.548777 learning_rate: 1.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.19
Average loss at step 14900: 1.569987 learning_rate: 1.000000
Minibatch perplexity: 5.71
Validation set perplexity: 4.18
Average 

zers alize the law were citshing the ch cruss gold a three seven instires friens
would offersh major summuters shoshest of commonly pater it maanolitically the b
jels and be goods compression of under as the mex introebium it is must be a nic
chotation of toliment in shate one nine six zing indian engineers american fath 
--------------------------------------------------------------------------------
Validation set perplexity: 4.12
Average loss at step 19100: 1.574572 learning_rate: 1.000000
Minibatch perplexity: 5.61
Validation set perplexity: 4.12
Average loss at step 19200: 1.568804 learning_rate: 1.000000
Minibatch perplexity: 4.14
Validation set perplexity: 4.09
Average loss at step 19300: 1.531537 learning_rate: 1.000000
Minibatch perplexity: 4.70
Validation set perplexity: 4.08
Average loss at step 19400: 1.531057 learning_rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.10
Average loss at step 19500: 1.527132 learning_rate: 1.000000
Minibatch perplexity: 

Validation set perplexity: 4.04
Average loss at step 23700: 1.535647 learning_rate: 0.100000
Minibatch perplexity: 4.60
Validation set perplexity: 4.03
Average loss at step 23800: 1.540849 learning_rate: 0.100000
Minibatch perplexity: 4.48
Validation set perplexity: 4.03
Average loss at step 23900: 1.542542 learning_rate: 0.100000
Minibatch perplexity: 4.67
Validation set perplexity: 4.03
Average loss at step 24000: 1.562620 learning_rate: 0.100000
Minibatch perplexity: 4.95
--------------------------------------------------------------------------------
pholey been squrush life the him poence as sucufaws of the large nut aid it a hi
jo for the moral found in one eight five john for the one zero doe usque was two
ates halkress from the businy deffwar and montry the densutured one of the germa
abich fount more great have theicical later was blood than one eight nine tent t
ust when be reacred on oxymay a reasons steak to the thomb and the studies accre
----------------------------------

Validation set perplexity: 4.02
Average loss at step 28200: 1.518078 learning_rate: 0.100000
Minibatch perplexity: 4.23
Validation set perplexity: 4.02
Average loss at step 28300: 1.528663 learning_rate: 0.100000
Minibatch perplexity: 4.81
Validation set perplexity: 4.02
Average loss at step 28400: 1.540534 learning_rate: 0.100000
Minibatch perplexity: 5.09
Validation set perplexity: 4.02
Average loss at step 28500: 1.542642 learning_rate: 0.100000
Minibatch perplexity: 4.93
Validation set perplexity: 4.02
Average loss at step 28600: 1.519566 learning_rate: 0.100000
Minibatch perplexity: 4.34
Validation set perplexity: 4.02
Average loss at step 28700: 1.541741 learning_rate: 0.100000
Minibatch perplexity: 5.31
Validation set perplexity: 4.02
Average loss at step 28800: 1.523173 learning_rate: 0.100000
Minibatch perplexity: 4.97
Validation set perplexity: 4.02
Average loss at step 28900: 1.540260 learning_rate: 0.100000
Minibatch perplexity: 4.77
Validation set perplexity: 4.02
Average 
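
The `learning_rate` column in the log above follows the staircase schedule configured in the graph (initial rate 10, decay rate 0.1 every 10000 steps). A small pure-Python sketch of that schedule, using a hypothetical helper `staircase_decay` in place of `tf.train.exponential_decay`:

```python
def staircase_decay(initial_rate, global_step, decay_steps, decay_rate):
    """Learning rate that is multiplied by decay_rate once every decay_steps steps."""
    return initial_rate * decay_rate ** (global_step // decay_steps)

schedule = {step: staircase_decay(10.0, step, 10000, 0.1)
            for step in (0, 9999, 10000, 19999, 20000, 29999)}
for step, rate in schedule.items():
    print('step %5d: learning_rate %f' % (step, rate))
```

The rate stays at 10 for the first 10000 updates, then drops to 1 and finally to 0.1, which lines up with the values printed around steps 14200 and 23700.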

In [12]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Forget gate:
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell:
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Output gate:
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Concatenate parameters
    sx = tf.concat([ix, fx, cx, ox], 1)
    sm = tf.concat([im, fm, cm, om], 1)
    sb = tf.concat([ib, fb, cb, ob], 1)
    
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases:
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        ''' Create an LSTM cell. See e.g. http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates (the peephole connections).
        '''
        
        smatmul = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
        # tf.split: arg(value, num_or_size_splits, axis=0, num=None)
        smatmul_input, smatmul_forget, update, smatmul_output = tf.split(smatmul, 4, 1)
        
        input_gate = tf.sigmoid(smatmul_input)
        forget_gate = tf.sigmoid(smatmul_forget)
        output_gate = tf.sigmoid(smatmul_output)
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state
    
    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]
        
    # Unrolled LSTM loop
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
        
    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        # tf.nn.xw_plus_b(x, w, b) computes tf.matmul(x, w) + b
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        _loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(
                logits=logits, labels=tf.concat(train_labels, 0)))
        
    # Optimizer.
    global_step = tf.Variable(0)
    # tf.train.exponential_decay:
    # decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
    # With staircase=True the exponent uses integer division, so the rate drops in steps.
    _learning_rate = tf.train.exponential_decay(10., global_step, 10000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(_learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(_loss))
    # tf.clip_by_global_norm: args:(t_list, clip_norm, use_norm=None, name=None)
    # t_list[i] = t_list[i] * clip_norm / max(global_norm, clip_norm)
    # global_norm = sqrt(sum([l2norm(t)**2 for t in t_list]))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
    
    # Predictions.
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, 
        saved_sample_output,
        saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output), 
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
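
The cell above replaces four per-gate matrix multiplications with one multiplication against the concatenated weights, followed by a split. The NumPy sketch below checks that the two forms agree; the random matrices are stand-ins for `ix`, `fx`, `cx` and `ox`, and the batch size of 8 is an arbitrary choice for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, num_nodes, batch = 27, 64, 8

# Stand-ins for the input, forget, cell and output gate input weights.
gate_weights = [rng.normal(size=(vocabulary_size, num_nodes)) for _ in range(4)]
x = rng.normal(size=(batch, vocabulary_size))

# Four separate products, one per gate.
separate = [x @ w for w in gate_weights]

# One product against the column-wise concatenation, then split into 4 equal parts.
fused = x @ np.concatenate(gate_weights, axis=1)
split = np.split(fused, 4, axis=1)

all_equal = all(np.allclose(a, b) for a, b in zip(separate, split))
print(fused.shape, all_equal)
```

The split order must match the concatenation order (input, forget, cell update, output), which is what the `tf.split` call in `lstm_cell` relies on.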

In [13]:
steps = 30000
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('[Tensorflow]: Initialized!')
    mean_loss = 0
    for step in range(steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, loss, predictions, learning_rate = session.run(
            [optimizer, _loss, train_prediction, _learning_rate], feed_dict=feed_dict)
        # print('Loss: %f' % loss)
        mean_loss += loss
        if (step+1) % summary_frequency == 0:
            mean_loss /= summary_frequency
            print('Average loss at step %d: %f learning_rate: %f' % (step+1, mean_loss, learning_rate))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if (step+1) % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('-' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('-' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob += logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
            valid_logprob / valid_size)))

[Tensorflow]: Initialized!
Average loss at step 100: 2.605803 learning_rate: 10.000000
Minibatch perplexity: 11.36
Validation set perplexity: 11.36
Average loss at step 200: 2.240572 learning_rate: 10.000000
Minibatch perplexity: 7.86
Validation set perplexity: 9.26
Average loss at step 300: 2.118685 learning_rate: 10.000000
Minibatch perplexity: 8.10
Validation set perplexity: 8.20
Average loss at step 400: 2.021401 learning_rate: 10.000000
Minibatch perplexity: 7.43
Validation set perplexity: 7.54
Average loss at step 500: 1.952224 learning_rate: 10.000000
Minibatch perplexity: 7.08
Validation set perplexity: 7.25
Average loss at step 600: 1.900028 learning_rate: 10.000000
Minibatch perplexity: 6.27
Validation set perplexity: 7.04
Average loss at step 700: 1.868704 learning_rate: 10.000000
Minibatch perplexity: 6.71
Validation set perplexity: 6.84
Average loss at step 800: 1.863541 learning_rate: 10.000000
Minibatch perplexity: 6.78
Validation set perplexity: 6.70
Average loss at ste

Validation set perplexity: 4.95
Average loss at step 5100: 1.609643 learning_rate: 10.000000
Minibatch perplexity: 5.62
Validation set perplexity: 4.84
Average loss at step 5200: 1.603761 learning_rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 4.69
Average loss at step 5300: 1.615446 learning_rate: 10.000000
Minibatch perplexity: 4.28
Validation set perplexity: 4.82
Average loss at step 5400: 1.610595 learning_rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 4.80
Average loss at step 5500: 1.596451 learning_rate: 10.000000
Minibatch perplexity: 4.53
Validation set perplexity: 4.76
Average loss at step 5600: 1.579964 learning_rate: 10.000000
Minibatch perplexity: 4.70
Validation set perplexity: 4.51
Average loss at step 5700: 1.599081 learning_rate: 10.000000
Minibatch perplexity: 5.22
Validation set perplexity: 4.52
Average loss at step 5800: 1.612227 learning_rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 4.58
Average 

nessible birs of agent hand altertina say namiler of dy in the name one wif s su
cleas to the quiter had contrary in the chrish our to studuing and only a monsap
z people calls one nine nine five but owdoral preswere into whol to i well stric
vers external horiz milliande in desplacembem baokaally ocabliander which of the
--------------------------------------------------------------------------------
Validation set perplexity: 4.56
Average loss at step 10100: 1.502186 learning_rate: 1.000000
Minibatch perplexity: 4.35
Validation set perplexity: 4.34
Average loss at step 10200: 1.528728 learning_rate: 1.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.28
Average loss at step 10300: 1.514551 learning_rate: 1.000000
Minibatch perplexity: 4.52
Validation set perplexity: 4.22
Average loss at step 10400: 1.556879 learning_rate: 1.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.22
Average loss at step 10500: 1.523859 learning_rate: 1.000000
Minibatch perplexity: 

Validation set perplexity: 4.16
Average loss at step 14700: 1.534939 learning_rate: 1.000000
Minibatch perplexity: 4.45
Validation set perplexity: 4.18
Average loss at step 14800: 1.498872 learning_rate: 1.000000
Minibatch perplexity: 3.96
Validation set perplexity: 4.18
Average loss at step 14900: 1.513227 learning_rate: 1.000000
Minibatch perplexity: 4.40
Validation set perplexity: 4.18
Average loss at step 15000: 1.520637 learning_rate: 1.000000
Minibatch perplexity: 5.10
--------------------------------------------------------------------------------
 and contra as seriestencies the amsish though at in adeken political concempule
quantlis india a devicise litting as bay tindhown production of rade two zero ze
ble its long wind his view carksn a book one one th compolesion rable in rnsanne
ors themselves and school suctian communities as nother fight for one seven eigh
m investifies given yonsallige the war frand starder believe ir kan joonng sumce
----------------------------------

Validation set perplexity: 4.20
Average loss at step 19200: 1.519959 learning_rate: 1.000000
Minibatch perplexity: 3.96
Validation set perplexity: 4.23
Average loss at step 19300: 1.518123 learning_rate: 1.000000
Minibatch perplexity: 5.37
Validation set perplexity: 4.21
Average loss at step 19400: 1.551327 learning_rate: 1.000000
Minibatch perplexity: 4.27
Validation set perplexity: 4.19
Average loss at step 19500: 1.527535 learning_rate: 1.000000
Minibatch perplexity: 4.23
Validation set perplexity: 4.18
Average loss at step 19600: 1.548040 learning_rate: 1.000000
Minibatch perplexity: 5.88
Validation set perplexity: 4.18
Average loss at step 19700: 1.544647 learning_rate: 1.000000
Minibatch perplexity: 4.46
Validation set perplexity: 4.18
Average loss at step 19800: 1.543935 learning_rate: 1.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.20
Average loss at step 19900: 1.542849 learning_rate: 1.000000
Minibatch perplexity: 4.56
Validation set perplexity: 4.18
Average 

was tind is an a zood mahwilific lankuogestel tigation zeorare drud ranist ameri
e or the kowed can prolpcelis is series of back resultural consider usal at move
y about the dur of explans that hereist kames joze off annimedufic jamo terr wil
quest theorny ehy law is in the entrian to west found anychres green of which am
--------------------------------------------------------------------------------
Validation set perplexity: 4.16
Average loss at step 24100: 1.546509 learning_rate: 0.100000
Minibatch perplexity: 4.47
Validation set perplexity: 4.16
Average loss at step 24200: 1.511245 learning_rate: 0.100000
Minibatch perplexity: 4.58
Validation set perplexity: 4.16
Average loss at step 24300: 1.518750 learning_rate: 0.100000
Minibatch perplexity: 5.39
Validation set perplexity: 4.16
Average loss at step 24400: 1.546119 learning_rate: 0.100000
Minibatch perplexity: 4.58
Validation set perplexity: 4.17
Average loss at step 24500: 1.541695 learning_rate: 0.100000
Minibatch perplexity: 

Validation set perplexity: 4.15
Average loss at step 28700: 1.542665 learning_rate: 0.100000
Minibatch perplexity: 4.59
Validation set perplexity: 4.15
Average loss at step 28800: 1.534351 learning_rate: 0.100000
Minibatch perplexity: 4.18
Validation set perplexity: 4.15
Average loss at step 28900: 1.558556 learning_rate: 0.100000
Minibatch perplexity: 4.69
Validation set perplexity: 4.16
Average loss at step 29000: 1.553240 learning_rate: 0.100000
Minibatch perplexity: 4.66
--------------------------------------------------------------------------------
peody in the preev of the world bakivancins during lettheurour rate film at the 
x and claimed the clop manifold some near paqued to president wode the two zero 
use by the uctaint has on estempical verylault cell different window germant tim
chmope of centuria and rimench main of war s manciuse the sellyn is college wemb
anes publicative coffest by has s may other into r connecture bab the live one n
----------------------------------