This page gives a brief description of LSTM (Long Short-Term Memory). The explanation follows http://colah.github.io/posts/2015-08-Understanding-LSTMs/. Compared with a plain RNN, the major difference is the design of the hidden layer. A plain RNN has a simple hidden layer: it multiplies the previous state and the current input by the hidden-layer weights and passes the result through an activation function. An LSTM has a more complex design with several gates (forget gate, input gate, candidate gate and output gate), which helps reduce the vanishing gradient problem, so the network can have more layers and use more inputs (i.e. more historical information).
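To make the gate structure concrete, here is a minimal NumPy sketch of a single LSTM time step, following the standard formulation from the post linked above. The function name `lstm_step`, the stacked weight matrix `W`, the bias `b` and the toy sizes are assumptions chosen for illustration; this is not the TensorFlow implementation used in the rest of this page.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative parameter layout, not TensorFlow's).

    x: (input_size,); h_prev, c_prev: (hidden_size,)
    W: (4 * hidden_size, input_size + hidden_size); b: (4 * hidden_size,)
    """
    n = h_prev.shape[0]
    # All four gates read the current input and the previous hidden state.
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much of the candidate to write
    g = np.tanh(z[2 * n:3 * n])   # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])   # output gate: how much of the cell state to expose
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

# Toy usage: run five random inputs through the cell.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.uniform(-0.05, 0.05, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

The cell state `c` is updated additively (`f * c_prev + i * g`), which is the part of the design that lets information and gradients survive over more time steps than in a plain RNN.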
Next I will use TensorFlow to implement an LSTM for prediction on the Penn Tree Bank (PTB) dataset. The program is modified from the example on the official TensorFlow website.

In [1]:
import numpy as np
import tensorflow as tf
import reader

DATA_PATH = "simple-examples/data/"
VOCAB_SIZE = 10000

HIDDEN_SIZE = 200
NUM_LAYERS = 2
LEARNING_RATE = 1.0
KEEP_PROB = 0.5
MAX_GRAD_NORM = 5

TRAIN_BATCH_SIZE = 20
TRAIN_NUM_STEP = 35

EVAL_BATCH_SIZE = 1
EVAL_NUM_STEP = 1
NUM_EPOCH = 2

class PTBModel:
    def __init__(self, is_training, batch_size, num_steps):
        self.batch_size = batch_size
        self.num_steps  = num_steps
       
        self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
        self.targets    = tf.placeholder(tf.int32, [batch_size, num_steps])
      
        stack_rnn = []
        for i in range(NUM_LAYERS):
            lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units = HIDDEN_SIZE, state_is_tuple = True, reuse=tf.get_variable_scope().reuse)
            if is_training:
                lstm_cell = tf.contrib.rnn.DropoutWrapper (lstm_cell, output_keep_prob = KEEP_PROB)
            stack_rnn.append(lstm_cell)
            
            
        cell = tf.contrib.rnn.MultiRNNCell(stack_rnn, state_is_tuple = True)
            
        self.initial_state = cell.zero_state(batch_size, tf.float32)
        
        embedding = tf.get_variable("embedding", [VOCAB_SIZE, HIDDEN_SIZE])
        # map each input word id to a HIDDEN_SIZE-dimensional embedding vector
        # (the embedding matrix is a trainable variable)
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)
        # inputs shape is batch_size * num_steps * HIDDEN_SIZE
        if is_training:
            inputs = tf.nn.dropout(inputs, KEEP_PROB)
        
        outputs = []
        state = self.initial_state
        with tf.variable_scope("RNN"):
            for time_step in range(num_steps):
                if time_step > 0: 
                    tf.get_variable_scope().reuse_variables() # share the same cell weights across all time steps
                cell_output, state = cell(inputs[:, time_step, :], state)
                outputs.append(cell_output)
        
        #print(outputs[0], len(outputs))
        output = tf.reshape(tf.concat(outputs, 1), [-1, HIDDEN_SIZE])
        # concatenate the per-step outputs and flatten to (batch_size * num_steps) * HIDDEN_SIZE
        #print(output.shape)

        # softmax layer
        softmax_weight = tf.get_variable("softmax_w", [HIDDEN_SIZE, VOCAB_SIZE])
        softmax_bias   = tf.get_variable("softmax_b", [VOCAB_SIZE])

        # loss
        logits = tf.matmul(output, softmax_weight) + softmax_bias # unnormalized scores over the vocabulary

        #print(self.targets.shape)
        #print(logits.shape)
        
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits], # (batch_size * num_steps) * VOCAB_SIZE
            [tf.reshape(self.targets, [-1])], # flattened targets, length batch_size * num_steps
            [tf.ones([batch_size * num_steps], dtype=tf.float32)]  # equal weight for every position
            )
        self.cost = tf.reduce_sum(loss) / batch_size
        self.final_state = state

        if not is_training: return
        trainable_variables = tf.trainable_variables()
        # get all variables for training
        #print(trainable_variables)

        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, trainable_variables), MAX_GRAD_NORM)
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)

        self.train_op = optimizer.apply_gradients(zip(grads, trainable_variables))
        
        
def run_epoch(session, model, data, train_op, output_log, epoch_size):
    total_costs = 0.0
    iters = 0
    state = session.run(model.initial_state)
       
    for step in range(epoch_size):
        x, y = session.run(data)
        # for example: x = [ 32, 3, 44, 55, 66, 4554, 554 ], then y = [3, 44, 55, 66, 4554, 554, xxxx]
        # print(x.shape)
        cost, state, _ = session.run([model.cost, model.final_state, train_op], {model.input_data: x, model.targets: y, model.initial_state: state})
        total_costs += cost
        iters += model.num_steps
        
        if output_log and step % 100 == 0:
            print("After %d steps, perplexity is %.3f"%(step, np.exp(total_costs / iters)))
            
    return np.exp(total_costs / iters)
        
        
        
        
def main():
    train_data, valid_data, test_data, _ = reader.ptb_raw_data(DATA_PATH)
    # ptb_raw_data counts how often each word occurs and replaces every word by its index in the frequency-sorted vocabulary
    # for example, (a, d, a, b, a, c, d) => {a:3, d:2, b:1, c:1} => a is replaced by 0, d by 1, b by 2, c by 3
    #print(train_data[1:200])
    print(tf.__version__)
    train_data_len   = len(train_data)
    train_batch_len  = train_data_len // TRAIN_BATCH_SIZE
    train_epoch_size = (train_batch_len - 1) // TRAIN_NUM_STEP 
    
    #print(train_data_len, train_batch_len, train_epoch_size)
    
    valid_data_len   = len(valid_data)
    valid_batch_len  = valid_data_len // EVAL_BATCH_SIZE
    valid_epoch_size = (valid_batch_len - 1) // EVAL_NUM_STEP 
    
    #print(valid_data_len, valid_batch_len, valid_epoch_size)
    
    test_data_len   = len(test_data)
    test_batch_len  = test_data_len // EVAL_BATCH_SIZE
    test_epoch_size = (test_batch_len - 1) // EVAL_NUM_STEP 
    
    #print(test_data_len, test_batch_len, test_epoch_size)
    
    initializer = tf.random_uniform_initializer(-0.05, 0.05) 
    # use for variables initialization in tf scope.
    #print(initializer)
    
    with tf.variable_scope("language_model", reuse=tf.AUTO_REUSE, initializer=initializer):
        train_model = PTBModel(True, TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)
       
    with tf.variable_scope("language_model", reuse=True, initializer=initializer):
        eval_model  = PTBModel(False, EVAL_BATCH_SIZE, EVAL_NUM_STEP)
    
    
    with tf.Session() as session:
        tf.global_variables_initializer().run()
        
        train_queue = reader.ptb_producer(train_data, train_model.batch_size, train_model.num_steps)
        eval_queue  = reader.ptb_producer(valid_data, eval_model.batch_size,  eval_model.num_steps)
        test_queue  = reader.ptb_producer(test_data,  eval_model.batch_size,  eval_model.num_steps)
        
        #print(train_queue)
    
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=session, coord=coord)
        # start_queue_runners launches threads for every queue runner registered in the graph;
        # each ptb_producer call above registers its own input queue (train, eval, test)
        
        
        for i in range(NUM_EPOCH):
            print("In iteration: %d" % (i+1))
            run_epoch(session, train_model, train_queue, train_model.train_op, True, train_epoch_size)
    
            valid_perplexity = run_epoch(session, eval_model, eval_queue, tf.no_op(), False, valid_epoch_size)
        
            print("Epoch: %d Validation Perplexity: %.3f" % (i+1, valid_perplexity))
    
        
        test_perplexity = run_epoch(session, eval_model, test_queue, tf.no_op(), False, test_epoch_size)
        print("Test Perplexity: %.3f" % test_perplexity)
        
        
        coord.request_stop()   # ask the queue-runner threads to stop
        coord.join(threads)    # wait until all threads have stopped
        
 
if __name__ == "__main__":
    main()



W0818 21:48:29.086984  5472 deprecation_wrapper.py:119] From D:\ai_lab\LSTM\reader.py:31: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.



1.14.0


W0818 21:48:29.748011  5472 deprecation.py:323] From <ipython-input-1-b7f4f743e503>:31: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
W0818 21:48:32.000098  5472 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0818 21:48:32.031680  5472 deprecation.py:323] From <ipython-input-1-b7f4f743e503>:37: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is e

In iteration: 1
After 0 steps, perplexity is 9986.188
After 100 steps, perplexity is 1484.217
After 200 steps, perplexity is 1098.021
After 300 steps, perplexity is 912.162
After 400 steps, perplexity is 796.410
After 500 steps, perplexity is 717.616
After 600 steps, perplexity is 658.244
After 700 steps, perplexity is 607.075
After 800 steps, perplexity is 561.098
After 900 steps, perplexity is 525.234
After 1000 steps, perplexity is 497.539
After 1100 steps, perplexity is 471.414
After 1200 steps, perplexity is 449.789
After 1300 steps, perplexity is 430.604
Epoch: 1 Validation Perplexity: 248.431
In iteration: 2
After 0 steps, perplexity is 356.195
After 100 steps, perplexity is 245.642
After 200 steps, perplexity is 250.828
After 300 steps, perplexity is 251.560
After 400 steps, perplexity is 248.274
After 500 steps, perplexity is 245.453
After 600 steps, perplexity is 244.791
After 700 steps, perplexity is 242.148
After 800 steps, perplexity is 237.386
After 900 steps, perplexity 

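Gradient problems
Now that the overall RNN process is clear, we turn to a classic problem in neural networks: the gradient problem.
Exploding gradients (the gradient values are too large, so tuning the parameters becomes meaningless)
How can you tell whether gradients are exploding?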
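Exploding gradients during training come with some subtle signals, for example:

The model fails to learn from the training data (e.g. the loss stays poor).

The model is unstable, and the loss changes sharply from one update to the next.

The model loss becomes NaN during training.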
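The signals above can be checked automatically during training. The sketch below is only an illustration: the helper name `exploding_gradient_signs` and the thresholds `norm_limit` and `jump_ratio` are hypothetical values, not settings taken from the PTB program, and sensible limits depend on the model.

```python
import math

def exploding_gradient_signs(losses, grad_norms, norm_limit=100.0, jump_ratio=10.0):
    """Return a list of warning strings for a training run.

    losses: loss value recorded after each update
    grad_norms: global gradient norm recorded at each update
    norm_limit, jump_ratio: hypothetical thresholds chosen for illustration
    """
    signs = []
    # Signal: the loss has become NaN (or infinite).
    if any(math.isnan(l) or math.isinf(l) for l in losses):
        signs.append("loss is NaN or inf")
    # Signal: the loss changes sharply from one update to the next.
    for prev, cur in zip(losses, losses[1:]):
        if math.isfinite(prev) and math.isfinite(cur) and prev > 0 and cur > jump_ratio * prev:
            signs.append("loss jumped sharply between updates")
            break
    # Direct check on the size of the gradients themselves.
    if any(n > norm_limit for n in grad_norms):
        signs.append("gradient norm above limit")
    return signs

healthy = exploding_gradient_signs([6.9, 6.1, 5.8], [2.0, 1.8, 1.7])
unstable = exploding_gradient_signs([6.9, 80.0, float("nan")], [2.0, 450.0, 1e6])
print(healthy)
print(unstable)
```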
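Solutions:
Gradient clipping: this approach mainly targets exploding gradients. The idea is to set a clipping threshold; when the gradients are updated, any gradient that exceeds the threshold is forced back into that range. This prevents gradients from exploding.

Weight regularization: the most common forms are L1 and L2 regularization, and every deep learning framework provides an API for them.
Activation functions such as ReLU, Leaky ReLU and ELU.
Note: in practice, vanishing gradients occur more often than exploding gradients in deep neural networks.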
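Gradient clipping, the first fix in the list above, is what the PTB program does with `tf.clip_by_global_norm` and `MAX_GRAD_NORM = 5`. The NumPy sketch below shows the idea of clipping by global norm; the function is a simplified re-implementation written for illustration, not TensorFlow's code, and the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm is already within the limit the scale is 1 and nothing changes.
    scale = max_norm / max(global_norm, max_norm)
    return [g * scale for g in grads], global_norm

# Made-up gradients for two variables; their joint norm is far above the limit.
grads = [np.array([30.0, 40.0]), np.array([0.0, 0.0, 120.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited.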

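Vanishing gradients (the gradient is close to 0, so tuning the parameters has no sense of direction)

As shown earlier, an RNN can carry memory. Take next-character prediction as an example: given “这顿饭真好” ("this meal is really good to..."), the next character is “吃” ("eat", forming 好吃, "tasty"). Clearly the first five characters are enough to guess the next one. With only the single character 好 ("good"), there is no way to tell whether it should become 好吃 ("tasty"), 好玩 ("fun") or something else. Only when the memory carries all five characters of history, and therefore includes 饭 ("meal"), does the model know that 好 should be followed by 吃. This is why an RNN with memory handles text generation well.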

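There is still a problem, however: the history has to be long enough for the network to know what came before, and only a long enough memory can store enough context. Suppose someone chats with you for a whole day, tells you a joke in the morning, and then asks in the afternoon, "Hey, what did the pangolin say?" Could you answer? What if the question came several days later? Obviously not. If the RNN cannot remember the state s from several days ago, its ability to handle this case is effectively zero, because it still does not know the context. In other words, like BP (backpropagation) networks, it suffers from vanishing gradients and has difficulty learning long-term dependencies: the signal keeps decaying as it propagates through time. An input has a large influence on the decision at the first step, a smaller one at the second, smaller again at the third, and after 5 or 6 propagation steps it has essentially no effect on the decision.

When the gradient is computed, the matrices involved contain small values, and multiplying many such matrices makes the gradient shrink exponentially until it vanishes completely after a few steps. The gradients from distant time steps are 0, so the states at those time steps contribute nothing to learning, and the network cannot learn long-range dependencies. The vanishing gradient problem is not specific to RNNs; it also appears in deep feedforward neural networks.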
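The exponential shrinkage from repeated matrix multiplication can be seen in a small NumPy experiment. This is a toy setup under stated assumptions: the recurrent weight matrix `W` is random with small entries, the activation derivative is left out, and the sizes are arbitrary, so it illustrates the mechanism and does not measure the PTB model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 50
# Assumed toy recurrent weights with small entries.
W = rng.uniform(-0.05, 0.05, size=(hidden_size, hidden_size))

grad = np.ones(hidden_size)   # gradient arriving at the last time step
norms = []
for step in range(20):
    # One step backwards through time: multiply by the transposed recurrent weights.
    # The activation derivative is omitted; for tanh it is at most 1 and would only shrink the gradient further.
    grad = W.T @ grad
    norms.append(np.linalg.norm(grad))

for step in (0, 4, 9, 19):
    print("steps back: %2d  gradient norm: %.3e" % (step + 1, norms[step]))
```

With weights this small the norm drops by a roughly constant factor at every step, so inputs more than a handful of steps back receive almost no gradient.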
