循环神经网络由于神经网络结构的进步和GPU上深度学习训练效率的突破，RNN变得越来越流行。RNN对时间序列数据非常有效，其每个神经元可通过内部组件保存之前输入的信息。

人每次思考时不会从头开始，而是保留之前思考的一些结果为现在的决策提供支持。例如我们对话时，会根据上下文的信息理解一句话的含义，而不是对每一句话从头进行分析。

例如卷积神经网络虽然可以对图像进行分类，但是可能无法对视频中每一帧图像发生的事情进行关联分析，我们无法利用前一帧图像的信息，而循环神经网络则可以解决这个问题。

RNN最大特点是神经元的某些输出可作为其输入再次传输到神经元中，因此可以利用之前的信息。

RNN虽然被设计成可以处理整个时间序列信息，但是其记忆最深的还是最后输入的一些信号。而更早之前的信号的强度则越来越低，最后只能起到一点辅助的作用，即决定RNN输出的还是最后输入的一些信号。

对于某些简单的问题，可能只需要最后输入的少量时序信息即可解决。但对某些复杂的问题，可能需要更早的一些信息，甚至是时间序列开头的信息，但间隔太远的输入信息，RNN是难以记忆的。因此长程依赖是传统RNN的致命伤。

## LSTM
包含4层神经网络

## 语言模型

语言模型是NLP中非常重要的一个部分，同时也是语音识别、机器翻译和由图片生成标题等任务的基础和关键。**语言模型是一个可以预测语言的概率模型**。给定上文的语境，即历史出现的单词，语言模型可以预测下一个单词出现的频率。

**Penn Tree Bank(PTB)是在语言模型训练中经常使用的一个数据集**，它的质量比较高，可以用来评测语言模型的准确率，同时数据集不大，训练也比较快。参考论文[Recurrent Neural Network Regularization](https://arxiv.org/pdf/1409.2329.pdf)

In [1]:
!git clone https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Counting objects: 18265, done.[K
remote: Compressing objects: 100% (35/35), done.[K
remote: Total 18265 (delta 14), reused 16 (delta 6), pack-reused 18224[K
Receiving objects: 100% (18265/18265), 470.57 MiB | 20.66 MiB/s, done.
Resolving deltas: 100% (10774/10774), done.
Checking out files: 100% (2364/2364), done.


In [0]:
import os
os.chdir('models/tutorials/rnn/ptb')
import time 
import numpy as np
import tensorflow as tf
import reader

In [3]:
#下载PTB数据集，并解压
!wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
!tar xvf simple-examples.tgz

--2018-06-24 08:34:49--  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Resolving www.fit.vutbr.cz (www.fit.vutbr.cz)... 147.229.9.23, 2001:67c:1220:809::93e5:917
Connecting to www.fit.vutbr.cz (www.fit.vutbr.cz)|147.229.9.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34869662 (33M) [application/x-gtar]
Saving to: ‘simple-examples.tgz’


2018-06-24 08:34:59 (3.57 MB/s) - ‘simple-examples.tgz’ saved [34869662/34869662]

./
./simple-examples/
./simple-examples/data/
./simple-examples/data/ptb.test.txt
./simple-examples/data/ptb.train.txt
./simple-examples/data/ptb.valid.txt
./simple-examples/data/README
./simple-examples/data/ptb.char.train.txt
./simple-examples/data/ptb.char.test.txt
./simple-examples/data/ptb.char.valid.txt
./simple-examples/models/
./simple-examples/models/swb.ngram.model
./simple-examples/models/swb.rnn.model
./simple-examples/models/README
./simple-examples/rnnlm-0.2b/
./simple-examples/rnnlm-0.2b/CHANGE.log
./simple-exampl

In [0]:
#处理输入数据的类
class PTBInput(object):
  
  def __init__(self, config, data, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps # LSTM的展开步数
    self.epoch_size = ((len(data)//batch_size)-1) // num_steps
    
    self.input_data, self.targets = reader.ptb_producer(data, batch_size, num_steps, name=name) #获取特征数据input_data,以及label数据targets
    

In [0]:
#定义语言模型的类
class PTBModel(object):
  def __init__(self, is_training, config, input_):
    self._input = input_
    
    batch_size = input_.batch_size
    num_steps = input_.num_steps
    size = config.hidden_size
    vocab_size = config.vocab_size
    
    def lstm_cell():
      return tf.contrib.rnn.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True)
  
    attn_cell = lstm_cell
    if is_training and config.keep_prob < 1:
      def attn_cell():
        return tf.contrib.rnn.DropoutWrapper(lstm_cell(), output_keep_prob=config.keep_prob)
    
    #使用RNN的堆叠函数将前面狗找到额lstm_cell多层堆叠得到cell，堆叠次数为config中的num_layers
    cell = tf.contrib.rnn.MultiRNNCell(
        [attn_cell() for _ in range(config.num_layers)],
        state_is_tuple=True
    )
  
    self._initial_state = cell.zero_state(batch_size, tf.float32)
    
    #创建网络的词嵌入embedding部分，embedding即为将one-hot的编码格式的单词转化为向量表达形式。
    with tf.device('/cpu:0'):
      embedding = tf.get_variable('embedding',[vocab_size, size], dtype=tf.float32)
      inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
      
    if is_training and config.keep_prob<1:
      inputs = tf.nn.dropout(inputs, config.keep_prob)
    
    #定义输出
    outputs = []
    state = self._initial_state
    #使用variable_scope将接下来的操作的名称设为RNN
    with tf.variable_scope('RNN'):
      for time_step in range(num_steps):
        if time_step>0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
    
    #将output的内容用tf.concat串接起来，用reshape将其转为一个很长的一维向量。
    output = tf.reshape(tf.concat(outputs,1),[-1,size])
    #softmax层，先定义权重softmax_w和偏置softmax_b，然后使用tf.matmul将输出output乘上权重并加上偏置得到logits，即网络最后的输出。
    softmax_w = tf.get_variable('softmax_w', [size, vocab_size], dtype=tf.float32)
    softmax_b = tf.get_variable('softmax_b', [vocab_size], dtype=tf.float32)
    
    #得到网络的最后输出
    logits = tf.matmul(output, softmax_w)+softmax_b
    
    #定义loss
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
        [logits],
        [tf.reshape(input_.targets,[-1])],
        [tf.ones([batch_size*num_steps], dtype=tf.float32)]
    )
    self._cost = cost = tf.reduce_sum(loss)/batch_size
    self._final_state = state
    
    if not is_training:
      return
    
    #定义学习速率的变量_lr,并将其设为不可训练
    self._lr = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),config.max_grad_norm)
    
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    self._train_op = optimizer.apply_gradients(zip(grads, tvars),
                                              global_step=tf.contrib.framework.get_or_create_global_step())
    
    #设置_new_lr用以控制学习速率，同时定义操作_lr_update
    self._new_lr = tf.placeholder(tf.float32, shape=[], name='new_learning_rate')
    
    #定义_lr_update，使用tf.assign将_new_lr的值赋给当前的学习速率_lr
    self._lr_update = tf.assign(self._lr, self._new_lr)
    
  #定义assign_lr的函数，用来在外部控制模型的学习速率，方式是将学习速率值传入_new_lr这个placeholder,并执行update_lr操作完成对学习速率的修改
  def assign_lr(self, session, lr_value):
    session.run(self._lr_update, feed_dict={self._new_lr: lr_value})
  
  
  #定义PTBModel类的一些property，python中的@property装饰器可以将返回变量设为只读，防止修改变量引发的问题
  @property
  def input(self):
    return self._input
  
  @property
  def initial_state(self):
    return self._initial_state
  
  @property
  def cost(self):
    return self._cost
  
  @property
  def final_state(self):
    return self._final_state
  
  @property
  def lr(self):
    return self._lr
  
  @property
  def train_op(self):
    return self._train_op

In [0]:
#定义几种不同大小的模型的参数
#首先是小模型的设置
class SmallConfig(object):
  init_scale = 0.1 #网络中权重值的初始scale
  learning_rate = 1.0 #学习速率的初始值
  max_grad_norm = 5 #梯度的最大范数
  num_layers = 2 #LSTM可以堆叠的层数
  num_steps = 20 #LSTM梯度反向传播的展开步数
  hidden_size = 200 #LSTM内隐含节点数
  max_epoch = 4 #初始学习速率可训练的epoch数
  max_max_epoch = 13 #总共可训练的epoch数
  keep_prob = 1.0 #dropout层保留节点的比例
  lr_decay = 0.5 #学习速率衰减速度
  batch_size = 20#每个batch中样本的数量
  vocab_size = 10000

In [0]:
#MediumConfig中型模型
class MediumConfig(object):
  init_scale = 0.05 #减小了init_state,即希望权重初值不要过大，小一些有利于温和的训练
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 2
  num_steps = 35 #将梯度反向传播的展开步数从20提升到35
  hidden_size = 650 #增大约3倍
  max_epoch = 6
  max_max_epoch = 39 #增大到3倍
  keep_prob = 0.5 #设置为0.5
  lr_decay = 0.8 #衰减速率增大
  batch_size = 20
  vocab_size = 10000

In [0]:
#LargeConfig大型模型
class LargeConfig(object):
  init_scale = 0.04 #进一步缩小了init_scale
  learning_rate = 1.0
  max_grad_norm = 10
  num_layers = 2
  num_steps = 35
  hidden_size = 1500
  max_epoch = 14
  max_max_epoch = 55
  heep_prob = 0.35
  lr_decay = 1/1.15
  batch_size = 20
  vocab_size = 10000

In [0]:
#测试用，参数都尽量使用最小值
class TestConfig(object):
  init_scale = 0.1
  learning_rate = 1.0
  max_grad_norm = 1
  num_layers = 1
  num_steps = 2
  hidden_size = 2
  max_epoch = 1
  max_max_epoch = 1
  keep_prob = 1.0
  lr_decay = 0.5
  batch_size = 20
  vocab_size =10000

In [0]:
def run_epoch(session, model, eval_op=None, verbose=False):
  start_time = time.time()
  costs = 0.0
  iters = 0
  state = session.run(model.initial_state)
  
  fetches = {
      'cost':model.cost,
      'final_state':model.final_state,
  }
  
  if eval_op is not None:
    fetches['eval_op'] = eval_op
    
  for step in range(model.input.epoch_size):
    feed_dict = {}
    for i, (c,h) in enumerate(model.initial_state):
      feed_dict[c] = state[i].c
      feed_dict[h] = state[i].h
      
    vals = session.run(fetches, feed_dict)
    cost = vals['cost']
    state = vals['final_state']
    
    costs += cost
    iters += model.input.num_steps
    
    if verbose and step % (model.input.epoch_size // 10) == 10:
      print('%.3f perplexity: %.3f speed: %.0f wps' %
            (step*1.0/model.input.epoch_size, np.exp(costs/iters),
            iters*model.input.batch_size/(time.time() - start_time))
           )
      
  return np.exp(costs/iters)

In [0]:
raw_data = reader.ptb_raw_data('simple-examples/data/')
train_data, valid_data, test_data, _ = raw_data

config = SmallConfig()
eval_config = SmallConfig()
eval_config.batch_size = 1
eval_config.num_steps = 1

In [12]:
#创建默认的graph
with tf.Graph().as_default():
  initializer = tf.random_uniform_initializer(-config.init_scale, config.init_scale)
  
  with tf.name_scope('Train'):
    train_input = PTBInput(config=config, data=train_data, name='TrainInput')
    with tf.variable_scope('Model', reuse=None, initializer=initializer):
      m=PTBModel(is_training=True, config=config, input_=train_input)
      
  with tf.name_scope('Valid'):
    valid_input = PTBInput(config=config, data=valid_data, name='ValidInput')
    
    with tf.variable_scope('Model', reuse=True, initializer=initializer):
      mvalid = PTBModel(is_training=False, config=config,input_=valid_input)
      
  with tf.name_scope('Test'):
    test_input = PTBInput(config=eval_config, data=test_data, name='TestInput')
    with tf.variable_scope('Model', reuse=True, initializer=initializer):
      mtest = PTBModel(is_training=False, config=eval_config, input_=test_input)
      
  #创建训练的管理器
  sv = tf.train.Supervisor()
  with sv.managed_session() as session: #创建默认的session
    for i in range(config.max_max_epoch):
      lr_decay = config.lr_decay ** max(i+1-config.max_max_epoch, 0.0)
      m.assign_lr(session, config.learning_rate * lr_decay)
      
      print('Epoch: %d Learning rate: %.3f' % (i+1, session.run(m.lr)))
      train_perplexity = run_epoch(session, m, eval_op=m.train_op,verbose=True)
      print('Epoch: %d Train Perplexity: %.3f' %(i+1, train_perplexity))
      
      valid_perplexity = run_epoch(session, mvalid)
      print('Epoch: %d Valid Perplexity: %.3f' % (i+1, valid_perplexity))
      
    test_perplexity = run_epoch(session, mtest)
    print('Test Perplexity: %.3f' % test_perplexity)

Instructions for updating:
Please switch to tf.train.get_or_create_global_step
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch: 1 Learning rate: 1.000
0.004 perplexity: 5670.663 speed: 5150 wps
0.104 perplexity: 842.954 speed: 8339 wps
0.204 perplexity: 628.404 speed: 8577 wps
0.304 perplexity: 509.000 speed: 8666 wps
0.404 perplexity: 438.024 speed: 8715 wps
0.504 perplexity: 391.639 speed: 8747 wps
0.604 perplexity: 352.292 speed: 8766 wps
0.703 perplexity: 325.380 speed: 8780 wps
0.803 perplexity: 304.146 speed: 8792 wps
0.903 perplexity: 284.757 speed: 8809 wps
Epoch: 1 Train Perplexity: 270.245
Epoch: 1 Valid Perplexity: 177.782
Epoch: 2 Learning rate: 1.000
0.004 perplexity: 212.162 speed: 9212 wps
0.104 perplexity: 151.099 speed: 9169 wps
0.204 perplexity: 158.688 speed: 9182 wps

0.204 perplexity: 81.648 speed: 8571 wps
0.304 perplexity: 79.890 speed: 8621 wps
0.404 perplexity: 79.799 speed: 8654 wps
0.504 perplexity: 79.700 speed: 8674 wps
0.604 perplexity: 78.205 speed: 8683 wps
0.703 perplexity: 78.167 speed: 8694 wps
0.803 perplexity: 78.122 speed: 8717 wps
0.903 perplexity: 76.840 speed: 8733 wps
Epoch: 5 Train Perplexity: 76.568
Epoch: 5 Valid Perplexity: 126.765
Epoch: 6 Learning rate: 1.000
0.004 perplexity: 90.097 speed: 9134 wps
0.104 perplexity: 66.540 speed: 9168 wps
0.204 perplexity: 73.419 speed: 9177 wps
0.304 perplexity: 71.967 speed: 9179 wps
0.404 perplexity: 71.993 speed: 9174 wps
0.504 perplexity: 71.993 speed: 9168 wps
0.604 perplexity: 70.774 speed: 9169 wps
0.703 perplexity: 70.886 speed: 9179 wps
0.803 perplexity: 70.965 speed: 9179 wps
0.903 perplexity: 69.832 speed: 9177 wps
Epoch: 6 Train Perplexity: 69.676
Epoch: 6 Valid Perplexity: 127.181
Epoch: 7 Learning rate: 1.000
0.004 perplexity: 81.584 speed: 9103 wps
0.104 perplexity: 61.53

Epoch: 10 Train Perplexity: 55.455
Epoch: 10 Valid Perplexity: 132.874
Epoch: 11 Learning rate: 1.000
0.004 perplexity: 63.758 speed: 9262 wps
0.104 perplexity: 50.051 speed: 9266 wps
0.204 perplexity: 55.026 speed: 9273 wps
0.304 perplexity: 54.080 speed: 9289 wps
0.404 perplexity: 54.191 speed: 9283 wps
0.504 perplexity: 54.388 speed: 9281 wps
0.604 perplexity: 53.735 speed: 9281 wps
0.703 perplexity: 53.942 speed: 9280 wps
0.803 perplexity: 54.119 speed: 9284 wps
0.903 perplexity: 53.465 speed: 9284 wps
Epoch: 11 Train Perplexity: 53.406
Epoch: 11 Valid Perplexity: 134.403
Epoch: 12 Learning rate: 1.000
0.004 perplexity: 62.344 speed: 9291 wps
0.104 perplexity: 48.367 speed: 9282 wps
0.204 perplexity: 53.064 speed: 9284 wps
0.304 perplexity: 52.087 speed: 9279 wps
0.404 perplexity: 52.298 speed: 9276 wps
0.504 perplexity: 52.516 speed: 9276 wps
0.604 perplexity: 51.880 speed: 9276 wps
0.703 perplexity: 52.091 speed: 9278 wps
0.803 perplexity: 52.276 speed: 9283 wps
0.903 perplexity:

在本节实现了一个基于LSTM的语言模型，LSTM在处理文本等时序数据时，LSTM可以存储状态，并依靠状态对当前的输入进行处理分析和预测。RNN和LSTM赋予了神经网络记录和存储过往信息的能力，可以模仿人类的一些简单的记忆和推理功能。

## 注意力机制
目前，注意力机制是RNN和NLP领域研究的热点。这种机制让机器可以更好的模拟人脑的功能。在图像标题生成任务中，包含注意力机制的RNN可以对某一区域的图像进行分析，并生成对应的文字描述。

可阅读论文[Show,Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044.pdf)