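A reasonably complete model is usually split into three pieces of code: a data-processing module (data_utils), a runner (translate), and a model class (seq2seq_model). Each of them has to cover at least its own basic tasks: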

In [5]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random
import os
import gzip
import urllib
import time
import logging
import sys # used for local input testing
import math

import numpy as np
import tensorflow as tf

from urllib.request import urlopen
from six.moves import xrange # lazy xrange (plain range on Python 3); better suited to long sequences than a list-building range

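The data-processing module must implement the following:
- Fetch the data, scraping it from the web with urllib or scrapy if necessary
- Process the data: classify it, split it into lines, remove noise, and so on
- Prepare the data: get everything the model needs ready, so that it can be used as soon as it is opened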
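
As a rough illustration of the "process" and "prepare" steps, the sketch below tokenizes sentences, builds a frequency-sorted vocabulary, and converts a sentence into token ids. The names `basic_tokenizer`, `build_vocab`, `sentence_to_ids`, and the special-token ids are assumptions made for this example; they are not taken from the actual data_utils module.

```python
import re
from collections import Counter

# Hypothetical special tokens; each id is the token's position in this list.
START_VOCAB = ["_PAD", "_GO", "_EOS", "_UNK"]
PAD_ID, GO_ID, EOS_ID, UNK_ID = 0, 1, 2, 3

# Punctuation marks that are split off into their own tokens.
_WORD_SPLIT = re.compile(r"([.,!?;:()])")


def basic_tokenizer(sentence):
    """Split on whitespace, then split punctuation off into separate tokens."""
    tokens = []
    for fragment in sentence.strip().split():
        tokens.extend(t for t in _WORD_SPLIT.split(fragment) if t)
    return tokens


def build_vocab(sentences, max_size):
    """Map the most frequent tokens to ids, after the special tokens."""
    counts = Counter(tok for s in sentences for tok in basic_tokenizer(s.lower()))
    # Sort by descending frequency, breaking ties alphabetically.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    vocab_list = (START_VOCAB + words)[:max_size]
    return {word: idx for idx, word in enumerate(vocab_list)}


def sentence_to_ids(sentence, vocab):
    """Replace every token by its id; unknown tokens map to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in basic_tokenizer(sentence.lower())]


corpus = ["The cat sat.", "The dog sat!"]
vocab = build_vocab(corpus, max_size=8)
ids = sentence_to_ids("The bird sat.", vocab)
print(ids)
```

Words that fall outside the size limit (here "cat", "dog", and the unseen "bird") all map to the unknown-token id, which is how a fixed vocabulary size such as 40000 keeps the model's embedding tables bounded.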


In [8]:
def data_download(filename, url):
    if not os.path.exists(filename):
        print("Downloading %s to %s" %(url, filename))
        filename, _ = urllib.request.urlretrieve(url, filename)
        statinfo = os.stat(filename)
        print("Successfully downloaded", filename, statinfo.st_size, "bytes")
    return filename

In [10]:
url = "http://www.statmt.org/wmt10/training-giga-fren.tar"
filename = "training-giga-fren.tar"
data_download(filename, url)

Downloading http://www.statmt.org/wmt10/training-giga-fren.tar to training-giga-fren.tar


KeyboardInterrupt: 

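The runner must handle the following tasks:
- Read the data
- Build the model
- Train the model: the training itself, plus saving and restoring checkpoints
- Test the model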
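
The "train, record, and restore checkpoint" task mostly comes down to some bookkeeping around the training step. The sketch below is a minimal, framework-free illustration: `fake_step` stands in for a real training step, and the rule of decaying the learning rate when the loss has not improved over the last three checkpoints is an assumption made for this example rather than something stated in this notebook.

```python
import math


def run_training(step_fn, num_steps, steps_per_checkpoint, learning_rate, decay_factor):
    """Average the loss per checkpoint window and decay the learning rate on plateaus."""
    window_loss = 0.0
    previous_losses = []
    history = []  # one (step, average loss, perplexity, learning rate) tuple per checkpoint
    for step in range(1, num_steps + 1):
        window_loss += step_fn(step) / steps_per_checkpoint
        if step % steps_per_checkpoint == 0:
            # Guard against overflow in exp() for very large losses.
            perplexity = math.exp(window_loss) if window_loss < 300 else float("inf")
            # Assumed rule: decay if the loss is worse than the last 3 checkpoints.
            if len(previous_losses) > 2 and window_loss > max(previous_losses[-3:]):
                learning_rate *= decay_factor
            previous_losses.append(window_loss)
            history.append((step, window_loss, perplexity, learning_rate))
            # A real runner would save a checkpoint at this point.
            window_loss = 0.0
    return history


def fake_step(step):
    """Hypothetical loss curve: falls for 8 steps, then rises again."""
    return 2.0 - 0.1 * step if step <= 8 else 1.2 + 0.1 * (step - 8)


history = run_training(fake_step, num_steps=16, steps_per_checkpoint=2,
                       learning_rate=0.5, decay_factor=0.99)
for step, loss, perplexity, lr in history:
    print("step %d loss %.2f perplexity %.2f learning rate %.4f" % (step, loss, perplexity, lr))
```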

Set all the parameters that need to be defined in advance

In [6]:
tf.app.flags.DEFINE_float("learning_rate", 0.5, "Learning rate")
tf.app.flags.DEFINE_float("learning_rate_decay_factor", 0.99, "Learning rate decays by this much")
tf.app.flags.DEFINE_float("max_gradient_norm", 5.0, "Clip gradients to this norm")
tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size to use during the training")
tf.app.flags.DEFINE_integer("size", 1024, "Size of each model layer")
tf.app.flags.DEFINE_integer("num_layers", 3, "Number of layers in the model")
tf.app.flags.DEFINE_integer("from_vocab_size", 40000, "English vocabulary size")
tf.app.flags.DEFINE_integer("to_vocab_size", 40000, "French vocabulary size")
tf.app.flags.DEFINE_string("data_dir", "./tmp", "Data directory")
tf.app.flags.DEFINE_string("train_dir", "./tmp", "Training directory")
tf.app.flags.DEFINE_string("from_train_data", None, "Training data")
tf.app.flags.DEFINE_string("to_train_data", None, "Training data")
tf.app.flags.DEFINE_string("from_dev_data", None, "Development data")
tf.app.flags.DEFINE_string("to_dev_data", None, "Development data")
tf.app.flags.DEFINE_integer("max_train_data_size", 0, "Limit on the size of training data (0: no limit)")
tf.app.flags.DEFINE_integer("steps_per_checkpoint", 200, "How many training steps to do per checkpoint")
tf.app.flags.DEFINE_boolean("decode", False, "Set to True for interactive decoding")
tf.app.flags.DEFINE_boolean("self_test", False, "Run a self-test if this is set to True")
tf.app.flags.DEFINE_boolean("use_fp16", False, "Train using fp16 instead of fp32")
FLAGS = tf.app.flags.FLAGS

ArgumentError: argument --learning_rate: conflicting option string: --learning_rate

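The duplicate-name error above appears because the cell was run more than once in the same session, so every flag was defined a second time.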
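
The wording of this error matches Python's argparse, which older tf.app.flags releases appear to wrap (that link is an assumption here). The sketch below reproduces the same conflict with plain argparse and shows one way to make a definition cell safe to re-run. `define_once` is a hypothetical helper written for this example, not a TensorFlow API.

```python
import argparse

parser = argparse.ArgumentParser()

# Defining the same option twice reproduces the conflict.
parser.add_argument("--learning_rate", type=float, default=0.5)
try:
    parser.add_argument("--learning_rate", type=float, default=0.5)
    error_message = None
except argparse.ArgumentError as err:
    error_message = str(err)
print(error_message)


def define_once(parser, name, default, help_text):
    """Add --name unless it is already defined; return True if it was added."""
    try:
        parser.add_argument("--" + name, type=type(default), default=default, help=help_text)
        return True
    except argparse.ArgumentError:
        # The option already exists, so keep the first definition.
        return False


first = define_once(parser, "batch_size", 64, "Batch size to use during training")
second = define_once(parser, "batch_size", 64, "Batch size to use during training")
args = parser.parse_args([])
print(first, second, args.batch_size)
```

In a notebook, restarting the kernel before re-running the cell also clears the previously defined flags.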

In [10]:
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
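# Buckets sort sentence pairs by length, so that pairs of similar length land in the same group
# Because the model pads every sequence up to its bucket's length, this cuts wasted padding and improves efficiency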

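The model class should at least do the following:
- Build the model: define the cell and the topology
- Set up the feeds (the input placeholders)
- Provide a single-step training function that returns the loss and accuracy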
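
The training update in the class below clips gradients with tf.clip_by_global_norm before applying them. As a plain-Python sketch of that rule (the nested-list representation and the sample values are assumptions for illustration): every gradient is scaled by max_norm / max(global_norm, max_norm), where global_norm is the L2 norm taken over all gradient entries together.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients jointly so that their global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # The scale is 1.0 when the norm is already within the limit.
    scale = max_norm / max(global_norm, max_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


grads = [[3.0, 4.0], [12.0]]  # one list per parameter tensor
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm, clipped_norm)
```

Because one common scale is applied to every tensor, the direction of the overall update is preserved and only its length is limited.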

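The following three papers describe this model in detail:
- http://arxiv.org/abs/1412.7449 describes the basic architecture of the model
- http://arxiv.org/abs/1409.0473 describes the architecture with a single-layer, bidirectional encoder
- http://arxiv.org/abs/1412.2007 describes sampled softmax in detail in its section 3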
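
To give a feel for sampled softmax, the sketch below compares a full softmax loss with one computed over the true class plus a few sampled negatives. It is a simplified illustration under stated assumptions, not the method of the paper or of tf.nn.sampled_softmax_loss: negatives are drawn uniformly without replacement, in which case the usual correction for sampling probability is the same constant for every candidate and cancels out. All function names and values are made up for the example.

```python
import math
import random


def cross_entropy(logits, true_index):
    """Softmax cross-entropy for one example, computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[true_index]


def logit(hidden, weights, biases, cls):
    """Score of one class: dot(weights[cls], hidden) + biases[cls]."""
    return sum(w * h for w, h in zip(weights[cls], hidden)) + biases[cls]


def full_softmax_loss(hidden, true_class, weights, biases):
    logits = [logit(hidden, weights, biases, c) for c in range(len(weights))]
    return cross_entropy(logits, true_class)


def sampled_softmax_loss(hidden, true_class, weights, biases, num_sampled, rng):
    # Assumption: uniform negatives, never equal to the true class.
    candidates = [c for c in range(len(weights)) if c != true_class]
    classes = [true_class] + rng.sample(candidates, num_sampled)
    logits = [logit(hidden, weights, biases, c) for c in classes]
    return cross_entropy(logits, 0)  # the true class sits at index 0


rng = random.Random(0)
num_classes, size = 6, 3
weights = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(num_classes)]
biases = [0.0] * num_classes
hidden = [0.5, -0.2, 0.1]

full = full_softmax_loss(hidden, 2, weights, biases)
approx = sampled_softmax_loss(hidden, 2, weights, biases, num_sampled=2, rng=rng)
exact = sampled_softmax_loss(hidden, 2, weights, biases, num_sampled=5, rng=rng)
print(full, approx, exact)
```

The cost of the sampled version grows with num_sampled instead of the vocabulary size, which is the point of using it with a 40000-word vocabulary. When every other class is sampled, it reduces to the full softmax.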

In [12]:
class Seq2SeqModel(object):
    def __init__(self,
                source_vocab_size,
                target_vocab_size,
                buckets,
                size,
                num_layers,
                max_gradient_norm,
                batch_size,
                learning_rate,
                learning_rate_decay_factor,
                use_lstm= False,
                num_samples = 512,
                forward_only = False,
                dtype = tf.float32):
        self.source_vocab_size = source_vocab_size
        self.target_vocab_size = target_vocab_size
        self.buckets = buckets
        self.batch_size = batch_size
        self.learning_rate = tf.Variable(float(learning_rate),
                                        trainable = False,
                                        dtype = dtype)
        self.learning_rate_decay_op = self.learning_rate.assign(
            self.learning_rate * learning_rate_decay_factor)
        self.global_step = tf.Variable(0, trainable = False)
        # Sampled softmax needs an output projection
        output_projection = None
        softmax_loss_function = None
        # Sampled softmax only makes sense if we sample fewer classes than the vocabulary size
        if num_samples > 0 and num_samples < self.target_vocab_size:
            w_t = tf.get_variable("proj_w", [self.target_vocab_size, size], dtype = dtype)
            w = tf.transpose(w_t)
            b = tf.get_variable("proj_b", [self.target_vocab_size], dtype = dtype)
            output_projection = (w, b)
            def sampled_loss(labels, logits):
                # sampled_softmax_loss expects labels of shape [batch_size, 1]
                labels = tf.reshape(labels, [-1, 1])
                # Compute the sampled softmax loss in 32-bit floats to avoid numerical instabilities
                local_w_t = tf.cast(w_t, tf.float32)
                local_b = tf.cast(b, tf.float32)
                local_inputs = tf.cast(logits, tf.float32)
                return tf.cast(tf.nn.sampled_softmax_loss(weights = local_w_t,
                                                         biases = local_b,
                                                         labels = labels,
                                                         inputs = local_inputs,
                                                         num_sampled = num_samples,
                                                         num_classes = self.target_vocab_size), dtype)
            softmax_loss_function = sampled_loss
        # Build the cell: each single_cell() is a whole recurrent layer of `size` units, not a single node
        def single_cell():
            return tf.contrib.rnn.GRUCell(size)
        if use_lstm:
            def single_cell():
                return tf.contrib.rnn.BasicLSTMCell(size)
        cell = single_cell()
        if num_layers > 1:
            cell = tf.contrib.rnn.MultiRNNCell([single_cell() for _ in range(num_layers)])
        # Build the seq2seq model
        def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
            return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(encoder_inputs,
                                                                        decoder_inputs,
                                                                        cell,
                                                                        num_encoder_symbols = source_vocab_size,
                                                                        num_decoder_symbols = target_vocab_size,
                                                                        embedding_size = size,
                                                                        output_projection = output_projection,
                                                                        feed_previous = do_decode, 
                                                                        dtype = dtype)
        # Set up the feeds (input placeholders)
        self.encoder_inputs = []
        self.decoder_inputs = []
        self.target_weights = []
        for i in xrange(buckets[-1][0]):
            self.encoder_inputs.append(tf.placeholder(tf.int32,
                                                     shape = [None],
                                                     name = "encoder{0}".format(i)))
        for i in xrange(buckets[-1][1] + 1):
            self.decoder_inputs.append(tf.placeholder(tf.int32,
                                                     shape = [None],
                                                     name = "decoder{0}".format(i)))
            self.target_weights.append(tf.placeholder(dtype,
                                                    shape = [None],
                                                    name = "weight{0}".format(i)))
        # Note: our targets are the decoder inputs shifted by one
        targets = [self.decoder_inputs[i + 1] for i in xrange(len(self.decoder_inputs) - 1)]
        # Set up the outputs and losses
        if forward_only:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(self.encoder_inputs,
                                                                                  self.decoder_inputs,
                                                                                  targets,
                                                                                  self.target_weights,
                                                                                  buckets,
                                                                                  lambda x, y: seq2seq_f(x, y, True),
                                                                                  softmax_loss_function = softmax_loss_function)
            # If we use output projection, we need to project outputs for decoding
            if output_projection is not None:
                for b in xrange(len(buckets)):
                    self.outputs[b] = [
                        tf.matmul(output, output_projection[0]) + output_projection[1]
                        for output in self.outputs[b]
                    ]
        else:
            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(self.encoder_inputs,
                                                                                   self.decoder_inputs,
                                                                                   targets,
                                                                                   self.target_weights,
                                                                                   buckets,
                                                                                   lambda x, y: seq2seq_f(x, y, False),
                                                                                   softmax_loss_function = softmax_loss_function)
        # Set up the parameter updates for training
        params = tf.trainable_variables()
        # For each bucket: compute the gradients of its loss, clip them by global norm, and apply an SGD update
        if not forward_only:
            self.gradient_norms = []
            self.updates = []
            opt = tf.train.GradientDescentOptimizer(self.learning_rate)
            for b in xrange(len(buckets)):
                gradients = tf.gradients(self.losses[b], params)
                clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
                self.gradient_norms.append(norm)
                self.updates.append(opt.apply_gradients(zip(clipped_gradients, params), global_step = self.global_step))

        self.saver = tf.train.Saver(tf.global_variables())
    
    def step(self, session, encoder_inputs, decoder_inputs, target_weights, bucket_id, forward_only):
        # Check that the input sizes match the bucket
        encoder_size, decoder_size = self.buckets[bucket_id]
        if len(encoder_inputs) != encoder_size:
            raise ValueError("Encoder length must be equal to the one in bucket")
        if len(decoder_inputs) != decoder_size:
            raise ValueError("Decoder length must be equal to the one in bucket")
        if len(target_weights) != decoder_size:
            raise ValueError("Weight length must be equal to the one in bucket")
        
        input_feed = {}
        for l in xrange(encoder_size):
            input_feed[self.encoder_inputs[l].name] = encoder_inputs[l]
        for l in xrange(decoder_size):
            input_feed[self.decoder_inputs[l].name] = decoder_inputs[l]
            input_feed[self.target_weights[l].name] = target_weights[l]
        last_target = self.decoder_inputs[decoder_size].name
        input_feed[last_target] = np.zeros([self.batch_size], dtype = np.int32)
        
        if not forward_only:
            output_feed = [self.updates[bucket_id],
                          self.gradient_norms[bucket_id],
                          self.losses[bucket_id]]
        else:
            output_feed = [self.losses[bucket_id]]
            for l in xrange(decoder_size):
                output_feed.append(self.outputs[bucket_id][l])
        
        outputs = session.run(output_feed, input_feed) # run the ops listed in output_feed, feeding the placeholders from input_feed
        if not forward_only:
            return outputs[1], outputs[2], None # Gradient norm, loss, no outputs
        else:
            return None, outputs[0], outputs[1:] # No gradient norm, loss, outputs
    
    def get_batch(self, data, bucket_id):
        encoder_size, decoder_size = self.buckets[bucket_id]
        encoder_inputs, decoder_inputs = [], []
        for _ in xrange(self.batch_size):
            encoder_input, decoder_input = random.choice(data[bucket_id])
            
            # Encoder inputs are padded and then reversed
            # (PAD_ID and GO_ID are not defined in this notebook; they are assumed to come from data_utils)
            encoder_pad = [PAD_ID] * (encoder_size - len(encoder_input))
            encoder_inputs.append(list(reversed(encoder_input + encoder_pad)))
            # Decoder inputs get an extra "GO" symbol, and are padded then
            decoder_pad_size = decoder_size - len(decoder_input) - 1
            decoder_inputs.append([GO_ID] + decoder_input + [PAD_ID] * decoder_pad_size)
        # Now we create the batch-major vectors from the data selected above
        batch_encoder_inputs, batch_decoder_inputs, batch_weights = [], [], []
        # Batch encoder inputs are just re-indexed encoder_inputs
        for length_idx in xrange(encoder_size):
            batch_encoder_inputs.append(np.array([encoder_inputs[batch_idx][length_idx]
                                                 for batch_idx in xrange(self.batch_size)], dtype = np.int32))
        # Batch decoder inputs are re-indexed decoder_inputs; we also create the weights
        for length_idx in xrange(decoder_size):
            batch_decoder_inputs.append(np.array([decoder_inputs[batch_idx][length_idx]
                                                 for batch_idx in xrange(self.batch_size)], dtype = np.int32))
            # Create target_weights to be 0 for targets that are padding
            batch_weight = np.ones(self.batch_size, dtype = np.float32)
            for batch_idx in xrange(self.batch_size):
                # The corresponding target is decoder_input shifted by 1 forward
                # We set the weight to 0 if the corresponding target is a PAD symbol
                if length_idx < decoder_size - 1:
                    target = decoder_inputs[batch_idx][length_idx + 1]
                if length_idx == decoder_size - 1 or target == PAD_ID:
                    batch_weight[batch_idx] = 0.0
            batch_weights.append(batch_weight)
        return batch_encoder_inputs, batch_decoder_inputs, batch_weights
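
The padding, reversal, and weighting logic of get_batch is easier to follow on a tiny example. The sketch below redoes it with plain lists for a fixed pair of sentences instead of random sampling. `make_batch`, the id values `PAD_ID = 0` and `GO_ID = 1`, and the sample token ids are assumptions made for illustration.

```python
PAD_ID, GO_ID = 0, 1  # assumed ids; the notebook itself never defines them


def make_batch(pairs, encoder_size, decoder_size):
    """Pad, reverse, and re-index (source, target) id lists into per-step lists."""
    encoder_inputs, decoder_inputs = [], []
    for source, target in pairs:
        # Encoder inputs are padded and then reversed.
        encoder_pad = [PAD_ID] * (encoder_size - len(source))
        encoder_inputs.append(list(reversed(source + encoder_pad)))
        # Decoder inputs get a leading GO symbol and are then padded.
        decoder_pad = [PAD_ID] * (decoder_size - len(target) - 1)
        decoder_inputs.append([GO_ID] + target + decoder_pad)
    # Re-index: one list per time step, holding that step's token for every example.
    batch_encoder = [[seq[t] for seq in encoder_inputs] for t in range(encoder_size)]
    batch_decoder = [[seq[t] for seq in decoder_inputs] for t in range(decoder_size)]
    batch_weights = []
    for t in range(decoder_size):
        step_weights = []
        for seq in decoder_inputs:
            # The target at step t is the decoder input at step t + 1.
            is_last = t == decoder_size - 1
            target = None if is_last else seq[t + 1]
            step_weights.append(0.0 if is_last or target == PAD_ID else 1.0)
        batch_weights.append(step_weights)
    return batch_encoder, batch_decoder, batch_weights


pairs = [([4, 5, 6], [7, 8, 2]), ([9], [10, 2])]
enc, dec, weights = make_batch(pairs, encoder_size=4, decoder_size=5)
print(enc)
print(dec)
print(weights)
```

The weights are what keep padding from contributing to the loss: the second example's targets run out after two steps, so its weight drops to 0.0 one step earlier than the first example's.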