A flexible and simple framework for sequence to sequence learning.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
ava_nmt update ava nmt readme Feb 3, 2018
char_nmt ava nmt Dec 9, 2017
docs/figures init Aug 6, 2017
sequencing ava nmt Dec 9, 2017
sequencing_np init Aug 6, 2017
tests init Aug 6, 2017
tools add trained model Feb 3, 2018
.gitignore char nmt Dec 8, 2017
LICENSE init Aug 6, 2017
README.md note GPU Aug 20, 2017
build_inputs.py mix rl and ml Aug 28, 2017
build_model.py char nmt Dec 8, 2017
config_example.py mix rl and ml Aug 28, 2017
nmt_infer.py use argparse Aug 23, 2017
nmt_np.py use config file Aug 21, 2017
nmt_train.py char nmt Dec 8, 2017



Sequencing is a sequence to sequence learning framework based on Tensorflow. The key features of this framework are flexibility and simplicity. That is, keep simple and stay native.


  1. Python >= 3.5 (please forget Python 2)
  2. Tensorflow >= 1.2 (Better to train on GPUs)
  3. numpy >= 1.12

Please run nosetests tests to verify whether it is ready.

How to use

You may refer to the introduction in Chinese.

This code has been tested on neural machine translation (NMT), thus we would use translation as the demo. The typical architecture of NMT consists of three components, that is an encoder, an decoder and attention mechanism. You may refer to Neural Machine Translation by Jointly Learning to Align and Translate for more details.

We would introduce how to build a NMT system from scratch step by step. You may check nmt.py for the implementation.

Prepare the training data and vocabulary

We may need many parallel sentences to train a reasonable NMT system. For English to Chinese translation, we could crawl a lots of data from several websites, for example, youdao. After preparing the training data, we need to construct the vocabulary of each language. For English, we may use BPE or just words. For a toy example, 32K words are enough. For Chinese, we may use Chinese characters as the basic units instead of words for simplicity.

We also provide the pre-processed UN dataset which contains 14590228 parallel sentences. You may download it from Baidu Yun.

After preparing the dataset, we need to build the parallel inputs (English to Chinese) for training.

# load vocab
src_vocab = build_vocab(src_vocab_file, src_embedding_dim, ' ')
trg_vocab = build_vocab(trg_vocab_file, trg_embedding_dim, '')

# load parallel data
parallel_data_generator = \
    build_parallel_inputs(src_vocab, trg_vocab,
                          src_data_file, trg_data_file,
                          batch_size=batch_size, buffer_size=96,


We only implement the RNN encoder in current version. We may add the CNN encoder and self-attention encoder in the future, contribution is welcome.

We may use StackBidirectionalRNNEncoder to encoder sentences,

encoder = sq.StackBidirectionalRNNEncoder(encoder_params, name='stack_rnn', mode=mode)
encoded_representation = encoder.encode(source_embedded, source_seq_length)

source_embedded is the embedded representation of source words, which is generated by a lookup table,

source_embedding_table = sq.LookUpOp(src_vocab.vocab_size,
source_embedded = source_embedding_table(source_ids)

The encoder is illustrated in the following figure.


Attention mechanism

Attention is necessary for NMT. We implement the basic attention mechanism. Of course, it is extremely easy to extend, please check the source code.

attention_keys = encoded_representation.attention_keys
attention_values = encoded_representation.attention_values
attention_length = encoded_representation.attention_length
attention = sq.Attention(query_size, attention_keys, attention_values, attention_length)

The basic attention mechanism is very simple. For example, given a query query (usually the hidden states of the decoder), the attented context could be calculated as,

energies = v * tanh(keys + query)  # add query to each key
scores = softmax(energies)
# multiply each value with corresponding score, then sum up weighted values.
context = sum(scores * values)


The RNN decoder with attention is good enough, and we need a feedback to teach the learning process. For example,

feedback = sq.TrainingFeedBack(target_ids, target_seq_length,
                               trg_vocab, teacher_rate)
# decoder
decoder = sq.AttentionRNNDecoder(decoder_params, attention,
                                 feedback, mode=mode)

We also provide the graphical illustration, but it does not match the code exactly. The code is definitely precise.



We have build the NMT model, next we would like to train this model. Before training, we need to obtain the outputs of the model and define a loss. We use a dynamic_decode to decode from the model, which is similar to seq2seq, however simplified.

decoder_output, decoder_final_state = sq.dynamic_decode(decoder, scope='decoder')

decoder_output contains the logits, which could be used to define the loss,

predict_ids = target_ids
losses = cross_entropy_sequence_loss(

Finally, we may use an optimizer to train our model, such as GradientDescentOptimizer and AdamOptimizer.


For greedy inference, we only need to set teacher_rate to 0 in TrainingFeedBack. We also implement the inference using batched beam. Please check build_model.py for the details.


Training is very fast using the default setting (both the encoder and the decoder are single layer RNNs with 1024 hidden units). It could iterate over 30,000,000 sentences in one days (350 sentences per seconds) on a single Titan Xp. Inferring is also fast, it only takes 1 minute to translate 2000 English sentences.

Sequencing NP !

We also provide a framework that only depends on numpy, however, only for inference. We could load the trained parameters to sequencing_np and infer without using Tensorflow. For example, inferring on Android !

NMT on Android

Why this framework ?

We write this framework for lab mates and for fun. We want to keep the code simple and easy to modify.


  1. documention
  2. implement the CNN encoder and self-attention encoder.
  3. improve sequencing_np and support minpy or cupy.
  4. distributed training.