diff --git a/conv_seq2seq/README.md b/conv_seq2seq/README.md
index 920c664562..5b22c2c17e 100644
--- a/conv_seq2seq/README.md
+++ b/conv_seq2seq/README.md
@@ -1,3 +1,7 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
 # Convolutional Sequence to Sequence Learning
 
 This model implements the work in the following paper:
diff --git a/ctr/README.cn.md b/ctr/README.cn.md
index a4cb6d1714..d717264c46 100644
--- a/ctr/README.cn.md
+++ b/ctr/README.cn.md
@@ -1,3 +1,7 @@
+运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求，请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
+
+---
+
 # 点击率预估
 
 以下是本例目录包含的文件以及对应说明:
diff --git a/ctr/README.md b/ctr/README.md
index 6f11ac6073..9ace483be6 100644
--- a/ctr/README.md
+++ b/ctr/README.md
@@ -1,3 +1,7 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
 # Click-Through Rate Prediction
 
 ## Introduction
diff --git a/deep_fm/README.md b/deep_fm/README.md
index aa63170c92..6e2c6fad38 100644
--- a/deep_fm/README.md
+++ b/deep_fm/README.md
@@ -1,3 +1,7 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
 # Deep Factorization Machine for Click-Through Rate prediction
 
 ## Introduction
diff --git a/dssm/README.cn.md b/dssm/README.cn.md
index 4a80c87673..140446ad2e 100644
--- a/dssm/README.cn.md
+++ b/dssm/README.cn.md
@@ -1,3 +1,7 @@
+运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此版本要求，请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
+
+---
+
 # 深度结构化语义模型 (Deep Structured Semantic Models, DSSM)
 
 DSSM使用DNN模型在一个连续的语义空间中学习文本低维的表示向量，并且建模两个句子间的语义相似度。本例演示如何使用PaddlePaddle实现一个通用的DSSM 模型，用于建模两个字符串间的语义相似度，模型实现支持通用的数据格式，用户替换数据便可以在真实场景中使用该模型。
diff --git a/dssm/README.md b/dssm/README.md
index 6e3d7583a2..ad378f6cd5 100644
--- a/dssm/README.md
+++ b/dssm/README.md
@@ -1,3 +1,7 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
 # Deep Structured Semantic Models (DSSM)
 
 Deep Structured Semantic Models (DSSM) is a simple but powerful DNN-based model for matching web search queries and URL-based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.
diff --git a/fluid/DeepASR/README.md b/fluid/DeepASR/README.md
index ac385ea754..0c3c95a67a 100644
--- a/fluid/DeepASR/README.md
+++ b/fluid/DeepASR/README.md
@@ -1 +1,6 @@
-Deep ASR Kickoff
+The minimum PaddlePaddle version needed for the code sample in this directory is the latest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+### TODO
+
+This project is still under active development.
diff --git a/fluid/DeepASR/data_utils/data_reader.py b/fluid/DeepASR/data_utils/async_data_reader.py
similarity index 56%
rename from fluid/DeepASR/data_utils/data_reader.py
rename to fluid/DeepASR/data_utils/async_data_reader.py
index f7bc9c6602..03448fadcc 100644
--- a/fluid/DeepASR/data_utils/data_reader.py
+++ b/fluid/DeepASR/data_utils/async_data_reader.py
@@ -15,6 +15,9 @@
 import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
 import data_utils.augmentor.trans_add_delta as trans_add_delta
 from data_utils.util import suppress_complaints, suppress_signal
+from data_utils.util import SharedNDArray, SharedMemoryPoolManager
+from data_utils.util import DaemonProcessGroup, batch_to_ndarray
+from data_utils.util import CriticalException, ForceExitWrapper, EpochEndSignal
 
 
 class SampleInfo(object):
@@ -27,7 +30,7 @@ class SampleInfo(object):
         feature_frame_num (int): Time length of the sample.
         feature_dim (int): Feature dimension of one frame.
         label_bin_path (str): File containing the label data.
-        label_size (int): Byte count of the sample's label data. 
+        label_size (int): Byte count of the sample's label data.
         label_frame_num (int): Label number of the sample.
     """
 
@@ -48,23 +51,36 @@ def __init__(self, feature_bin_path, feature_start, feature_size,
 
 class SampleInfoBucket(object):
     """SampleInfoBucket contains paths of several description files. Feature
-    description file contains necessary information (including path of binary 
-    data, sample start position, sample byte number etc.) to access samples' 
-    feature data and the same with the label description file. SampleInfoBucket 
+    description file contains necessary information (including path of binary
+    data, sample start position, sample byte number etc.) to access samples'
+    feature data, and likewise for the label description file. SampleInfoBucket
     is the minimum unit to do shuffle.
 
     Args:
-        feature_bin_paths (list|tuple): Files containing the binary feature 
+        feature_bin_paths (list|tuple): Files containing the binary feature
                                         data.
-        feature_desc_paths (list|tuple): Files containing the description of 
-                                         samples' feature data. 
+        feature_desc_paths (list|tuple): Files containing the description of
+                                         samples' feature data.
         label_bin_paths (list|tuple): Files containing the binary label data.
         label_desc_paths (list|tuple): Files containing the description of
                                        samples' label data.
+        split_perturb (int): Maximum perturbation value for the length of a
+                             sub-sentence when splitting a long sentence.
+        split_sentence_threshold (int): Sentences whose length is larger than
+                                        this value will be split.
+        split_sub_sentence_len (int): Base length of a sub-sentence; the actual
+                                      length is (split_sub_sentence_len +
+                                      rand() % split_perturb).
""" - def __init__(self, feature_bin_paths, feature_desc_paths, label_bin_paths, - label_desc_paths): + def __init__(self, + feature_bin_paths, + feature_desc_paths, + label_bin_paths, + label_desc_paths, + split_perturb=50, + split_sentence_threshold=512, + split_sub_sentence_len=256): block_num = len(label_bin_paths) assert len(label_desc_paths) == block_num assert len(feature_bin_paths) == block_num @@ -75,6 +91,10 @@ def __init__(self, feature_bin_paths, feature_desc_paths, label_bin_paths, self._feature_desc_paths = feature_desc_paths self._label_bin_paths = label_bin_paths self._label_desc_paths = label_desc_paths + self._split_perturb = split_perturb + self._split_sentence_threshold = split_sentence_threshold + self._split_sub_sentence_len = split_sub_sentence_len + self._rng = random.Random(0) def generate_sample_info_list(self): sample_info_list = [] @@ -101,42 +121,70 @@ def generate_sample_info_list(self): label_start = int(label_desc_split[2]) label_size = int(label_desc_split[3]) label_frame_num = int(label_desc_split[4]) - - sample_info_list.append( - SampleInfo(feature_bin_path, feature_start, feature_size, - feature_frame_num, feature_dim, label_bin_path, - label_start, label_size, label_frame_num)) + assert feature_frame_num == label_frame_num + + if self._split_sentence_threshold == -1 or \ + self._split_perturb == -1 or \ + self._split_sub_sentence_len == -1 \ + or self._split_sentence_threshold >= feature_frame_num: + sample_info_list.append( + SampleInfo(feature_bin_path, feature_start, + feature_size, feature_frame_num, feature_dim, + label_bin_path, label_start, label_size, + label_frame_num)) + #split sentence + else: + cur_frame_pos = 0 + cur_frame_len = 0 + remain_frame_num = feature_frame_num + while True: + if remain_frame_num > self._split_sentence_threshold: + cur_frame_len = self._split_sub_sentence_len + \ + self._rng.randint(0, self._split_perturb) + if cur_frame_len > remain_frame_num: + cur_frame_len = remain_frame_num + else: + cur_frame_len = remain_frame_num + + sample_info_list.append( + SampleInfo( + feature_bin_path, feature_start + cur_frame_pos + * feature_dim * 4, cur_frame_len * feature_dim * + 4, cur_frame_len, feature_dim, label_bin_path, + label_start + cur_frame_pos * 4, cur_frame_len * + 4, cur_frame_len)) + + remain_frame_num -= cur_frame_len + cur_frame_pos += cur_frame_len + if remain_frame_num <= 0: + break return sample_info_list -class EpochEndSignal(): - pass - - -class DataReader(object): +class AsyncDataReader(object): """DataReader provides basic audio sample preprocessing pipeline including data loading and data augmentation. Args: feature_file_list (str): File containing paths of feature data file and corresponding description file. - label_file_list (str): File containing paths of label data file and + label_file_list (str): File containing paths of label data file and corresponding description file. drop_frame_len (int): Samples whose label length above the value will be - dropped. - process_num (int): Number of processes for processing data. - sample_buffer_size (int): Buffer size to indicate the maximum samples + dropped.(Using '-1' to disable the policy) + proc_num (int): Number of processes for processing data. + sample_buffer_size (int): Buffer size to indicate the maximum samples cached. - sample_info_buffer_size (int): Buffer size to indicate the maximum + sample_info_buffer_size (int): Buffer size to indicate the maximum sample information cached. 
 
 
-class EpochEndSignal():
-    pass
-
-
-class DataReader(object):
+class AsyncDataReader(object):
     """AsyncDataReader provides a basic audio sample preprocessing pipeline
     including data loading and data augmentation.
 
     Args:
         feature_file_list (str): File containing paths of feature data file and
                                  corresponding description file.
-        label_file_list (str): File containing paths of label data file and 
+        label_file_list (str): File containing paths of label data file and
                                corresponding description file.
         drop_frame_len (int): Samples whose label length above the value will be
-                              dropped.
-        process_num (int): Number of processes for processing data.
-        sample_buffer_size (int): Buffer size to indicate the maximum samples 
+                              dropped. (Use -1 to disable this policy.)
+        proc_num (int): Number of processes for processing data.
+        sample_buffer_size (int): Buffer size to indicate the maximum samples
                                   cached.
-        sample_info_buffer_size (int): Buffer size to indicate the maximum 
+        sample_info_buffer_size (int): Buffer size to indicate the maximum
                                        sample information cached.
-        batch_buffer_size (int): Buffer size to indicate the maximum batch 
+        batch_buffer_size (int): Buffer size to indicate the maximum batch
                                  cached.
-        shuffle_block_num (int): Block number indicating the minimum unit to do 
+        shuffle_block_num (int): Block number indicating the minimum unit to do
                                  shuffle.
         random_seed (int): Random seed.
-        verbose (int): If set to 0, complaints including exceptions and signal 
-                       traceback from sub-process will be suppressed. If set 
+        verbose (int): If set to 0, complaints including exceptions and signal
+                       traceback from sub-processes will be suppressed. If set
                        to 1, all complaints will be printed.
     """
 
@@ -144,11 +192,11 @@ def __init__(self,
                  feature_file_list,
                  label_file_list,
                  drop_frame_len=512,
-                 process_num=10,
+                 proc_num=10,
                  sample_buffer_size=1024,
                  sample_info_buffer_size=1024,
-                 batch_buffer_size=1024,
-                 shuffle_block_num=1,
+                 batch_buffer_size=10,
+                 shuffle_block_num=10,
                  random_seed=0,
                  verbose=0):
         self._feature_file_list = feature_file_list
@@ -164,8 +212,12 @@ def __init__(self,
         self._sample_buffer_size = sample_buffer_size
         self._sample_info_buffer_size = sample_info_buffer_size
         self._batch_buffer_size = batch_buffer_size
-        self._process_num = process_num
+        self._proc_num = proc_num
+        if self._proc_num <= 2:
+            raise ValueError("Value of `proc_num` should be greater than 2.")
+        self._sample_proc_num = self._proc_num - 2
         self._verbose = verbose
+        self._force_exit = ForceExitWrapper(self._manager.Value('b', False))
 
     def generate_bucket_list(self, is_shuffle):
         if self._block_info_list is None:
@@ -199,41 +251,57 @@ def generate_bucket_list(self, is_shuffle):
     def set_transformers(self, transformers):
         self._transformers = transformers
 
-    def _sample_generator(self):
+    def recycle(self, *args):
+        for shared_ndarray in args:
+            if not isinstance(shared_ndarray, SharedNDArray):
+                raise ValueError("Only SharedNDArray objects can be recycled.")
+            shared_ndarray.recycle(self._pool_manager.pool)
+
+    def _start_async_processing(self):
         sample_info_queue = self._manager.Queue(self._sample_info_buffer_size)
         sample_queue = self._manager.Queue(self._sample_buffer_size)
         self._order_id = 0
 
-        @suppress_complaints(verbose=self._verbose)
+        @suppress_complaints(verbose=self._verbose, notify=self._force_exit)
        def ordered_feeding_task(sample_info_queue):
+            if self._verbose == 0:
+                signal.signal(signal.SIGTERM, suppress_signal)
+                signal.signal(signal.SIGINT, suppress_signal)
+
             for sample_info_bucket in self._bucket_list:
-                sample_info_list = sample_info_bucket.generate_sample_info_list(
-                )
-                self._rng.shuffle(sample_info_list)  # do shuffle here
-                for sample_info in sample_info_list:
-                    sample_info_queue.put((sample_info, self._order_id))
-                    self._order_id += 1
-
-            for i in xrange(self._process_num):
+                try:
+                    sample_info_list = \
+                        sample_info_bucket.generate_sample_info_list()
+                except Exception as e:
+                    raise CriticalException(e)
+                else:
+                    self._rng.shuffle(sample_info_list)  # do shuffle here
+                    for sample_info in sample_info_list:
+                        sample_info_queue.put((sample_info, self._order_id))
+                        self._order_id += 1
+
+            for i in xrange(self._sample_proc_num):
                 sample_info_queue.put(EpochEndSignal())
 
-        feeding_thread = Thread(
-            target=ordered_feeding_task, args=(sample_info_queue, ))
-        feeding_thread.daemon = True
-        feeding_thread.start()
+        feeding_proc = DaemonProcessGroup(
+            proc_num=1, target=ordered_feeding_task, args=(sample_info_queue, ))
+        feeding_proc.start_all()
 
-        @suppress_complaints(verbose=self._verbose)
+        @suppress_complaints(verbose=self._verbose, notify=self._force_exit)
         def ordered_processing_task(sample_info_queue, sample_queue, out_order):
             if self._verbose == 0:
                 signal.signal(signal.SIGTERM, suppress_signal)
                 signal.signal(signal.SIGINT, suppress_signal)
 
             def read_bytes(fpath, start, size):
-                f = open(fpath, 'r')
-                f.seek(start, 0)
-                binary_bytes = f.read(size)
-                f.close()
-                return binary_bytes
+                try:
+                    f = open(fpath, 'rb')
+                    f.seek(start, 0)
+                    binary_bytes = f.read(size)
+                    f.close()
+                    return binary_bytes
+                except Exception as e:
+                    raise CriticalException(e)
 
             ins = sample_info_queue.get()
 
@@ -244,11 +312,21 @@ def read_bytes(fpath, start, size):
                                            sample_info.feature_start,
                                            sample_info.feature_size)
+                    assert sample_info.feature_frame_num \
+                        * sample_info.feature_dim * 4 == len(feature_bytes), \
+                        (sample_info.feature_bin_path,
+                         sample_info.feature_frame_num,
+                         sample_info.feature_dim,
+                         len(feature_bytes))
+
                     label_bytes = read_bytes(sample_info.label_bin_path,
                                              sample_info.label_start,
                                              sample_info.label_size)
 
-                    assert sample_info.label_frame_num * 4 == len(label_bytes)
+                    assert sample_info.label_frame_num * 4 == len(label_bytes), (
+                        sample_info.label_bin_path,
+                        sample_info.label_frame_num, len(label_bytes))
+
                     label_array = struct.unpack('I' * sample_info.label_frame_num,
                                                 label_bytes)
                     label_data = np.array(
@@ -273,7 +351,8 @@ def read_bytes(fpath, start, size):
                         time.sleep(0.001)
 
                 # drop long sentence
-                if self._drop_frame_len >= sample_data[0].shape[0]:
+                if self._drop_frame_len == -1 or \
+                        self._drop_frame_len >= sample_data[0].shape[0]:
                     sample_queue.put(sample_data)
                 out_order[0] += 1
 
@@ -283,73 +362,76 @@ def read_bytes(fpath, start, size):
         out_order = self._manager.list([0])
         args = (sample_info_queue, sample_queue, out_order)
-        workers = [
-            Process(
-                target=ordered_processing_task, args=args)
-            for _ in xrange(self._process_num)
-        ]
-
-        for w in workers:
-            w.daemon = True
-            w.start()
+        sample_proc = DaemonProcessGroup(
+            proc_num=self._sample_proc_num,
+            target=ordered_processing_task,
+            args=args)
+        sample_proc.start_all()
 
-        finished_process_num = 0
+        return sample_queue
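For reference, the asserts added above encode the binary layout the reader expects: each frame contributes `feature_dim` float32 values (4 bytes each) and exactly one 4-byte unsigned-integer label. A tiny round-trip sketch of that layout, using synthetic in-memory data rather than the project's real feature/label files:

```python
# Round-trip sketch of the byte layout assumed by the asserts above:
# frame-major float32 features and one uint32 label per frame.
import struct

import numpy as np

feature_dim, frame_num = 3, 2
features = np.arange(frame_num * feature_dim, dtype='float32').reshape(
    frame_num, feature_dim)
labels = np.array([7, 9], dtype='uint32')

feature_bytes = features.tobytes()
label_bytes = labels.tobytes()

# the same size checks the reader performs
assert frame_num * feature_dim * 4 == len(feature_bytes)
assert frame_num * 4 == len(label_bytes)

# the reader's unpack call, then reshaped into a (frame_num, 1) column
label_array = struct.unpack('I' * frame_num, label_bytes)
label_data = np.array(label_array, dtype='int64').reshape((frame_num, 1))
print(label_data.ravel())  # -> [7 9]
```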
-        while finished_process_num < self._process_num:
-            sample = sample_queue.get()
-            if isinstance(sample, EpochEndSignal):
-                finished_process_num += 1
-                continue
-            yield sample
+    def batch_iterator(self, batch_size, minimum_batch_size):
+        @suppress_complaints(verbose=self._verbose, notify=self._force_exit)
+        def batch_assembling_task(sample_queue, batch_queue, pool):
+            def conv_to_shared(ndarray):
+                while self._force_exit == False:
+                    try:
+                        (name, shared_ndarray) = pool.popitem()
+                    except Exception as e:
+                        time.sleep(0.001)
+                    else:
+                        shared_ndarray.copy(ndarray)
+                        return shared_ndarray
 
-        feeding_thread.join()
-        for w in workers:
-            w.join()
+            if self._verbose == 0:
+                signal.signal(signal.SIGTERM, suppress_signal)
+                signal.signal(signal.SIGINT, suppress_signal)
 
-    def batch_iterator(self, batch_size, minimum_batch_size):
-        def batch_to_ndarray(batch_samples, lod):
-            assert len(batch_samples)
-            frame_dim = batch_samples[0][0].shape[1]
-            batch_feature = np.zeros((lod[-1], frame_dim), dtype="float32")
-            batch_label = np.zeros((lod[-1], 1), dtype="int64")
-            start = 0
-            for sample in batch_samples:
-                frame_num = sample[0].shape[0]
-                batch_feature[start:start + frame_num, :] = sample[0]
-                batch_label[start:start + frame_num, :] = sample[1]
-                start += frame_num
-            return (batch_feature, batch_label)
-
-        @suppress_complaints(verbose=self._verbose)
-        def batch_assembling_task(sample_generator, batch_queue):
             batch_samples = []
             lod = [0]
-            for sample in sample_generator():
-                batch_samples.append(sample)
-                lod.append(lod[-1] + sample[0].shape[0])
-                if len(batch_samples) == batch_size:
-                    (batch_feature, batch_label) = batch_to_ndarray(
-                        batch_samples, lod)
-                    batch_queue.put((batch_feature, batch_label, lod))
-                    batch_samples = []
-                    lod = [0]
+            done_num = 0
+            while done_num < self._sample_proc_num:
+                sample = sample_queue.get()
+                if isinstance(sample, EpochEndSignal):
+                    done_num += 1
+                else:
+                    batch_samples.append(sample)
+                    lod.append(lod[-1] + sample[0].shape[0])
+                    if len(batch_samples) == batch_size:
+                        feature, label = batch_to_ndarray(batch_samples, lod)
+
+                        feature = conv_to_shared(feature)
+                        label = conv_to_shared(label)
+                        lod = conv_to_shared(np.array(lod).astype('int64'))
+
+                        batch_queue.put((feature, label, lod))
+                        batch_samples = []
+                        lod = [0]
 
             if len(batch_samples) >= minimum_batch_size:
-                (batch_feature, batch_label) = batch_to_ndarray(batch_samples,
-                                                                lod)
-                batch_queue.put((batch_feature, batch_label, lod))
+                (feature, label) = batch_to_ndarray(batch_samples, lod)
+
+                feature = conv_to_shared(feature)
+                label = conv_to_shared(label)
+                lod = conv_to_shared(np.array(lod).astype('int64'))
+
+                batch_queue.put((feature, label, lod))
 
             batch_queue.put(EpochEndSignal())
 
-        batch_queue = Queue.Queue(self._batch_buffer_size)
+        sample_queue = self._start_async_processing()
+        batch_queue = self._manager.Queue(self._batch_buffer_size)
+
+        self._pool_manager = SharedMemoryPoolManager(self._batch_buffer_size *
+                                                     3, self._manager)
 
-        assembling_thread = Thread(
+        assembling_proc = DaemonProcessGroup(
+            proc_num=1,
             target=batch_assembling_task,
-            args=(self._sample_generator, batch_queue))
-        assembling_thread.daemon = True
-        assembling_thread.start()
+            args=(sample_queue, batch_queue, self._pool_manager.pool))
+        assembling_proc.start_all()
 
-        while True:
+        while self._force_exit == False:
             try:
                 batch_data = batch_queue.get_nowait()
             except Queue.Empty:
@@ -359,4 +441,5 @@ def batch_assembling_task(sample_generator, batch_queue):
                 break
             yield batch_data
 
-        assembling_thread.join()
+        # clean the shared memory
+        del self._pool_manager
diff --git a/fluid/DeepASR/data_utils/augmentor/tests/__init__.py b/fluid/DeepASR/data_utils/augmentor/tests/__init__.py
new file mode 100644
index 0000000000..90856dc443
--- /dev/null
+++ b/fluid/DeepASR/data_utils/augmentor/tests/__init__.py
@@ -0,0 +1,7 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
+import data_utils.augmentor.trans_add_delta as trans_add_delta
+import data_utils.augmentor.trans_splice as trans_splice
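With this change, batches arrive as `SharedNDArray` wrappers rather than plain numpy arrays, so every consumer must unwrap them via `.ndarray` and hand the buffers back with `recycle()` once they are copied out, exactly as `train.py` and `profile.py` do further down in this diff. A minimal consumption sketch (the list-file paths are placeholders):

```python
# Minimal consumer sketch for the new AsyncDataReader contract.
# 'feature.lst' / 'label.lst' are placeholder paths; transformers would
# normally be installed first via data_reader.set_transformers(...).
import data_utils.async_data_reader as reader

data_reader = reader.AsyncDataReader('feature.lst', 'label.lst')
for features, labels, lod in data_reader.batch_iterator(
        batch_size=32, minimum_batch_size=1):
    feats = features.ndarray    # numpy views into shared memory
    labs = labels.ndarray
    seq_lod = lod.ndarray
    # ... copy feats/labs into LoDTensors and run the executor here ...
    # then return the shared buffers to the pool for reuse
    data_reader.recycle(features, labels, lod)
```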
diff --git a/fluid/DeepASR/data_utils/util.py b/fluid/DeepASR/data_utils/util.py
index e64417e502..5d519c0ac3 100644
--- a/fluid/DeepASR/data_utils/util.py
+++ b/fluid/DeepASR/data_utils/util.py
@@ -1,9 +1,11 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
-import sys
+import sys, time
 from six import reraise
 from tblib import Traceback
+from multiprocessing import Manager, Process
+import posix_ipc, mmap
 
 import numpy as np
 
@@ -35,21 +37,177 @@ def lodtensor_to_ndarray(lod_tensor):
     return ret, lod_tensor.lod()
 
 
+def batch_to_ndarray(batch_samples, lod):
+    frame_dim = batch_samples[0][0].shape[1]
+    batch_feature = np.zeros((lod[-1], frame_dim), dtype="float32")
+    batch_label = np.zeros((lod[-1], 1), dtype="int64")
+    start = 0
+    for sample in batch_samples:
+        frame_num = sample[0].shape[0]
+        batch_feature[start:start + frame_num, :] = sample[0]
+        batch_label[start:start + frame_num, :] = sample[1]
+        start += frame_num
+    return (batch_feature, batch_label)
+
+
+def split_infer_result(infer_seq, lod):
+    infer_batch = []
+    for i in xrange(0, len(lod[0]) - 1):
+        infer_batch.append(infer_seq[lod[0][i]:lod[0][i + 1]])
+    return infer_batch
+
+
+class DaemonProcessGroup(object):
+    def __init__(self, proc_num, target, args):
+        self._proc_num = proc_num
+        self._workers = [
+            Process(
+                target=target, args=args) for _ in xrange(self._proc_num)
+        ]
+
+    def start_all(self):
+        for w in self._workers:
+            w.daemon = True
+            w.start()
+
+    @property
+    def proc_num(self):
+        return self._proc_num
+
+
+class EpochEndSignal(object):
+    pass
+
+
+class CriticalException(Exception):
+    pass
+
+
+class SharedNDArray(object):
+    """SharedNDArray utilizes shared memory to avoid data serialization when
+    data objects are shared among different processes. The `ndarray` can be
+    reconstructed as long as the memory address, shape and dtype are provided.
+
+    Args:
+        name (str): Address name of the shared memory.
+        whether_verify (bool): Whether to validate the writing operation.
+    """
+
+    def __init__(self, name, whether_verify=False):
+        self._name = name
+        self._shm = None
+        self._buf = None
+        self._array = np.zeros(1, dtype=np.float32)
+        self._inited = False
+        self._whether_verify = whether_verify
+
+    def zeros_like(self, shape, dtype):
+        size = int(np.prod(shape)) * np.dtype(dtype).itemsize
+        if self._inited:
+            self._shm = posix_ipc.SharedMemory(self._name)
+        else:
+            self._shm = posix_ipc.SharedMemory(
+                self._name, posix_ipc.O_CREAT, size=size)
+        self._buf = mmap.mmap(self._shm.fd, size)
+        self._array = np.ndarray(shape, dtype, self._buf, order='C')
+
+    def copy(self, ndarray):
+        size = int(np.prod(ndarray.shape)) * np.dtype(ndarray.dtype).itemsize
+        self.zeros_like(ndarray.shape, ndarray.dtype)
+        self._array[:] = ndarray
+        self._buf.flush()
+        self._inited = True
+
+        if self._whether_verify:
+            shm = posix_ipc.SharedMemory(self._name)
+            buf = mmap.mmap(shm.fd, size)
+            array = np.ndarray(ndarray.shape, ndarray.dtype, buf, order='C')
+            np.testing.assert_array_equal(array, ndarray)
+
+    @property
+    def ndarray(self):
+        return self._array
+
+    def recycle(self, pool):
+        self._buf.close()
+        self._shm.close_fd()
+        self._inited = False
+        pool[self._name] = self
+
+    def __getstate__(self):
+        return (self._name, self._array.shape, self._array.dtype, self._inited,
+                self._whether_verify)
+
+    def __setstate__(self, state):
+        self._name = state[0]
+        self._inited = state[3]
+        self.zeros_like(state[1], state[2])
+        self._whether_verify = state[4]
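The mechanics that `SharedNDArray` builds on, reduced to a standalone round trip: a POSIX shared-memory segment is created, mapped with `mmap`, and viewed as a numpy array, so only the small `(name, shape, dtype)` tuple needs to be pickled between processes. A sketch with a hypothetical segment name (requires the `posix_ipc` package):

```python
# Standalone sketch of the posix_ipc + mmap + np.ndarray trick used by
# SharedNDArray above. '/demo_shm' is a hypothetical segment name.
import mmap

import numpy as np
import posix_ipc

shape, dtype = (2, 3), np.float32
size = int(np.prod(shape)) * np.dtype(dtype).itemsize

shm = posix_ipc.SharedMemory('/demo_shm', posix_ipc.O_CREAT, size=size)
buf = mmap.mmap(shm.fd, size)
array = np.ndarray(shape, dtype, buf, order='C')  # view into shared memory
array[:] = np.arange(6, dtype=dtype).reshape(shape)
buf.flush()

# another process attaching to '/demo_shm' would now see the same bytes
buf.close()
shm.close_fd()
posix_ipc.unlink_shared_memory('/demo_shm')  # final cleanup, as in __del__
```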
+ """ + + def __init__(self, pool_size, manager, name_prefix='/deep_asr'): + self._names = [] + self._dict = manager.dict() + self._time_prefix = time.strftime('%Y%m%d%H%M%S') + + for i in xrange(pool_size): + name = name_prefix + '_' + self._time_prefix + '_' + str(i) + self._dict[name] = SharedNDArray(name) + self._names.append(name) + + @property + def pool(self): + return self._dict + + def __del__(self): + for name in self._names: + # have to unlink the shared memory + posix_ipc.unlink_shared_memory(name) + + def suppress_signal(signo, stack_frame): pass -def suppress_complaints(verbose): +def suppress_complaints(verbose, notify=None): def decorator_maker(func): def suppress_warpper(*args, **kwargs): try: func(*args, **kwargs) except: et, ev, tb = sys.exc_info() - tb = Traceback(tb) - if verbose == 1: - reraise(et, ev, tb.as_traceback()) + + if notify is not None: + notify(except_type=et, except_value=ev, traceback=tb) + + if verbose == 1 or isinstance(ev, CriticalException): + reraise(et, ev, Traceback(tb).as_traceback()) return suppress_warpper return decorator_maker + + +class ForceExitWrapper(object): + def __init__(self, exit_flag): + self._exit_flag = exit_flag + + @suppress_complaints(verbose=0) + def __call__(self, *args, **kwargs): + self._exit_flag.value = True + + def __eq__(self, flag): + return self._exit_flag.value == flag diff --git a/fluid/DeepASR/decoder/post_decode_faster.cc b/fluid/DeepASR/decoder/post_decode_faster.cc new file mode 100644 index 0000000000..d7f1d1ab34 --- /dev/null +++ b/fluid/DeepASR/decoder/post_decode_faster.cc @@ -0,0 +1,144 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "post_decode_faster.h" + +typedef kaldi::int32 int32; +using fst::SymbolTable; +using fst::VectorFst; +using fst::StdArc; + +Decoder::Decoder(std::string word_syms_filename, + std::string fst_in_filename, + std::string logprior_rxfilename) { + const char* usage = + "Decode, reading log-likelihoods (of transition-ids or whatever symbol " + "is on the graph) as matrices."; + + kaldi::ParseOptions po(usage); + binary = true; + acoustic_scale = 1.5; + allow_partial = true; + kaldi::FasterDecoderOptions decoder_opts; + decoder_opts.Register(&po, true); // true == include obscure settings. + po.Register("binary", &binary, "Write output in binary mode"); + po.Register("allow-partial", + &allow_partial, + "Produce output even when final state was not reached"); + po.Register("acoustic-scale", + &acoustic_scale, + "Scaling factor for acoustic likelihoods"); + + word_syms = NULL; + if (word_syms_filename != "") { + word_syms = fst::SymbolTable::ReadText(word_syms_filename); + if (!word_syms) + KALDI_ERR << "Could not read symbol table from file " + << word_syms_filename; + } + + std::ifstream is_logprior(logprior_rxfilename); + logprior.Read(is_logprior, false); + + // It's important that we initialize decode_fst after loglikes_reader, as it + // can prevent crashes on systems installed without enough virtual memory. 
diff --git a/fluid/DeepASR/decoder/post_decode_faster.cc b/fluid/DeepASR/decoder/post_decode_faster.cc
new file mode 100644
index 0000000000..d7f1d1ab34
--- /dev/null
+++ b/fluid/DeepASR/decoder/post_decode_faster.cc
@@ -0,0 +1,144 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "post_decode_faster.h"
+
+typedef kaldi::int32 int32;
+using fst::SymbolTable;
+using fst::VectorFst;
+using fst::StdArc;
+
+Decoder::Decoder(std::string word_syms_filename,
+                 std::string fst_in_filename,
+                 std::string logprior_rxfilename) {
+  const char* usage =
+      "Decode, reading log-likelihoods (of transition-ids or whatever symbol "
+      "is on the graph) as matrices.";
+
+  kaldi::ParseOptions po(usage);
+  binary = true;
+  acoustic_scale = 1.5;
+  allow_partial = true;
+  kaldi::FasterDecoderOptions decoder_opts;
+  decoder_opts.Register(&po, true);  // true == include obscure settings.
+  po.Register("binary", &binary, "Write output in binary mode");
+  po.Register("allow-partial",
+              &allow_partial,
+              "Produce output even when final state was not reached");
+  po.Register("acoustic-scale",
+              &acoustic_scale,
+              "Scaling factor for acoustic likelihoods");
+
+  word_syms = NULL;
+  if (word_syms_filename != "") {
+    word_syms = fst::SymbolTable::ReadText(word_syms_filename);
+    if (!word_syms)
+      KALDI_ERR << "Could not read symbol table from file "
+                << word_syms_filename;
+  }
+
+  std::ifstream is_logprior(logprior_rxfilename);
+  logprior.Read(is_logprior, false);
+
+  // It's important that we initialize decode_fst after loglikes_reader, as it
+  // can prevent crashes on systems installed without enough virtual memory.
+  // It has to do with what happens on UNIX systems if you call fork() on a
+  // large process: the page-table entries are duplicated, which requires a
+  // lot of virtual memory.
+  decode_fst = fst::ReadFstKaldi(fst_in_filename);
+
+  decoder = new kaldi::FasterDecoder(*decode_fst, decoder_opts);
+}
+
+
+Decoder::~Decoder() {
+  if (word_syms) delete word_syms;
+  delete decode_fst;
+  delete decoder;
+}
+
+std::string Decoder::decode(
+    std::string key,
+    const std::vector<std::vector<kaldi::BaseFloat>>& log_probs) {
+  size_t num_frames = log_probs.size();
+  size_t dim_label = log_probs[0].size();
+
+  kaldi::Matrix<kaldi::BaseFloat> loglikes(
+      num_frames, dim_label, kaldi::kSetZero, kaldi::kStrideEqualNumCols);
+  for (size_t i = 0; i < num_frames; ++i) {
+    memcpy(loglikes.Data() + i * dim_label,
+           log_probs[i].data(),
+           sizeof(kaldi::BaseFloat) * dim_label);
+  }
+
+  return decode(key, loglikes);
+}
+
+
+std::vector<std::string> Decoder::decode(std::string posterior_rspecifier) {
+  kaldi::SequentialBaseFloatMatrixReader posterior_reader(posterior_rspecifier);
+  std::vector<std::string> decoding_results;
+
+  for (; !posterior_reader.Done(); posterior_reader.Next()) {
+    std::string key = posterior_reader.Key();
+    kaldi::Matrix<kaldi::BaseFloat> loglikes(posterior_reader.Value());
+
+    decoding_results.push_back(decode(key, loglikes));
+  }
+
+  return decoding_results;
+}
+
+
+std::string Decoder::decode(std::string key,
+                            kaldi::Matrix<kaldi::BaseFloat>& loglikes) {
+  std::string decoding_result;
+
+  if (loglikes.NumRows() == 0) {
+    KALDI_WARN << "Zero-length utterance: " << key;
+  }
+  KALDI_ASSERT(loglikes.NumCols() == logprior.Dim());
+
+  loglikes.ApplyLog();
+  loglikes.AddVecToRows(-1.0, logprior);
+
+  kaldi::DecodableMatrixScaled decodable(loglikes, acoustic_scale);
+  decoder->Decode(&decodable);
+
+  VectorFst<kaldi::LatticeArc> decoded;  // linear FST.
+
+  if ((allow_partial || decoder->ReachedFinal()) &&
+      decoder->GetBestPath(&decoded)) {
+    if (!decoder->ReachedFinal())
+      KALDI_WARN << "Decoder did not reach end-state, outputting partial "
+                    "traceback.";
+
+    std::vector<int32> alignment;
+    std::vector<int32> words;
+    kaldi::LatticeWeight weight;
+
+    GetLinearSymbolSequence(decoded, &alignment, &words, &weight);
+
+    if (word_syms != NULL) {
+      for (size_t i = 0; i < words.size(); i++) {
+        std::string s = word_syms->Find(words[i]);
+        decoding_result += s;
+        if (s == "")
+          KALDI_ERR << "Word-id " << words[i] << " not in symbol table.";
+      }
+    }
+  }
+
+  return decoding_result;
+}
diff --git a/fluid/DeepASR/decoder/post_decode_faster.h b/fluid/DeepASR/decoder/post_decode_faster.h
new file mode 100644
index 0000000000..2e31a1c19e
--- /dev/null
+++ b/fluid/DeepASR/decoder/post_decode_faster.h
@@ -0,0 +1,57 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include <string>
+#include <vector>
+#include "base/kaldi-common.h"
+#include "base/timer.h"
+#include "decoder/decodable-matrix.h"
+#include "decoder/faster-decoder.h"
+#include "fstext/fstext-lib.h"
+#include "hmm/transition-model.h"
+#include "lat/kaldi-lattice.h"  // for {Compact}LatticeArc
+#include "tree/context-dep.h"
+#include "util/common-utils.h"
+
+
+class Decoder {
+public:
+  Decoder(std::string word_syms_filename,
+          std::string fst_in_filename,
+          std::string logprior_rxfilename);
+  ~Decoder();
+
+  // Interface to accept the scores read from specifier and return
+  // the batch decoding results
+  std::vector<std::string> decode(std::string posterior_rspecifier);
+
+  // Accept the scores of one utterance and return the decoding result
+  std::string decode(
+      std::string key,
+      const std::vector<std::vector<kaldi::BaseFloat>> &log_probs);
+
+private:
+  // For decoding one utterance
+  std::string decode(std::string key,
+                     kaldi::Matrix<kaldi::BaseFloat> &loglikes);
+
+  fst::SymbolTable *word_syms;
+  fst::VectorFst<fst::StdArc> *decode_fst;
+  kaldi::FasterDecoder *decoder;
+  kaldi::Vector<kaldi::BaseFloat> logprior;
+
+  bool binary;
+  kaldi::BaseFloat acoustic_scale;
+  bool allow_partial;
+};
diff --git a/fluid/DeepASR/decoder/pybind.cc b/fluid/DeepASR/decoder/pybind.cc
new file mode 100644
index 0000000000..56439d1802
--- /dev/null
+++ b/fluid/DeepASR/decoder/pybind.cc
@@ -0,0 +1,39 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+
+#include "post_decode_faster.h"
+
+namespace py = pybind11;
+
+PYBIND11_MODULE(post_decode_faster, m) {
+  m.doc() = "Decoder for Deep ASR model";
+
+  py::class_<Decoder>(m, "Decoder")
+      .def(py::init<std::string, std::string, std::string>())
+      .def("decode",
+           (std::vector<std::string> (Decoder::*)(std::string)) &
+               Decoder::decode,
+           "Decode for the probability matrices in specifier "
+           "and return the transcriptions.")
+      .def(
+          "decode",
+          (std::string (Decoder::*)(
+              std::string, const std::vector<std::vector<kaldi::BaseFloat>>&)) &
+              Decoder::decode,
+          "Decode one input probability matrix "
+          "and return the transcription.");
+}
diff --git a/fluid/DeepASR/decoder/setup.py b/fluid/DeepASR/decoder/setup.py
new file mode 100644
index 0000000000..a98c0b4cc1
--- /dev/null
+++ b/fluid/DeepASR/decoder/setup.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import glob
+from distutils.core import setup, Extension
+from distutils.sysconfig import get_config_vars
+
+try:
+    kaldi_root = os.environ['KALDI_ROOT']
+except KeyError:
Please " + "install kaldi and export KALDI_ROOT= .") + +args = [ + '-std=c++11', '-Wno-sign-compare', '-Wno-unused-variable', + '-Wno-unused-local-typedefs', '-Wno-unused-but-set-variable', + '-Wno-deprecated-declarations', '-Wno-unused-function' +] + +# remove warning about -Wstrict-prototypes +(opt, ) = get_config_vars('OPT') +os.environ['OPT'] = " ".join(flag for flag in opt.split() + if flag != '-Wstrict-prototypes') +os.environ['CC'] = 'g++' + +LIBS = [ + 'fst', 'kaldi-base', 'kaldi-util', 'kaldi-matrix', 'kaldi-tree', + 'kaldi-hmm', 'kaldi-fstext', 'kaldi-decoder', 'kaldi-lat' +] + +LIB_DIRS = [ + 'tools/openfst/lib', 'src/base', 'src/matrix', 'src/util', 'src/tree', + 'src/hmm', 'src/fstext', 'src/decoder', 'src/lat' +] +LIB_DIRS = [os.path.join(kaldi_root, path) for path in LIB_DIRS] +LIB_DIRS = [os.path.abspath(path) for path in LIB_DIRS] + +ext_modules = [ + Extension( + 'post_decode_faster', + ['pybind.cc', 'post_decode_faster.cc'], + include_dirs=[ + 'pybind11/include', '.', os.path.join(kaldi_root, 'src'), + os.path.join(kaldi_root, 'tools/openfst/src/include') + ], + language='c++', + libraries=LIBS, + library_dirs=LIB_DIRS, + runtime_library_dirs=LIB_DIRS, + extra_compile_args=args, ), +] + +setup( + name='post_decode_faster', + version='0.0.1', + author='Paddle', + author_email='', + description='Decoder for Deep ASR model', + ext_modules=ext_modules, ) diff --git a/fluid/DeepASR/decoder/setup.sh b/fluid/DeepASR/decoder/setup.sh new file mode 100644 index 0000000000..1471f85f41 --- /dev/null +++ b/fluid/DeepASR/decoder/setup.sh @@ -0,0 +1,7 @@ +set -e + +if [ ! -d pybind11 ]; then + git clone https://github.com/pybind/pybind11.git +fi + +python setup.py build_ext -i diff --git a/fluid/DeepASR/infer.py b/fluid/DeepASR/infer.py new file mode 100644 index 0000000000..babcb416ea --- /dev/null +++ b/fluid/DeepASR/infer.py @@ -0,0 +1,107 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import argparse +import paddle.fluid as fluid +import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm +import data_utils.augmentor.trans_add_delta as trans_add_delta +import data_utils.augmentor.trans_splice as trans_splice +import data_utils.data_reader as reader +from data_utils.util import lodtensor_to_ndarray +from data_utils.util import split_infer_result + + +def parse_args(): + parser = argparse.ArgumentParser("Inference for stacked LSTMP model.") + parser.add_argument( + '--batch_size', + type=int, + default=32, + help='The sequence number of a batch data. (default: %(default)d)') + parser.add_argument( + '--device', + type=str, + default='GPU', + choices=['CPU', 'GPU'], + help='The device type. (default: %(default)s)') + parser.add_argument( + '--mean_var', + type=str, + default='data/global_mean_var_search26kHr', + help="The path for feature's global mean and variance. " + "(default: %(default)s)") + parser.add_argument( + '--infer_feature_lst', + type=str, + default='data/infer_feature.lst', + help='The feature list path for inference. (default: %(default)s)') + parser.add_argument( + '--infer_label_lst', + type=str, + default='data/infer_label.lst', + help='The label list path for inference. (default: %(default)s)') + parser.add_argument( + '--infer_model_path', + type=str, + default='./infer_models/deep_asr.pass_0.infer.model/', + help='The directory for loading inference model. 
diff --git a/fluid/DeepASR/infer.py b/fluid/DeepASR/infer.py
new file mode 100644
index 0000000000..babcb416ea
--- /dev/null
+++ b/fluid/DeepASR/infer.py
@@ -0,0 +1,108 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import argparse
+import paddle.fluid as fluid
+import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
+import data_utils.augmentor.trans_add_delta as trans_add_delta
+import data_utils.augmentor.trans_splice as trans_splice
+import data_utils.async_data_reader as reader
+from data_utils.util import lodtensor_to_ndarray
+from data_utils.util import split_infer_result
+
+
+def parse_args():
+    parser = argparse.ArgumentParser("Inference for stacked LSTMP model.")
+    parser.add_argument(
+        '--batch_size',
+        type=int,
+        default=32,
+        help='The sequence number of a batch data. (default: %(default)d)')
+    parser.add_argument(
+        '--device',
+        type=str,
+        default='GPU',
+        choices=['CPU', 'GPU'],
+        help='The device type. (default: %(default)s)')
+    parser.add_argument(
+        '--mean_var',
+        type=str,
+        default='data/global_mean_var_search26kHr',
+        help="The path for feature's global mean and variance. "
+        "(default: %(default)s)")
+    parser.add_argument(
+        '--infer_feature_lst',
+        type=str,
+        default='data/infer_feature.lst',
+        help='The feature list path for inference. (default: %(default)s)')
+    parser.add_argument(
+        '--infer_label_lst',
+        type=str,
+        default='data/infer_label.lst',
+        help='The label list path for inference. (default: %(default)s)')
+    parser.add_argument(
+        '--infer_model_path',
+        type=str,
+        default='./infer_models/deep_asr.pass_0.infer.model/',
+        help='The directory for loading inference model. '
+        '(default: %(default)s)')
+    args = parser.parse_args()
+    return args
+
+
+def print_arguments(args):
+    print('----------- Configuration Arguments -----------')
+    for arg, value in sorted(vars(args).iteritems()):
+        print('%s: %s' % (arg, value))
+    print('------------------------------------------------')
+
+
+def infer(args):
+    """ Gets one batch of feature data and predicts labels for each sample.
+    """
+
+    if not os.path.exists(args.infer_model_path):
+        raise IOError("Invalid inference model path!")
+
+    place = fluid.CUDAPlace(0) if args.device == 'GPU' else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+
+    # load model
+    [infer_program, feed_dict,
+     fetch_targets] = fluid.io.load_inference_model(args.infer_model_path, exe)
+
+    ltrans = [
+        trans_add_delta.TransAddDelta(2, 2),
+        trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
+        trans_splice.TransSplice()
+    ]
+
+    infer_data_reader = reader.AsyncDataReader(args.infer_feature_lst,
+                                               args.infer_label_lst)
+    infer_data_reader.set_transformers(ltrans)
+
+    feature_t = fluid.LoDTensor()
+    one_batch = infer_data_reader.batch_iterator(args.batch_size, 1).next()
+    (features, labels, lod) = one_batch
+    feature_t.set(features.ndarray, place)
+    feature_t.set_lod([lod.ndarray])
+    infer_data_reader.recycle(features, labels, lod)
+
+    results = exe.run(infer_program,
+                      feed={feed_dict[0]: feature_t},
+                      fetch_list=fetch_targets,
+                      return_numpy=False)
+
+    probs, lod = lodtensor_to_ndarray(results[0])
+    preds = probs.argmax(axis=1)
+    infer_batch = split_infer_result(preds, lod)
+    for index, sample in enumerate(infer_batch):
+        print("result %d: " % index, sample, '\n')
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    print_arguments(args)
+    infer(args)
diff --git a/fluid/DeepASR/infer_by_ckpt.py b/fluid/DeepASR/infer_by_ckpt.py
new file mode 100644
index 0000000000..f267f67498
--- /dev/null
+++ b/fluid/DeepASR/infer_by_ckpt.py
@@ -0,0 +1,195 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import sys
+import os
+import numpy as np
+import argparse
+import time
+
+import paddle.fluid as fluid
+import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
+import data_utils.augmentor.trans_add_delta as trans_add_delta
+import data_utils.augmentor.trans_splice as trans_splice
+import data_utils.async_data_reader as reader
+from decoder.post_decode_faster import Decoder
+from data_utils.util import lodtensor_to_ndarray
+from model_utils.model import stacked_lstmp_model
+from data_utils.util import split_infer_result
+
+
+def parse_args():
+    parser = argparse.ArgumentParser("Run inference by using checkpoint.")
+    parser.add_argument(
+        '--batch_size',
+        type=int,
+        default=32,
+        help='The sequence number of a batch data. (default: %(default)d)')
+    parser.add_argument(
+        '--minimum_batch_size',
+        type=int,
+        default=1,
+        help='The minimum sequence number of a batch data. '
+        '(default: %(default)d)')
+    parser.add_argument(
+        '--frame_dim',
+        type=int,
+        default=120 * 11,
+        help='Frame dimension of feature data. (default: %(default)d)')
+    parser.add_argument(
+        '--stacked_num',
+        type=int,
+        default=5,
+        help='Number of lstmp layers to stack. (default: %(default)d)')
+    parser.add_argument(
+        '--proj_dim',
+        type=int,
+        default=512,
+        help='Project size of lstmp unit. (default: %(default)d)')
+    parser.add_argument(
+        '--hidden_dim',
+        type=int,
+        default=1024,
+        help='Hidden size of lstmp unit. (default: %(default)d)')
+    parser.add_argument(
+        '--class_num',
+        type=int,
+        default=1749,
+        help='Number of classes in label. (default: %(default)d)')
+    parser.add_argument(
+        '--learning_rate',
+        type=float,
+        default=0.00016,
+        help='Learning rate used to train. (default: %(default)f)')
+    parser.add_argument(
+        '--device',
+        type=str,
+        default='GPU',
+        choices=['CPU', 'GPU'],
+        help='The device type. (default: %(default)s)')
+    parser.add_argument(
+        '--parallel', action='store_true', help='If set, run in parallel.')
+    parser.add_argument(
+        '--mean_var',
+        type=str,
+        default='data/global_mean_var_search26kHr',
+        help="The path for feature's global mean and variance. "
+        "(default: %(default)s)")
+    parser.add_argument(
+        '--infer_feature_lst',
+        type=str,
+        default='data/infer_feature.lst',
+        help='The feature list path for inference. (default: %(default)s)')
+    parser.add_argument(
+        '--infer_label_lst',
+        type=str,
+        default='data/infer_label.lst',
+        help='The label list path for inference. (default: %(default)s)')
+    parser.add_argument(
+        '--checkpoint',
+        type=str,
+        default='./checkpoint',
+        help="The checkpoint path to init model. (default: %(default)s)")
+    parser.add_argument(
+        '--vocabulary',
+        type=str,
+        default='./decoder/graph/words.txt',
+        help="The path to vocabulary. (default: %(default)s)")
+    parser.add_argument(
+        '--graphs',
+        type=str,
+        default='./decoder/graph/TLG.fst',
+        help="The path to TLG graphs for decoding. (default: %(default)s)")
+    parser.add_argument(
+        '--log_prior',
+        type=str,
+        default="./decoder/logprior",
+        help="The log prior probs for training data. (default: %(default)s)")
+    args = parser.parse_args()
+    return args
+
+
+def print_arguments(args):
+    print('----------- Configuration Arguments -----------')
+    for arg, value in sorted(vars(args).iteritems()):
+        print('%s: %s' % (arg, value))
+    print('------------------------------------------------')
+
+
+def infer_from_ckpt(args):
+    """Inference by using checkpoint."""
+
+    if not os.path.exists(args.checkpoint):
+        raise IOError("Invalid checkpoint!")
+
+    prediction, avg_cost, accuracy = stacked_lstmp_model(
+        frame_dim=args.frame_dim,
+        hidden_dim=args.hidden_dim,
+        proj_dim=args.proj_dim,
+        stacked_num=args.stacked_num,
+        class_num=args.class_num,
+        parallel=args.parallel)
+
+    infer_program = fluid.default_main_program().clone()
+
+    optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
+    optimizer.minimize(avg_cost)
+
+    place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
+    exe = fluid.Executor(place)
+    exe.run(fluid.default_startup_program())
+
+    # load checkpoint.
+    fluid.io.load_persistables(exe, args.checkpoint)
+
+    # init decoder
+    decoder = Decoder(args.vocabulary, args.graphs, args.log_prior)
+
+    ltrans = [
+        trans_add_delta.TransAddDelta(2, 2),
+        trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
+        trans_splice.TransSplice()
+    ]
+
+    feature_t = fluid.LoDTensor()
+    label_t = fluid.LoDTensor()
+
+    # infer data reader
+    infer_data_reader = reader.AsyncDataReader(args.infer_feature_lst,
+                                               args.infer_label_lst)
+    infer_data_reader.set_transformers(ltrans)
+    infer_costs, infer_accs = [], []
+    for batch_id, batch_data in enumerate(
+            infer_data_reader.batch_iterator(args.batch_size,
+                                             args.minimum_batch_size)):
+        # load_data
+        (features, labels, lod) = batch_data
+        feature_t.set(features.ndarray, place)
+        feature_t.set_lod([lod.ndarray])
+        label_t.set(labels.ndarray, place)
+        label_t.set_lod([lod.ndarray])
+
+        infer_data_reader.recycle(features, labels, lod)
+
+        results = exe.run(infer_program,
+                          feed={"feature": feature_t,
+                                "label": label_t},
+                          fetch_list=[prediction, avg_cost, accuracy],
+                          return_numpy=False)
+        infer_costs.append(lodtensor_to_ndarray(results[1])[0])
+        infer_accs.append(lodtensor_to_ndarray(results[2])[0])
+
+        probs, lod = lodtensor_to_ndarray(results[0])
+        infer_batch = split_infer_result(probs, lod)
+        for index, sample in enumerate(infer_batch):
+            key = "utter#%d" % (batch_id * args.batch_size + index)
+            print(key, ": ", decoder.decode(key, sample), "\n")
+
+    print(np.mean(infer_costs), np.mean(infer_accs))
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    print_arguments(args)
+
+    infer_from_ckpt(args)
diff --git a/fluid/DeepASR/model_utils/__init__.py b/fluid/DeepASR/model_utils/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/fluid/DeepASR/model_utils/model.py b/fluid/DeepASR/model_utils/model.py
index 4e29394a0a..8fb7596e12 100644
--- a/fluid/DeepASR/model_utils/model.py
+++ b/fluid/DeepASR/model_utils/model.py
@@ -3,10 +3,11 @@
 from __future__ import print_function
 
 import paddle.v2 as paddle
-import paddle.v2.fluid as fluid
+import paddle.fluid as fluid
 
 
-def stacked_lstmp_model(hidden_dim,
+def stacked_lstmp_model(frame_dim,
+                        hidden_dim,
                         proj_dim,
                         stacked_num,
                         class_num,
@@ -20,12 +21,13 @@ def stacked_lstmp_model(hidden_dim,
     label data respectively. And in inference, only `feature` is needed.
 
     Args:
-        hidden_dim(int): The hidden state's dimension of the LSTMP layer.
-        proj_dim(int): The projection size of the LSTMP layer.
-        stacked_num(int): The number of stacked LSTMP layers.
-        parallel(bool): Run in parallel or not, default `False`.
-        is_train(bool): Run in training phase or not, default `True`.
-        class_dim(int): The number of output classes.
+        frame_dim(int): The frame dimension of feature data.
+        hidden_dim(int): The hidden state's dimension of the LSTMP layer.
+        proj_dim(int): The projection size of the LSTMP layer.
+        stacked_num(int): The number of stacked LSTMP layers.
+        parallel(bool): Run in parallel or not, default `False`.
+        is_train(bool): Run in training phase or not, default `True`.
+        class_num(int): The number of output classes.
""" # network configuration @@ -78,7 +80,7 @@ def _net_conf(feature, label): # data feeder feature = fluid.layers.data( - name="feature", shape=[-1, 120 * 11], dtype="float32", lod_level=1) + name="feature", shape=[-1, frame_dim], dtype="float32", lod_level=1) label = fluid.layers.data( name="label", shape=[-1, 1], dtype="int64", lod_level=1) @@ -92,11 +94,12 @@ def _net_conf(feature, label): feat_ = pd.read_input(feature) label_ = pd.read_input(label) prediction, avg_cost, acc = _net_conf(feat_, label_) - for out in [avg_cost, acc]: + for out in [prediction, avg_cost, acc]: pd.write_output(out) # get mean loss and acc through every devices. - avg_cost, acc = pd() + prediction, avg_cost, acc = pd() + prediction.stop_gradient = True avg_cost = fluid.layers.mean(x=avg_cost) acc = fluid.layers.mean(x=acc) else: diff --git a/fluid/DeepASR/tools/profile.py b/fluid/DeepASR/tools/profile.py index a45ce98acb..cf73294453 100644 --- a/fluid/DeepASR/tools/profile.py +++ b/fluid/DeepASR/tools/profile.py @@ -7,13 +7,13 @@ import argparse import time -import paddle.v2.fluid as fluid -import paddle.v2.fluid.profiler as profiler +import paddle.fluid as fluid +import paddle.fluid.profiler as profiler import _init_paths import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm import data_utils.augmentor.trans_add_delta as trans_add_delta import data_utils.augmentor.trans_splice as trans_splice -import data_utils.data_reader as reader +import data_utils.async_data_reader as reader from model_utils.model import stacked_lstmp_model from data_utils.util import lodtensor_to_ndarray @@ -31,6 +31,11 @@ def parse_args(): default=1, help='The minimum sequence number of a batch data. ' '(default: %(default)d)') + parser.add_argument( + '--frame_dim', + type=int, + default=120 * 11, + help='Frame dimension of feature data. (default: %(default)d)') parser.add_argument( '--stacked_num', type=int, @@ -46,10 +51,15 @@ def parse_args(): type=int, default=1024, help='Hidden size of lstmp unit. (default: %(default)d)') + parser.add_argument( + '--class_num', + type=int, + default=1749, + help='Number of classes in label. (default: %(default)d)') parser.add_argument( '--learning_rate', type=float, - default=0.002, + default=0.00016, help='Learning rate used to train. 
         help='Learning rate used to train. (default: %(default)f)')
     parser.add_argument(
         '--device',
@@ -119,14 +129,15 @@ def profile(args):
             "arg 'first_batches_to_skip' must not be smaller than 0.")
 
     _, avg_cost, accuracy = stacked_lstmp_model(
+        frame_dim=args.frame_dim,
         hidden_dim=args.hidden_dim,
         proj_dim=args.proj_dim,
         stacked_num=args.stacked_num,
-        class_num=1749,
+        class_num=args.class_num,
         parallel=args.parallel)
 
-    adam_optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
-    adam_optimizer.minimize(avg_cost)
+    optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
+    optimizer.minimize(avg_cost)
 
     place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
     exe = fluid.Executor(place)
@@ -138,7 +149,7 @@ def profile(args):
         trans_splice.TransSplice()
     ]
 
-    data_reader = reader.DataReader(args.feature_lst, args.label_lst)
+    data_reader = reader.AsyncDataReader(args.feature_lst, args.label_lst)
     data_reader.set_transformers(ltrans)
 
     feature_t = fluid.LoDTensor()
@@ -158,17 +169,20 @@ def profile(args):
             frames_seen = 0
         # load_data
         (features, labels, lod) = batch_data
-        feature_t.set(features, place)
-        feature_t.set_lod([lod])
-        label_t.set(labels, place)
-        label_t.set_lod([lod])
+        feature_t.set(features.ndarray, place)
+        feature_t.set_lod([lod.ndarray])
+        label_t.set(labels.ndarray, place)
+        label_t.set_lod([lod.ndarray])
+
+        frames_seen += lod.ndarray[-1]
 
-        frames_seen += lod[-1]
+        data_reader.recycle(features, labels, lod)
 
         outs = exe.run(fluid.default_main_program(),
                        feed={"feature": feature_t,
                              "label": label_t},
-                       fetch_list=[avg_cost, accuracy],
+                       fetch_list=[avg_cost, accuracy]
+                       if args.print_train_acc else [],
                        return_numpy=False)
 
     if args.print_train_acc:
diff --git a/fluid/DeepASR/train.py b/fluid/DeepASR/train.py
index 1c45f0a086..446e9e0ab1 100644
--- a/fluid/DeepASR/train.py
+++ b/fluid/DeepASR/train.py
@@ -8,11 +8,11 @@
 import argparse
 import time
 
-import paddle.v2.fluid as fluid
+import paddle.fluid as fluid
 import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
 import data_utils.augmentor.trans_add_delta as trans_add_delta
 import data_utils.augmentor.trans_splice as trans_splice
-import data_utils.data_reader as reader
+import data_utils.async_data_reader as reader
 from data_utils.util import lodtensor_to_ndarray
 from model_utils.model import stacked_lstmp_model
 
@@ -30,21 +30,31 @@ def parse_args():
         default=1,
         help='The minimum sequence number of a batch data. '
         '(default: %(default)d)')
+    parser.add_argument(
+        '--frame_dim',
+        type=int,
+        default=120 * 11,
+        help='Frame dimension of feature data. (default: %(default)d)')
     parser.add_argument(
         '--stacked_num',
         type=int,
         default=5,
-        help='Number of lstm layers to stack. (default: %(default)d)')
+        help='Number of lstmp layers to stack. (default: %(default)d)')
     parser.add_argument(
         '--proj_dim',
         type=int,
         default=512,
-        help='Project size of lstm unit. (default: %(default)d)')
+        help='Project size of lstmp unit. (default: %(default)d)')
     parser.add_argument(
         '--hidden_dim',
         type=int,
         default=1024,
-        help='Hidden size of lstm unit. (default: %(default)d)')
+        help='Hidden size of lstmp unit. (default: %(default)d)')
+    parser.add_argument(
+        '--class_num',
+        type=int,
+        default=1749,
+        help='Number of classes in label. (default: %(default)d)')
     parser.add_argument(
         '--pass_num',
         type=int,
@@ -58,7 +68,7 @@ def parse_args():
     parser.add_argument(
         '--learning_rate',
         type=float,
-        default=0.002,
+        default=0.00016,
         help='Learning rate used to train. (default: %(default)f)')
     parser.add_argument(
         '--device',
@@ -72,33 +82,46 @@ def parse_args():
         '--mean_var',
         type=str,
         default='data/global_mean_var_search26kHr',
-        help='mean var path')
+        help="The path for feature's global mean and variance. "
+        "(default: %(default)s)")
     parser.add_argument(
         '--train_feature_lst',
         type=str,
         default='data/feature.lst',
-        help='feature list path for training.')
+        help='The feature list path for training. (default: %(default)s)')
     parser.add_argument(
         '--train_label_lst',
         type=str,
         default='data/label.lst',
-        help='label list path for training.')
+        help='The label list path for training. (default: %(default)s)')
     parser.add_argument(
         '--val_feature_lst',
         type=str,
         default='data/val_feature.lst',
-        help='feature list path for validation.')
+        help='The feature list path for validation. (default: %(default)s)')
     parser.add_argument(
         '--val_label_lst',
         type=str,
         default='data/val_label.lst',
-        help='label list path for validation.')
+        help='The label list path for validation. (default: %(default)s)')
+    parser.add_argument(
+        '--init_model_path',
+        type=str,
+        default=None,
+        help="The model (checkpoint) path which the training resumes from. "
+        "If None, train the model from scratch. (default: %(default)s)")
     parser.add_argument(
-        '--model_save_dir',
+        '--checkpoints',
         type=str,
         default='./checkpoints',
-        help='directory to save model. Do not save model if set to '
-        '.')
+        help="The directory for saving checkpoints. Do not save checkpoints "
+        "if set to ''. (default: %(default)s)")
+    parser.add_argument(
+        '--infer_models',
+        type=str,
+        default='./infer_models',
+        help="The directory for saving inference models. Do not save inference "
+        "models if set to ''. (default: %(default)s)")
     args = parser.parse_args()
     return args
 
@@ -114,27 +137,37 @@ def train(args):
     """train in loop.
     """
 
-    # prediction, avg_cost, accuracy = stacked_lstmp_model(args.hidden_dim,
-    #     args.proj_dim, args.stacked_num, class_num=1749, args.parallel)
+    # paths check
+    if args.init_model_path is not None and \
+            not os.path.exists(args.init_model_path):
+        raise IOError("Invalid initial model path!")
+    if args.checkpoints != '' and not os.path.exists(args.checkpoints):
+        os.mkdir(args.checkpoints)
+    if args.infer_models != '' and not os.path.exists(args.infer_models):
+        os.mkdir(args.infer_models)
+
     prediction, avg_cost, accuracy = stacked_lstmp_model(
+        frame_dim=args.frame_dim,
         hidden_dim=args.hidden_dim,
         proj_dim=args.proj_dim,
         stacked_num=args.stacked_num,
-        class_num=1749,
+        class_num=args.class_num,
         parallel=args.parallel)
 
-    adam_optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
-    adam_optimizer.minimize(avg_cost)
-
     # program for test
     test_program = fluid.default_main_program().clone()
-    with fluid.program_guard(test_program):
-        test_program = fluid.io.get_inference_program([avg_cost, accuracy])
+
+    optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
+    optimizer.minimize(avg_cost)
 
     place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
     exe = fluid.Executor(place)
     exe.run(fluid.default_startup_program())
 
+    # resume training if initial model provided.
+    if args.init_model_path is not None:
+        fluid.io.load_persistables(exe, args.init_model_path)
+
     ltrans = [
         trans_add_delta.TransAddDelta(2, 2),
         trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
@@ -151,8 +184,8 @@ def test(exe):
                 os.path.exists(args.val_label_lst)):
             return -1.0, -1.0
         # test data reader
-        test_data_reader = reader.DataReader(args.val_feature_lst,
-                                             args.val_label_lst)
+        test_data_reader = reader.AsyncDataReader(args.val_feature_lst,
+                                                  args.val_label_lst)
         test_data_reader.set_transformers(ltrans)
         test_costs, test_accs = [], []
         for batch_id, batch_data in enumerate(
@@ -160,10 +193,12 @@ def test(exe):
                                                args.minimum_batch_size)):
             # load_data
             (features, labels, lod) = batch_data
-            feature_t.set(features, place)
-            feature_t.set_lod([lod])
-            label_t.set(labels, place)
-            label_t.set_lod([lod])
+            feature_t.set(features.ndarray, place)
+            feature_t.set_lod([lod.ndarray])
+            label_t.set(labels.ndarray, place)
+            label_t.set_lod([lod.ndarray])
+
+            test_data_reader.recycle(features, labels, lod)
 
             cost, acc = exe.run(test_program,
                                 feed={"feature": feature_t,
@@ -175,8 +210,8 @@ def test(exe):
         return np.mean(test_costs), np.mean(test_accs)
 
     # train data reader
-    train_data_reader = reader.DataReader(args.train_feature_lst,
-                                          args.train_label_lst)
+    train_data_reader = reader.AsyncDataReader(args.train_feature_lst,
+                                               args.train_label_lst, -1)
     train_data_reader.set_transformers(ltrans)
     # train
     for pass_id in xrange(args.pass_num):
@@ -186,30 +221,46 @@ def test(exe):
                                                 args.minimum_batch_size)):
             # load_data
             (features, labels, lod) = batch_data
-            feature_t.set(features, place)
-            feature_t.set_lod([lod])
-            label_t.set(labels, place)
-            label_t.set_lod([lod])
+            feature_t.set(features.ndarray, place)
+            feature_t.set_lod([lod.ndarray])
+            label_t.set(labels.ndarray, place)
+            label_t.set_lod([lod.ndarray])
 
-            cost, acc = exe.run(fluid.default_main_program(),
-                                feed={"feature": feature_t,
-                                      "label": label_t},
-                                fetch_list=[avg_cost, accuracy],
-                                return_numpy=False)
+            train_data_reader.recycle(features, labels, lod)
+
+            to_print = batch_id > 0 and (batch_id % args.print_per_batches == 0)
+            outs = exe.run(fluid.default_main_program(),
+                           feed={"feature": feature_t,
+                                 "label": label_t},
+                           fetch_list=[avg_cost, accuracy] if to_print else [],
+                           return_numpy=False)
 
-            if batch_id > 0 and (batch_id % args.print_per_batches == 0):
+            if to_print:
                 print("\nBatch %d, train cost: %f, train acc: %f" %
-                      (batch_id, lodtensor_to_ndarray(cost)[0],
-                       lodtensor_to_ndarray(acc)[0]))
+                      (batch_id, lodtensor_to_ndarray(outs[0])[0],
+                       lodtensor_to_ndarray(outs[1])[0]))
+                # save the latest checkpoint
+                if args.checkpoints != '':
+                    model_path = os.path.join(args.checkpoints,
+                                              "deep_asr.latest.checkpoint")
+                    fluid.io.save_persistables(exe, model_path)
             else:
                 sys.stdout.write('.')
                 sys.stdout.flush()
        # run test
        val_cost, val_acc = test(exe)
 
-        # save model
-        if args.model_save_dir != '':
+        # save checkpoint per pass
+        if args.checkpoints != '':
             model_path = os.path.join(
-                args.model_save_dir, "deep_asr.pass_" + str(pass_id) + ".model")
+                args.checkpoints,
+                "deep_asr.pass_" + str(pass_id) + ".checkpoint")
+            fluid.io.save_persistables(exe, model_path)
+        # save inference model
+        if args.infer_models != '':
+            model_path = os.path.join(
+                args.infer_models,
+                "deep_asr.pass_" + str(pass_id) + ".infer.model")
             fluid.io.save_inference_model(model_path, ["feature"],
                                           [prediction], exe)
         # cal pass time
@@ -224,7 +275,4 @@ def test(exe):
     args = parse_args()
     print_arguments(args)
 
os.path.exists(args.model_save_dir): - os.mkdir(args.model_save_dir) - train(args) diff --git a/fluid/README.md b/fluid/README.md index e69de29bb2..88357ced18 100644 --- a/fluid/README.md +++ b/fluid/README.md @@ -0,0 +1,5 @@ +# Paddle Fluid Models + +--- + +The Paddle Fluid models are a collection of example models that use Paddle Fluid APIs. Currently, example codes in this directory are still under active development. diff --git a/fluid/adversarial/README.md b/fluid/adversarial/README.md index 51da21918a..e052361c2a 100644 --- a/fluid/adversarial/README.md +++ b/fluid/adversarial/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). + +--- + # Advbox Advbox is a Python toolbox to create adversarial examples that fool neural networks. It requires Python and paddle. diff --git a/fluid/adversarial/advbox/__init__.py b/fluid/adversarial/advbox/__init__.py index de124bad2e..e68b585ef9 100644 --- a/fluid/adversarial/advbox/__init__.py +++ b/fluid/adversarial/advbox/__init__.py @@ -1,7 +1,3 @@ """ A set of tools for generating adversarial example on paddle platform """ - -from . import attacks -from . import models -from .adversary import Adversary diff --git a/fluid/adversarial/advbox/adversary.py b/fluid/adversarial/advbox/adversary.py index f044dfe8c9..14b8517e33 100644 --- a/fluid/adversarial/advbox/adversary.py +++ b/fluid/adversarial/advbox/adversary.py @@ -18,13 +18,15 @@ def __init__(self, original, original_label=None): """ assert original is not None + self.original_label = original_label + self.target_label = None + self.adversarial_label = None + self.__original = original - self.__original_label = original_label - self.__target_label = None self.__target = None self.__is_targeted_attack = False self.__adversarial_example = None - self.__adversarial_label = None + self.__bad_adversarial_example = None def set_target(self, is_targeted_attack, target=None, target_label=None): """ @@ -38,10 +40,10 @@ def set_target(self, is_targeted_attack, target=None, target_label=None): """ assert (target_label is None) or is_targeted_attack self.__is_targeted_attack = is_targeted_attack - self.__target_label = target_label + self.target_label = target_label self.__target = target if not is_targeted_attack: - self.__target_label = None + self.target_label = None self.__target = None def set_original(self, original, original_label=None): @@ -53,10 +55,11 @@ def set_original(self, original, original_label=None): """ if original != self.__original: self.__original = original - self.__original_label = original_label + self.original_label = original_label self.__adversarial_example = None + self.__bad_adversarial_example = None if original is None: - self.__original_label = None + self.original_label = None def _is_successful(self, adversarial_label): """ @@ -65,11 +68,11 @@ def _is_successful(self, adversarial_label): :param adversarial_label: adversarial label. 
:return: bool """ - if self.__target_label is not None: - return adversarial_label == self.__target_label + if self.target_label is not None: + return adversarial_label == self.target_label else: return (adversarial_label is not None) and \ - (adversarial_label != self.__original_label) + (adversarial_label != self.original_label) def is_successful(self): """ @@ -77,7 +80,7 @@ def is_successful(self): :return: bool """ - return self._is_successful(self.__adversarial_label) + return self._is_successful(self.adversarial_label) def try_accept_the_example(self, adversarial_example, adversarial_label): """ @@ -93,7 +96,9 @@ def try_accept_the_example(self, adversarial_example, adversarial_label): ok = self._is_successful(adversarial_label) if ok: self.__adversarial_example = adversarial_example - self.__adversarial_label = adversarial_label + self.adversarial_label = adversarial_label + else: + self.__bad_adversarial_example = adversarial_example return ok def perturbation(self, multiplying_factor=1.0): @@ -104,9 +109,14 @@ def perturbation(self, multiplying_factor=1.0): :return: The perturbation that is multiplied by multiplying_factor. """ assert self.__original is not None - assert self.__adversarial_example is not None - return multiplying_factor * ( - self.__adversarial_example - self.__original) + assert (self.__adversarial_example is not None) or \ + (self.__bad_adversarial_example is not None) + if self.__adversarial_example is not None: + return multiplying_factor * ( + self.__adversarial_example - self.__original) + else: + return multiplying_factor * ( + self.__bad_adversarial_example - self.__original) @property def is_targeted_attack(self): @@ -115,20 +125,6 @@ def is_targeted_attack(self): """ return self.__is_targeted_attack - @property - def target_label(self): - """ - :property: target_label - """ - return self.__target_label - - @target_label.setter - def target_label(self, label): - """ - :property: target_label - """ - self.__target_label = label - @property def target(self): """ @@ -143,20 +139,6 @@ def original(self): """ return self.__original - @property - def original_label(self): - """ - :property: original - """ - return self.__original_label - - @original_label.setter - def original_label(self, label): - """ - original_label setter - """ - self.__original_label = label - @property def adversarial_example(self): """ @@ -164,23 +146,9 @@ def adversarial_example(self): """ return self.__adversarial_example - @adversarial_example.setter - def adversarial_example(self, example): - """ - adversarial_example setter - """ - self.__adversarial_example = example - @property - def adversarial_label(self): - """ - :property: adversarial_label - """ - return self.__adversarial_label - - @adversarial_label.setter - def adversarial_label(self, label): + def bad_adversarial_example(self): """ - adversarial_label setter + :property: bad_adversarial_example """ - self.__adversarial_label = label + return self.__bad_adversarial_example diff --git a/fluid/adversarial/advbox/attacks/__init__.py b/fluid/adversarial/advbox/attacks/__init__.py index bafd123c67..3893b769f3 100644 --- a/fluid/adversarial/advbox/attacks/__init__.py +++ b/fluid/adversarial/advbox/attacks/__init__.py @@ -1,10 +1,3 @@ """ -Attack methods +Attack methods __init__.py """ - -from .base import Attack -from .deepfool import DeepFoolAttack -from .gradientsign import FGSM -from .gradientsign import GradientSignAttack -from .iterator_gradientsign import IFGSM -from .iterator_gradientsign import IteratorGradientSignAttack 
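Note that with the re-exports above removed, the package-level imports (`from advbox import Adversary`, `from advbox.attacks import FGSM`) no longer resolve; callers import from the defining modules instead, as the updated tutorials later in this patch do. A minimal sketch of the new import style, with the model wiring elided:

```python
# New-style imports after this refactor; the package __init__ files no
# longer re-export these names.
from advbox.adversary import Adversary
from advbox.attacks.gradient_method import FGSM
from advbox.models.paddle import PaddleModel

# Model construction is elided; mnist_tutorial_fgsm.py in this patch shows
# the full setup, roughly:
#   m = PaddleModel(program, 'img', 'label', logits_name, cost_name, (-1, 1))
#   adversary = FGSM(m)(Adversary(image, true_label))
```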
diff --git a/fluid/adversarial/advbox/attacks/base.py b/fluid/adversarial/advbox/attacks/base.py index eb9b1d480c..af2eae5e41 100644 --- a/fluid/adversarial/advbox/attacks/base.py +++ b/fluid/adversarial/advbox/attacks/base.py @@ -52,21 +52,23 @@ def _preprocess(self, adversary): :param adversary: adversary :return: None """ + assert self.model.channel_axis() == adversary.original.ndim + if adversary.original_label is None: adversary.original_label = np.argmax( self.model.predict(adversary.original)) if adversary.is_targeted_attack and adversary.target_label is None: if adversary.target is None: raise ValueError( - 'When adversary.is_targeted_attack is True, ' + 'When adversary.is_targeted_attack is true, ' 'adversary.target_label or adversary.target must be set.') else: - adversary.target_label_label = np.argmax( - self.model.predict( - self.model.scale_input(adversary.target))) + adversary.target_label = np.argmax( + self.model.predict(adversary.target)) - logging.info('adversary:\noriginal_label: {}' - '\n target_lable: {}' - '\n is_targeted_attack: {}' + logging.info('adversary:' + '\n original_label: {}' + '\n target_label: {}' + '\n is_targeted_attack: {}' ''.format(adversary.original_label, adversary.target_label, adversary.is_targeted_attack)) diff --git a/fluid/adversarial/advbox/attacks/deepfool.py b/fluid/adversarial/advbox/attacks/deepfool.py index 2f2da63059..abf2292cf3 100644 --- a/fluid/adversarial/advbox/attacks/deepfool.py +++ b/fluid/adversarial/advbox/attacks/deepfool.py @@ -10,6 +10,8 @@ from .base import Attack +__all__ = ['DeepFoolAttack'] + class DeepFoolAttack(Attack): """ @@ -56,7 +58,7 @@ def _apply(self, adversary, iterations=100, overshoot=0.02): gradient_k = self.model.gradient(x, k) w_k = gradient_k - gradient f_k = f[k] - f[pre_label] - w_k_norm = np.linalg.norm(w_k) + 1e-8 + w_k_norm = np.linalg.norm(w_k.flatten()) + 1e-8 pert_k = (np.abs(f_k) + 1e-8) / w_k_norm if pert_k < pert: pert = pert_k @@ -70,9 +72,12 @@ def _apply(self, adversary, iterations=100, overshoot=0.02): f = self.model.predict(x) gradient = self.model.gradient(x, pre_label) adv_label = np.argmax(f) - logging.info('iteration = {}, f = {}, pre_label = {}' - ', adv_label={}'.format(iteration, f[pre_label], - pre_label, adv_label)) + logging.info('iteration={}, f[pre_label]={}, f[target_label]={}' + ', f[adv_label]={}, pre_label={}, adv_label={}' + ''.format(iteration, f[pre_label], ( + f[adversary.target_label] + if adversary.is_targeted_attack else 'NaN'), f[ + adv_label], pre_label, adv_label)) if adversary.try_accept_the_example(x, adv_label): return adversary diff --git a/fluid/adversarial/advbox/attacks/gradient_method.py b/fluid/adversarial/advbox/attacks/gradient_method.py new file mode 100644 index 0000000000..25b828d412 --- /dev/null +++ b/fluid/adversarial/advbox/attacks/gradient_method.py @@ -0,0 +1,170 @@ +""" +This module provide the attack method for Iterator FGSM's implement. +""" +from __future__ import division + +import logging +from collections import Iterable + +import numpy as np + +from .base import Attack + +__all__ = [ + 'GradientMethodAttack', 'FastGradientSignMethodAttack', 'FGSM', + 'FastGradientSignMethodTargetedAttack', 'FGSMT', + 'BasicIterativeMethodAttack', 'BIM', + 'IterativeLeastLikelyClassMethodAttack', 'ILCM' +] + + +class GradientMethodAttack(Attack): + """ + This class implements gradient attack method, and is the base of FGSM, BIM, + ILCM, etc. 
+ """ + + def __init__(self, model, support_targeted=True): + """ + :param model(model): The model to be attacked. + :param support_targeted(bool): Does this attack method support targeted. + """ + super(GradientMethodAttack, self).__init__(model) + self.support_targeted = support_targeted + + def _apply(self, adversary, norm_ord=np.inf, epsilons=0.01, steps=100): + """ + Apply the gradient attack method. + :param adversary(Adversary): + The Adversary object. + :param norm_ord(int): + Order of the norm, such as np.inf, 1, 2, etc. It can't be 0. + :param epsilons(list|tuple|int): + Attack step size (input variation). + :param steps: + The number of iterator steps. + :return: + adversary(Adversary): The Adversary object. + """ + if norm_ord == 0: + raise ValueError("L0 norm is not supported!") + + if not self.support_targeted: + if adversary.is_targeted_attack: + raise ValueError( + "This attack method doesn't support targeted attack!") + + if not isinstance(epsilons, Iterable): + epsilons = np.linspace(epsilons, epsilons + 1e-10, num=steps) + + pre_label = adversary.original_label + min_, max_ = self.model.bounds() + + assert self.model.channel_axis() == adversary.original.ndim + assert (self.model.channel_axis() == 1 or + self.model.channel_axis() == adversary.original.shape[0] or + self.model.channel_axis() == adversary.original.shape[-1]) + + step = 1 + adv_img = adversary.original + for epsilon in epsilons[:steps]: + if epsilon == 0.0: + continue + if adversary.is_targeted_attack: + gradient = -self.model.gradient(adv_img, adversary.target_label) + else: + gradient = self.model.gradient(adv_img, + adversary.original_label) + if norm_ord == np.inf: + gradient_norm = np.sign(gradient) + else: + gradient_norm = gradient / self._norm(gradient, ord=norm_ord) + + adv_img = adv_img + epsilon * gradient_norm * (max_ - min_) + adv_img = np.clip(adv_img, min_, max_) + adv_label = np.argmax(self.model.predict(adv_img)) + logging.info('step={}, epsilon = {:.5f}, pre_label = {}, ' + 'adv_label={}'.format(step, epsilon, pre_label, + adv_label)) + if adversary.try_accept_the_example(adv_img, adv_label): + return adversary + step += 1 + return adversary + + @staticmethod + def _norm(a, ord): + if a.ndim == 1: + return np.linalg.norm(a, ord=ord) + if a.ndim == a.shape[0]: + norm_shape = (a.ndim, reduce(np.dot, a.shape[1:])) + norm_axis = 1 + else: + norm_shape = (reduce(np.dot, a.shape[:-1]), a.ndim) + norm_axis = 0 + return np.linalg.norm(a.reshape(norm_shape), ord=ord, axis=norm_axis) + + +class FastGradientSignMethodTargetedAttack(GradientMethodAttack): + """ + "Fast Gradient Sign Method" is extended to support targeted attack. + "Fast Gradient Sign Method" was originally implemented by Goodfellow et + al. (2015) with the infinity norm. + + Paper link: https://arxiv.org/abs/1412.6572 + """ + + def _apply(self, adversary, epsilons=0.03): + return GradientMethodAttack._apply( + self, + adversary=adversary, + norm_ord=np.inf, + epsilons=epsilons, + steps=1) + + +class FastGradientSignMethodAttack(FastGradientSignMethodTargetedAttack): + """ + This attack was originally implemented by Goodfellow et al. (2015) with the + infinity norm, and is known as the "Fast Gradient Sign Method". + + Paper link: https://arxiv.org/abs/1412.6572 + """ + + def __init__(self, model): + super(FastGradientSignMethodAttack, self).__init__(model, False) + + +class IterativeLeastLikelyClassMethodAttack(GradientMethodAttack): + """ + "Iterative Least-likely Class Method (ILCM)" extends "BIM" to support + targeted attack. 
+ "The Basic Iterative Method (BIM)" is to extend "FSGM". "BIM" iteratively + take multiple small steps while adjusting the direction after each step. + + Paper link: https://arxiv.org/abs/1607.02533 + """ + + def _apply(self, adversary, epsilons=0.001, steps=1000): + return GradientMethodAttack._apply( + self, + adversary=adversary, + norm_ord=np.inf, + epsilons=epsilons, + steps=steps) + + +class BasicIterativeMethodAttack(IterativeLeastLikelyClassMethodAttack): + """ + FGSM is a one-step method. "The Basic Iterative Method (BIM)" iteratively + take multiple small steps while adjusting the direction after each step. + Paper link: https://arxiv.org/abs/1607.02533 + """ + + def __init__(self, model): + super(BasicIterativeMethodAttack, self).__init__(model, False) + + +FGSM = FastGradientSignMethodAttack +FGSMT = FastGradientSignMethodTargetedAttack +BIM = BasicIterativeMethodAttack +ILCM = IterativeLeastLikelyClassMethodAttack diff --git a/fluid/adversarial/advbox/attacks/gradientsign.py b/fluid/adversarial/advbox/attacks/gradientsign.py deleted file mode 100644 index 5909fef5c8..0000000000 --- a/fluid/adversarial/advbox/attacks/gradientsign.py +++ /dev/null @@ -1,60 +0,0 @@ -""" -This module provide the attack method for FGSM's implement. -""" -from __future__ import division - -import logging -from collections import Iterable - -import numpy as np - -from .base import Attack - - -class GradientSignAttack(Attack): - """ - This attack was originally implemented by Goodfellow et al. (2015) with the - infinity norm (and is known as the "Fast Gradient Sign Method"). - This is therefore called the Fast Gradient Method. - Paper link: https://arxiv.org/abs/1412.6572 - """ - - def _apply(self, adversary, epsilons=1000): - """ - Apply the gradient sign attack. - Args: - adversary(Adversary): The Adversary object. - epsilons(list|tuple|int): The epsilon (input variation parameter). - Return: - adversary: The Adversary object. - """ - assert adversary is not None - - if not isinstance(epsilons, Iterable): - epsilons = np.linspace(0, 1, num=epsilons + 1)[1:] - - pre_label = adversary.original_label - min_, max_ = self.model.bounds() - - if adversary.is_targeted_attack: - gradient = self.model.gradient(adversary.original, - adversary.target_label) - gradient_sign = -np.sign(gradient) * (max_ - min_) - else: - gradient = self.model.gradient(adversary.original, - adversary.original_label) - gradient_sign = np.sign(gradient) * (max_ - min_) - - for epsilon in epsilons: - adv_img = adversary.original + epsilon * gradient_sign - adv_img = np.clip(adv_img, min_, max_) - adv_label = np.argmax(self.model.predict(adv_img)) - logging.info('epsilon = {:.3f}, pre_label = {}, adv_label={}'. - format(epsilon, pre_label, adv_label)) - if adversary.try_accept_the_example(adv_img, adv_label): - return adversary - - return adversary - - -FGSM = GradientSignAttack diff --git a/fluid/adversarial/advbox/attacks/iterator_gradientsign.py b/fluid/adversarial/advbox/attacks/iterator_gradientsign.py deleted file mode 100644 index ac2ef8142a..0000000000 --- a/fluid/adversarial/advbox/attacks/iterator_gradientsign.py +++ /dev/null @@ -1,59 +0,0 @@ -""" -This module provide the attack method for Iterator FGSM's implement. -""" -from __future__ import division - -import logging -from collections import Iterable - -import numpy as np - -from .base import Attack - - -class IteratorGradientSignAttack(Attack): - """ - This attack was originally implemented by Alexey Kurakin(Google Brain). 
- Paper link: https://arxiv.org/pdf/1607.02533.pdf - """ - - def _apply(self, adversary, epsilons=100, steps=10): - """ - Apply the iterative gradient sign attack. - Args: - adversary(Adversary): The Adversary object. - epsilons(list|tuple|int): The epsilon (input variation parameter). - steps(int): The number of iterator steps. - Return: - adversary(Adversary): The Adversary object. - """ - - if not isinstance(epsilons, Iterable): - epsilons = np.linspace(0, 1 / steps, num=epsilons + 1)[1:] - - pre_label = adversary.original_label - min_, max_ = self.model.bounds() - - for epsilon in epsilons: - adv_img = adversary.original - for _ in range(steps): - if adversary.is_targeted_attack: - gradient = self.model.gradient(adversary.original, - adversary.target_label) - gradient_sign = -np.sign(gradient) * (max_ - min_) - else: - gradient = self.model.gradient(adversary.original, - adversary.original_label) - gradient_sign = np.sign(gradient) * (max_ - min_) - adv_img = adv_img + gradient_sign * epsilon - adv_img = np.clip(adv_img, min_, max_) - adv_label = np.argmax(self.model.predict(adv_img)) - logging.info('epsilon = {:.3f}, pre_label = {}, adv_label={}'. - format(epsilon, pre_label, adv_label)) - if adversary.try_accept_the_example(adv_img, adv_label): - return adversary - - return adversary - - -IFGSM = IteratorGradientSignAttack diff --git a/fluid/adversarial/advbox/attacks/lbfgs.py b/fluid/adversarial/advbox/attacks/lbfgs.py new file mode 100644 index 0000000000..b427df1d97 --- /dev/null +++ b/fluid/adversarial/advbox/attacks/lbfgs.py @@ -0,0 +1,138 @@ +""" +This module provide the attack method of "LBFGS". +""" +from __future__ import division + +import logging + +import numpy as np +from scipy.optimize import fmin_l_bfgs_b + +from .base import Attack + +__all__ = ['LBFGSAttack', 'LBFGS'] + + +class LBFGSAttack(Attack): + """ + Uses L-BFGS-B to minimize the cross-entropy and the distance between the + original and the adversary. + + Paper link: https://arxiv.org/abs/1510.05328 + """ + + def __init__(self, model): + super(LBFGSAttack, self).__init__(model) + self._predicts_normalized = None + self._adversary = None # type: Adversary + + def _apply(self, adversary, epsilon=0.001, steps=10): + self._adversary = adversary + + if not adversary.is_targeted_attack: + raise ValueError("This attack method only support targeted attack!") + + # finding initial c + logging.info('finding initial c...') + c = epsilon + x0 = adversary.original.flatten() + for i in range(30): + c = 2 * c + logging.info('c={}'.format(c)) + is_adversary = self._lbfgsb(x0, c, steps) + if is_adversary: + break + if not is_adversary: + logging.info('Failed!') + return adversary + + # binary search c + logging.info('binary search c...') + c_low = 0 + c_high = c + while c_high - c_low >= epsilon: + logging.info('c_high={}, c_low={}, diff={}, epsilon={}' + .format(c_high, c_low, c_high - c_low, epsilon)) + c_half = (c_low + c_high) / 2 + is_adversary = self._lbfgsb(x0, c_half, steps) + if is_adversary: + c_high = c_half + else: + c_low = c_half + + return adversary + + def _is_predicts_normalized(self, predicts): + """ + To determine the predicts is normalized. + :param predicts(np.array): the output of the model. 
+ :return: bool + """ + if self._predicts_normalized is None: + if self.model.predict_name().lower() in [ + 'softmax', 'probabilities', 'probs' + ]: + self._predicts_normalized = True + else: + if np.any(predicts < 0.0): + self._predicts_normalized = False + else: + s = np.sum(predicts.flatten()) + if 0.999 <= s <= 1.001: + self._predicts_normalized = True + else: + self._predicts_normalized = False + assert self._predicts_normalized is not None + return self._predicts_normalized + + def _loss(self, adv_x, c): + """ + To get the loss and gradient. + :param adv_x: the candidate adversarial example + :param c: parameter 'C' in the paper + :return: (loss, gradient) + """ + x = adv_x.reshape(self._adversary.original.shape) + + # cross_entropy + logits = self.model.predict(x) + if not self._is_predicts_normalized(logits): # to softmax + e = np.exp(logits) + logits = e / np.sum(e) + e = np.exp(logits) + s = np.sum(e) + ce = np.log(s) - logits[self._adversary.target_label] + + # L2 distance + min_, max_ = self.model.bounds() + d = np.sum((x - self._adversary.original).flatten() ** 2) \ + / ((max_ - min_) ** 2) / len(adv_x) + + # gradient + gradient = self.model.gradient(x, self._adversary.target_label) + + result = (c * ce + d).astype(float), gradient.flatten().astype(float) + return result + + def _lbfgsb(self, x0, c, maxiter): + min_, max_ = self.model.bounds() + bounds = [(min_, max_)] * len(x0) + approx_grad_eps = (max_ - min_) / 100.0 + x, f, d = fmin_l_bfgs_b( + self._loss, + x0, + args=(c, ), + bounds=bounds, + maxiter=maxiter, + epsilon=approx_grad_eps) + if np.amax(x) > max_ or np.amin(x) < min_: + x = np.clip(x, min_, max_) + shape = self._adversary.original.shape + adv_label = np.argmax(self.model.predict(x.reshape(shape))) + logging.info('pre_label = {}, adv_label={}'.format( + self._adversary.target_label, adv_label)) + return self._adversary.try_accept_the_example( + x.reshape(shape), adv_label) + + +LBFGS = LBFGSAttack diff --git a/fluid/adversarial/advbox/attacks/saliency.py b/fluid/adversarial/advbox/attacks/saliency.py new file mode 100644 index 0000000000..3179f0ffe6 --- /dev/null +++ b/fluid/adversarial/advbox/attacks/saliency.py @@ -0,0 +1,146 @@ +""" +This module provide the attack method for JSMA's implement. +""" +from __future__ import division + +import logging +import random +import numpy as np + +from .base import Attack + + +class SaliencyMapAttack(Attack): + """ + Implements the Saliency Map Attack. + The Jacobian-based Saliency Map Approach (Papernot et al. 2016). + Paper link: https://arxiv.org/pdf/1511.07528.pdf + """ + + def _apply(self, + adversary, + max_iter=2000, + fast=True, + theta=0.1, + max_perturbations_per_pixel=7): + """ + Apply the JSMA attack. + Args: + adversary(Adversary): The Adversary object. + max_iter(int): The max iterations. + fast(bool): Whether evaluate the pixel influence on sum of residual classes. + theta(float): Perturbation per pixel relative to [min, max] range. + max_perturbations_per_pixel(int): The max count of perturbation per pixel. + Return: + adversary: The Adversary object. 
+ """ + assert adversary is not None + + if not adversary.is_targeted_attack or (adversary.target_label is None): + target_labels = self._generate_random_target( + adversary.original_label) + else: + target_labels = [adversary.target_label] + + for target in target_labels: + original_image = adversary.original + + # the mask defines the search domain + # each modified pixel with border value is set to zero in mask + mask = np.ones_like(original_image) + + # count tracks how often each pixel was changed + counts = np.zeros_like(original_image) + + labels = range(self.model.num_classes()) + adv_img = original_image.copy() + min_, max_ = self.model.bounds() + + for step in range(max_iter): + adv_img = np.clip(adv_img, min_, max_) + adv_label = np.argmax(self.model.predict(adv_img)) + if adversary.try_accept_the_example(adv_img, adv_label): + return adversary + + # stop if mask is all zero + if not any(mask.flatten()): + return adversary + + logging.info('step = {}, original_label = {}, adv_label={}'. + format(step, adversary.original_label, adv_label)) + + # get pixel location with highest influence on class + idx, p_sign = self._saliency_map( + adv_img, target, labels, mask, fast=fast) + + # apply perturbation + adv_img[idx] += -p_sign * theta * (max_ - min_) + + # tracks number of updates for each pixel + counts[idx] += 1 + + # remove pixel from search domain if it hits the bound + if adv_img[idx] <= min_ or adv_img[idx] >= max_: + mask[idx] = 0 + + # remove pixel if it was changed too often + if counts[idx] >= max_perturbations_per_pixel: + mask[idx] = 0 + + adv_img = np.clip(adv_img, min_, max_) + + def _generate_random_target(self, original_label): + """ + Draw random target labels all of which are different and not the original label. + Args: + original_label(int): Original label. + Return: + target_labels(list): random target labels + """ + num_random_target = 1 + num_classes = self.model.num_classes() + assert num_random_target <= num_classes - 1 + + target_labels = random.sample(range(num_classes), num_random_target + 1) + target_labels = [t for t in target_labels if t != original_label] + target_labels = target_labels[:num_random_target] + + return target_labels + + def _saliency_map(self, image, target, labels, mask, fast=False): + """ + Get pixel location with highest influence on class. + Args: + image(numpy.ndarray): Image with shape (height, width, channels). + target(int): The target label. + labels(int): The number of classes of the output label. + mask(list): Each modified pixel with border value is set to zero in mask. + fast(bool): Whether evaluate the pixel influence on sum of residual classes. + Return: + idx: The index of optimal pixel. + pix_sign: The direction of perturbation + """ + # pixel influence on target class + alphas = self.model.gradient(image, target) * mask + + # pixel influence on sum of residual classes(don't evaluate if fast == True) + if fast: + betas = -np.ones_like(alphas) + else: + betas = np.sum([ + self.model.gradient(image, label) * mask - alphas + for label in labels + ], 0) + + # compute saliency map (take into account both pos. & neg. 
perturbations) + sal_map = np.abs(alphas) * np.abs(betas) * np.sign(alphas * betas) + + # find optimal pixel & direction of perturbation + idx = np.argmin(sal_map) + idx = np.unravel_index(idx, mask.shape) + pix_sign = np.sign(alphas)[idx] + + return idx, pix_sign + + +JSMA = SaliencyMapAttack diff --git a/fluid/adversarial/advbox/models/__init__.py b/fluid/adversarial/advbox/models/__init__.py index 46d0fea90e..de6d2a9fee 100644 --- a/fluid/adversarial/advbox/models/__init__.py +++ b/fluid/adversarial/advbox/models/__init__.py @@ -1,5 +1,3 @@ """ -Paddle model for target of attack -""" -from .base import Model -from .paddle import PaddleModel +Models __init__.py +""" \ No newline at end of file diff --git a/fluid/adversarial/advbox/models/base.py b/fluid/adversarial/advbox/models/base.py index 142c7f054a..f25d4e305d 100644 --- a/fluid/adversarial/advbox/models/base.py +++ b/fluid/adversarial/advbox/models/base.py @@ -24,11 +24,21 @@ def __init__(self, bounds, channel_axis, preprocess=None): assert len(bounds) == 2 assert channel_axis in [0, 1, 2, 3] - if preprocess is None: - preprocess = (0, 1) self._bounds = bounds self._channel_axis = channel_axis - self._preprocess = preprocess + + # Make self._preprocess to be (0,1) if possible, so that don't need + # to do substract or divide. + if preprocess is not None: + sub, div = np.array(preprocess) + if not np.any(sub): + sub = 0 + if np.all(div == 1): + div = 1 + assert (div is None) or np.all(div) + self._preprocess = (sub, div) + else: + self._preprocess = (0, 1) def bounds(self): """ @@ -47,8 +57,7 @@ def _process_input(self, input_): sub, div = self._preprocess if np.any(sub != 0): res = input_ - sub - assert np.any(div != 0) - if np.any(div != 1): + if not np.all(sub == 1): if res is None: # "res = input_ - sub" is not executed! res = input_ / div else: @@ -97,3 +106,11 @@ def gradient(self, data, label): with the shape (height, width, channel). """ raise NotImplementedError + + @abstractmethod + def predict_name(self): + """ + Get the predict name, such as "softmax",etc. + :return: string + """ + raise NotImplementedError diff --git a/fluid/adversarial/advbox/models/paddle.py b/fluid/adversarial/advbox/models/paddle.py index 3a25dba40a..73439d2a4e 100644 --- a/fluid/adversarial/advbox/models/paddle.py +++ b/fluid/adversarial/advbox/models/paddle.py @@ -4,7 +4,7 @@ from __future__ import absolute_import import numpy as np -import paddle.v2.fluid as fluid +import paddle.fluid as fluid from .base import Model @@ -16,7 +16,7 @@ class PaddleModel(Model): instance of PaddleModel. Args: - program(paddle.v2.fluid.framework.Program): The program of the model + program(paddle.fluid.framework.Program): The program of the model which generate the adversarial sample. input_name(string): The name of the input. logits_name(string): The name of the logits. @@ -114,3 +114,10 @@ def gradient(self, data, label): feed=feeder.feed([(scaled_data, label)]), fetch_list=[self._gradient]) return grad.reshape(data.shape) + + def predict_name(self): + """ + Get the predict name, such as "softmax",etc. 
+ :return: string + """ + return self._program.block(0).var(self._predict_name).op.type diff --git a/fluid/adversarial/fluid_mnist.py b/fluid/adversarial/fluid_mnist.py index db4d4b5186..edeb6b0269 100644 --- a/fluid/adversarial/fluid_mnist.py +++ b/fluid/adversarial/fluid_mnist.py @@ -2,7 +2,7 @@ CNN on mnist data using fluid api of paddlepaddle """ import paddle.v2 as paddle -import paddle.v2.fluid as fluid +import paddle.fluid as fluid def mnist_cnn_model(img): @@ -47,7 +47,9 @@ def main(): optimizer = fluid.optimizer.Adam(learning_rate=0.01) optimizer.minimize(avg_cost) - accuracy = fluid.evaluator.Accuracy(input=logits, label=label) + batch_size = fluid.layers.create_tensor(dtype='int64') + batch_acc = fluid.layers.accuracy( + input=logits, label=label, total=batch_size) BATCH_SIZE = 50 PASS_NUM = 3 @@ -63,20 +65,22 @@ def main(): feeder = fluid.DataFeeder(feed_list=[img, label], place=place) exe.run(fluid.default_startup_program()) + pass_acc = fluid.average.WeightedAverage() for pass_id in range(PASS_NUM): - accuracy.reset(exe) + pass_acc.reset() for data in train_reader(): - loss, acc = exe.run(fluid.default_main_program(), - feed=feeder.feed(data), - fetch_list=[avg_cost] + accuracy.metrics) - pass_acc = accuracy.eval(exe) - print("pass_id=" + str(pass_id) + " acc=" + str(acc) + " pass_acc=" - + str(pass_acc)) + loss, acc, b_size = exe.run( + fluid.default_main_program(), + feed=feeder.feed(data), + fetch_list=[avg_cost, batch_acc, batch_size]) + pass_acc.add(value=acc, weight=b_size) + print("pass_id=" + str(pass_id) + " acc=" + str(acc[0]) + + " pass_acc=" + str(pass_acc.eval()[0])) if loss < LOSS_THRESHOLD and pass_acc > ACC_THRESHOLD: break - pass_acc = accuracy.eval(exe) - print("pass_id=" + str(pass_id) + " pass_acc=" + str(pass_acc)) + print("pass_id=" + str(pass_id) + " pass_acc=" + str(pass_acc.eval()[ + 0])) fluid.io.save_params( exe, dirname='./mnist', main_program=fluid.default_main_program()) print('train mnist done') diff --git a/fluid/adversarial/mnist_tutorial_fgsm.py b/fluid/adversarial/mnist_tutorial_fgsm.py index 5da4bbfc43..ea3231695b 100644 --- a/fluid/adversarial/mnist_tutorial_fgsm.py +++ b/fluid/adversarial/mnist_tutorial_fgsm.py @@ -3,10 +3,10 @@ """ import matplotlib.pyplot as plt import paddle.v2 as paddle -import paddle.v2.fluid as fluid +import paddle.fluid as fluid -from advbox import Adversary -from advbox.attacks.gradientsign import GradientSignAttack +from advbox.adversary import Adversary +from advbox.attacks.gradient_method import FGSM from advbox.models.paddle import PaddleModel @@ -73,7 +73,7 @@ def main(): # advbox demo m = PaddleModel(fluid.default_main_program(), IMG_NAME, LABEL_NAME, logits.name, avg_cost.name, (-1, 1)) - att = GradientSignAttack(m) + att = FGSM(m) for data in train_reader(): # fgsm attack adversary = att(Adversary(data[0][0], data[0][1])) diff --git a/fluid/adversarial/mnist_tutorial_jsma.py b/fluid/adversarial/mnist_tutorial_jsma.py new file mode 100644 index 0000000000..d9db8b712c --- /dev/null +++ b/fluid/adversarial/mnist_tutorial_jsma.py @@ -0,0 +1,97 @@ +""" +FGSM demos on mnist using advbox tool. 
+""" +import matplotlib.pyplot as plt +import paddle.v2 as paddle +import paddle.fluid as fluid +import numpy as np + +from advbox import Adversary +from advbox.attacks.saliency import SaliencyMapAttack +from advbox.models.paddle import PaddleModel + + +def cnn_model(img): + """ + Mnist cnn model + Args: + img(Varaible): the input image to be recognized + Returns: + Variable: the label prediction + """ + # conv1 = fluid.nets.conv2d() + conv_pool_1 = fluid.nets.simple_img_conv_pool( + input=img, + num_filters=20, + filter_size=5, + pool_size=2, + pool_stride=2, + act='relu') + + conv_pool_2 = fluid.nets.simple_img_conv_pool( + input=conv_pool_1, + num_filters=50, + filter_size=5, + pool_size=2, + pool_stride=2, + act='relu') + + logits = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax') + return logits + + +def main(): + """ + Advbox demo which demonstrate how to use advbox. + """ + IMG_NAME = 'img' + LABEL_NAME = 'label' + + img = fluid.layers.data(name=IMG_NAME, shape=[1, 28, 28], dtype='float32') + # gradient should flow + img.stop_gradient = False + label = fluid.layers.data(name=LABEL_NAME, shape=[1], dtype='int64') + logits = cnn_model(img) + cost = fluid.layers.cross_entropy(input=logits, label=label) + avg_cost = fluid.layers.mean(x=cost) + + place = fluid.CPUPlace() + exe = fluid.Executor(place) + + BATCH_SIZE = 1 + train_reader = paddle.batch( + paddle.reader.shuffle( + paddle.dataset.mnist.train(), buf_size=500), + batch_size=BATCH_SIZE) + feeder = fluid.DataFeeder( + feed_list=[IMG_NAME, LABEL_NAME], + place=place, + program=fluid.default_main_program()) + + fluid.io.load_params( + exe, "./mnist/", main_program=fluid.default_main_program()) + + # advbox demo + m = PaddleModel(fluid.default_main_program(), IMG_NAME, LABEL_NAME, + logits.name, avg_cost.name, (-1, 1)) + attack = SaliencyMapAttack(m) + total_num = 0 + success_num = 0 + for data in train_reader(): + total_num += 1 + # adversary.set_target(True, target_label=target_label) + jsma_attack = attack(Adversary(data[0][0], data[0][1])) + if jsma_attack is not None and jsma_attack.is_successful(): + # plt.imshow(jsma_attack.target, cmap='Greys_r') + # plt.show() + success_num += 1 + print('original_label=%d, adversary examples label =%d' % + (data[0][1], jsma_attack.adversarial_label)) + # np.save('adv_img', jsma_attack.adversarial_example) + print('total num = %d, success num = %d ' % (total_num, success_num)) + if total_num == 100: + break + + +if __name__ == '__main__': + main() diff --git a/fluid/image_classification/README.md b/fluid/image_classification/README.md index 3d9f340b3e..b950fbe1a7 100644 --- a/fluid/image_classification/README.md +++ b/fluid/image_classification/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). 
+ +--- + # SE-ResNeXt for image classification This model built with paddle fluid is still under active development and is not diff --git a/fluid/image_classification/caffe2fluid/README.md b/fluid/image_classification/caffe2fluid/README.md new file mode 100644 index 0000000000..5f565afe0c --- /dev/null +++ b/fluid/image_classification/caffe2fluid/README.md @@ -0,0 +1,36 @@ +### Caffe2Fluid +This tool is used to convert a Caffe model to Fluid model + +### Howto +1, Prepare caffepb.py in ./proto if your python has no 'pycaffe' module, two options provided here: + + 1) generate it from caffe.proto using protoc + bash ./proto/compile.sh + + 2) download one from github directly + cd proto/ && wget https://github.com/ethereon/caffe-tensorflow/blob/master/kaffe/caffe/caffepb.py + +2, Convert the caffe model using 'convert.py' which will generate a python script and a weight(in .npy) file + +3, Use the converted model to predict + + see more detail info in 'examples/xxx' + + +### Tested models +- Lenet on mnist dataset + +- ResNets:(ResNet-50, ResNet-101, ResNet-152) + model addr: `https://onedrive.live.com/?authkey=%21AAFW2-FVoxeVRck&id=4006CBB8476FF777%2117887&cid=4006CBB8476FF777`_ + +- GoogleNet: + model addr: `https://gist.github.com/jimmie33/7ea9f8ac0da259866b854460f4526034`_ + +- VGG: + model addr: `https://gist.github.com/ksimonyan/211839e770f7b538e2d8`_ + +- AlexNet: + model addr: `https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet`_ + +### Notes +Some of this code come from here: https://github.com/ethereon/caffe-tensorflow diff --git a/fluid/image_classification/caffe2fluid/convert.py b/fluid/image_classification/caffe2fluid/convert.py new file mode 100755 index 0000000000..379f1a2636 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/convert.py @@ -0,0 +1,75 @@ +#!/usr/bin/env python + +import os +import sys +import numpy as np +import argparse + +from kaffe import KaffeError, print_stderr +from kaffe.paddle import Transformer + + +def fatal_error(msg): + """ fatal error encounted + """ + print_stderr(msg) + exit(-1) + + +def validate_arguments(args): + """ validate args + """ + if (args.data_output_path is not None) and (args.caffemodel is None): + fatal_error('No input data path provided.') + if (args.caffemodel is not None) and (args.data_output_path is None): + fatal_error('No output data path provided.') + if (args.code_output_path is None) and (args.data_output_path is None): + fatal_error('No output path specified.') + + +def convert(def_path, caffemodel_path, data_output_path, code_output_path, + phase): + """ convert caffe model to tf/paddle models + """ + try: + transformer = Transformer(def_path, caffemodel_path, phase=phase) + print_stderr('Converting data...') + if caffemodel_path is not None: + data = transformer.transform_data() + print_stderr('Saving data...') + with open(data_output_path, 'wb') as data_out: + np.save(data_out, data) + if code_output_path: + print_stderr('Saving source...') + with open(code_output_path, 'wb') as src_out: + src_out.write(transformer.transform_source()) + print_stderr('Done.') + except KaffeError as err: + fatal_error('Error encountered: {}'.format(err)) + + return 0 + + +def main(): + """ main + """ + parser = argparse.ArgumentParser() + parser.add_argument('def_path', help='Model definition (.prototxt) path') + parser.add_argument('--caffemodel', help='Model data (.caffemodel) path') + parser.add_argument('--data-output-path', help='Converted data output path') + parser.add_argument( + '--code-output-path', help='Save 
generated source to this path') + parser.add_argument( + '-p', + '--phase', + default='test', + help='The phase to convert: test (default) or train') + args = parser.parse_args() + validate_arguments(args) + return convert(args.def_path, args.caffemodel, args.data_output_path, + args.code_output_path, args.phase) + + +if __name__ == '__main__': + ret = main() + sys.exit(ret) diff --git a/fluid/image_classification/caffe2fluid/examples/imagenet/README.md b/fluid/image_classification/caffe2fluid/examples/imagenet/README.md new file mode 100644 index 0000000000..b820508592 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/imagenet/README.md @@ -0,0 +1,10 @@ +a demo to show converting caffe models on 'imagenet' using caffe2fluid + +--- + +# How to use + +1. prepare python environment +2. download caffe model to "models.caffe/xxx" which contains "xxx.caffemodel" and "xxx.prototxt" +3. run the tool + eg: bash ./run.sh resnet50 ./models.caffe/resnet50 ./models/resnet50 diff --git a/fluid/image_classification/caffe2fluid/examples/imagenet/data/65.jpeg b/fluid/image_classification/caffe2fluid/examples/imagenet/data/65.jpeg new file mode 100644 index 0000000000..fd3a93f593 Binary files /dev/null and b/fluid/image_classification/caffe2fluid/examples/imagenet/data/65.jpeg differ diff --git a/fluid/image_classification/caffe2fluid/examples/imagenet/infer.py b/fluid/image_classification/caffe2fluid/examples/imagenet/infer.py new file mode 100644 index 0000000000..ec594199be --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/imagenet/infer.py @@ -0,0 +1,142 @@ +#!/bin/env python + +#function: +# a demo to show how to use the converted model genereated by caffe2fluid +# +#notes: +# only support imagenet data + +import os +import sys +import inspect +import numpy as np +import paddle.v2 as paddle +import paddle.v2.fluid as fluid + + +def load_data(imgfile, shape): + h, w = shape[1:] + from PIL import Image + im = Image.open(imgfile) + + # The storage order of the loaded image is W(widht), + # H(height), C(channel). PaddlePaddle requires + # the CHW order, so transpose them. + im = im.resize((w, h), Image.ANTIALIAS) + im = np.array(im).astype(np.float32) + im = im.transpose((2, 0, 1)) # CHW + im = im[(2, 1, 0), :, :] # BGR + + # The mean to be subtracted from each image. + # By default, the per-channel ImageNet mean. 
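+    # (values are in BGR order, consistent with the channel swap above)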
+ mean = np.array([104., 117., 124.], dtype=np.float32) + mean = mean.reshape([3, 1, 1]) + im = im - mean + return im.reshape([1] + shape) + + +def build_model(net_file, net_name): + print('build model with net_file[%s] and net_name[%s]' % + (net_file, net_name)) + + net_path = os.path.dirname(net_file) + module_name = os.path.basename(net_file).rstrip('.py') + if net_path not in sys.path: + sys.path.insert(0, net_path) + + try: + m = __import__(module_name, fromlist=[net_name]) + MyNet = getattr(m, net_name) + except Exception as e: + print('failed to load module[%s]' % (module_name)) + print(e) + return None + + input_name = 'data' + input_shape = MyNet.input_shapes()[input_name] + images = fluid.layers.data(name='image', shape=input_shape, dtype='float32') + #label = fluid.layers.data(name='label', shape=[1], dtype='int64') + + net = MyNet({input_name: images}) + input_shape = MyNet.input_shapes()[input_name] + return net, input_shape + + +def dump_results(results, names, root): + if os.path.exists(root) is False: + os.path.mkdir(root) + + for i in range(len(names)): + n = names[i] + res = results[i] + filename = os.path.join(root, n) + np.save(filename + '.npy', res) + + +def infer(net_file, net_name, model_file, imgfile, debug=False): + """ do inference using a model which consist 'xxx.py' and 'xxx.npy' + """ + #1, build model + net, input_shape = build_model(net_file, net_name) + prediction = net.get_output() + + #2, load weights for this model + place = fluid.CPUPlace() + exe = fluid.Executor(place) + startup_program = fluid.default_startup_program() + exe.run(startup_program) + + if model_file.find('.npy') > 0: + net.load(data_path=model_file, exe=exe, place=place) + else: + net.load(data_path=model_file, exe=exe) + + #3, test this model + test_program = fluid.default_main_program().clone() + + fetch_list_var = [] + fetch_list_name = [] + if debug is False: + fetch_list_var.append(prediction) + else: + for k, v in net.layers.items(): + fetch_list_var.append(v) + fetch_list_name.append(k) + + np_images = load_data(imgfile, input_shape) + results = exe.run(program=test_program, + feed={'image': np_images}, + fetch_list=fetch_list_var) + + if debug is True: + dump_path = 'results.layers' + dump_results(results, fetch_list_name, dump_path) + print('all results dumped to [%s]' % (dump_path)) + else: + result = results[0] + print('predicted class:', np.argmax(result)) + + +if __name__ == "__main__": + """ maybe more convenient to use 'run.sh' to call this tool + """ + net_file = 'models/resnet50/resnet50.py' + weight_file = 'models/resnet50/resnet50.npy' + imgfile = 'data/65.jpeg' + net_name = 'ResNet50' + + argc = len(sys.argv) + if argc == 5: + net_file = sys.argv[1] + weight_file = sys.argv[2] + imgfile = sys.argv[3] + net_name = sys.argv[4] + elif argc > 1: + print('usage:') + print('\tpython %s [net_file] [weight_file] [imgfile] [net_name]' % + (sys.argv[0])) + print('\teg:python %s %s %s %s %s' % (sys.argv[0], net_file, + weight_file, imgfile, net_name)) + sys.exit(1) + + infer(net_file, net_name, weight_file, imgfile) diff --git a/fluid/image_classification/caffe2fluid/examples/imagenet/run.sh b/fluid/image_classification/caffe2fluid/examples/imagenet/run.sh new file mode 100644 index 0000000000..7a1a5ebd7c --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/imagenet/run.sh @@ -0,0 +1,72 @@ +#!/bin/bash + +#function: +# a tool used to: +# 1, convert a caffe model +# 2, do inference using this model +# +#usage: +# bash run.sh resnet50 ./models.caffe/resnet50 
./models/resnet50 +# + +#set -x +if [[ $# -lt 3 ]];then + echo "usage:" + echo " bash $0 [model_name] [cf_model_path] [pd_model_path] [only_convert]" + echo " eg: bash $0 resnet50 ./models.caffe/resnet50 ./models/resnet50" + exit 1 +else + model_name=$1 + cf_model_path=$2 + pd_model_path=$3 + only_convert=$4 +fi + +proto_file=$cf_model_path/${model_name}.prototxt +caffemodel_file=$cf_model_path/${model_name}.caffemodel +weight_file=$pd_model_path/${model_name}.npy +net_file=$pd_model_path/${model_name}.py + +if [[ ! -e $proto_file ]];then + echo "not found prototxt[$proto_file]" + exit 1 +fi + +if [[ ! -e $caffemodel_file ]];then + echo "not found caffemodel[$caffemodel_file]" + exit 1 +fi + +if [[ ! -e $pd_model_path ]];then + mkdir $pd_model_path +fi + +PYTHON=`which cfpython` +if [[ -z $PYTHON ]];then + PYTHON=`which python` +fi +$PYTHON ../../convert.py \ + $proto_file \ + --caffemodel $caffemodel_file \ + --data-output-path $weight_file\ + --code-output-path $net_file + +ret=$? +if [[ $ret -ne 0 ]];then + echo "failed to convert caffe model[$cf_model_path]" + exit $ret +else + echo "succeed to convert caffe model[$cf_model_path] to fluid model[$pd_model_path]" +fi + +if [[ -z $only_convert ]];then + PYTHON=`which pdpython` + if [[ -z $PYTHON ]];then + PYTHON=`which python` + fi + imgfile="data/65.jpeg" + net_name=`grep "name" $proto_file | head -n1 | perl -ne 'if(/\"([^\"]+)\"/){ print $1."\n";}'` + $PYTHON ./infer.py $net_file $weight_file $imgfile $net_name + ret=$? +fi +exit $ret diff --git a/fluid/image_classification/caffe2fluid/examples/mnist/README.md b/fluid/image_classification/caffe2fluid/examples/mnist/README.md new file mode 100644 index 0000000000..cd427d6327 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/mnist/README.md @@ -0,0 +1,10 @@ +a demo to show converting caffe model on 'mnist' using caffe2fluid + +--- + +# How to use + +1. prepare python environment +2. download caffe model to "models.caffe/lenet" which contains "lenet.caffemodel" and "lenet.prototxt" +3. 
run the tool + eg: bash ./run.sh lenet ./models.caffe/lenet ./models/lenet diff --git a/fluid/image_classification/caffe2fluid/examples/mnist/evaluate.py b/fluid/image_classification/caffe2fluid/examples/mnist/evaluate.py new file mode 100644 index 0000000000..5c86635d5a --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/mnist/evaluate.py @@ -0,0 +1,86 @@ +#!/bin/env python + +#function: +# demo to show how to use converted model using caffe2fluid +# + +import sys +import os +import numpy as np +import paddle.v2 as paddle +import paddle.v2.fluid as fluid + + +def test_model(exe, test_program, fetch_list, test_reader, feeder): + acc_set = [] + + for data in test_reader(): + acc_np, pred = exe.run(program=test_program, + feed=feeder.feed(data), + fetch_list=fetch_list) + acc_set.append(float(acc_np)) + + acc_val = np.array(acc_set).mean() + return float(acc_val) + + +def evaluate(net_file, model_file): + """ main + """ + #1, build model + net_path = os.path.dirname(net_file) + if net_path not in sys.path: + sys.path.insert(0, net_path) + + from lenet import LeNet as MyNet + + with_gpu = False + paddle.init(use_gpu=with_gpu) + + #1, define network topology + images = fluid.layers.data(name='image', shape=[1, 28, 28], dtype='float32') + label = fluid.layers.data(name='label', shape=[1], dtype='int64') + + net = MyNet({'data': images}) + prediction = net.layers['prob'] + acc = fluid.layers.accuracy(input=prediction, label=label) + + place = fluid.CUDAPlace(0) if with_gpu is True else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + + #2, load weights + if model_file.find('.npy') > 0: + net.load(data_path=model_file, exe=exe, place=place) + else: + net.load(data_path=model_file, exe=exe) + + #3, test this model + test_program = fluid.default_main_program().clone() + test_reader = paddle.batch(paddle.dataset.mnist.test(), batch_size=128) + + feeder = fluid.DataFeeder(feed_list=[images, label], place=place) + fetch_list = [acc, prediction] + + print('go to test model using test set') + acc_val = test_model(exe, test_program, \ + fetch_list, test_reader, feeder) + + print('test accuracy is [%.4f], expected value[0.919]' % (acc_val)) + + +if __name__ == "__main__": + net_file = 'models/lenet/lenet.py' + weight_file = 'models/lenet/lenet.npy' + + argc = len(sys.argv) + if argc == 3: + net_file = sys.argv[1] + weight_file = sys.argv[2] + elif argc > 1: + print('usage:') + print('\tpython %s [net_file] [weight_file]' % (sys.argv[0])) + print('\teg:python %s %s %s %s' % (sys.argv[0], net_file, weight_file)) + sys.exit(1) + + evaluate(net_file, weight_file) diff --git a/fluid/image_classification/caffe2fluid/examples/mnist/run.sh b/fluid/image_classification/caffe2fluid/examples/mnist/run.sh new file mode 100644 index 0000000000..eee83ef7ce --- /dev/null +++ b/fluid/image_classification/caffe2fluid/examples/mnist/run.sh @@ -0,0 +1,75 @@ +#!/bin/bash + +#function: +# a tool used to: +# 1, convert a caffe model +# 2, do inference using this model +# +#usage: +# bash run.sh lenet ./models.caffe/lenet ./models/lenet +# + +#set -x +if [[ $# -lt 3 ]];then + echo "usage:" + echo " bash $0 [model_name] [cf_model_path] [pd_model_path] [only_convert]" + echo " eg: bash $0 lenet ./models.caffe/lenet ./models/lenet" + exit 1 +else + model_name=$1 + cf_model_path=$2 + pd_model_path=$3 + no_eval=$4 +fi + +proto_file=$cf_model_path/${model_name}.prototxt +caffemodel_file=$cf_model_path/${model_name}.caffemodel +weight_file=$pd_model_path/${model_name}.npy 
+net_file=$pd_model_path/${model_name}.py + +if [[ ! -e $proto_file ]];then + echo "not found prototxt[$proto_file]" + exit 1 +fi + +if [[ ! -e $caffemodel_file ]];then + echo "not found caffemodel[$caffemodel_file]" + exit 1 +fi + +if [[ ! -e $pd_model_path ]];then + mkdir $pd_model_path +fi + +PYTHON=`which cfpython` +if [[ -z $PYTHON ]];then + PYTHON=`which python` +fi +$PYTHON ../../convert.py \ + $proto_file \ + --caffemodel $caffemodel_file \ + --data-output-path $weight_file\ + --code-output-path $net_file + +ret=$? +if [[ $ret -ne 0 ]];then + echo "failed to convert caffe model[$cf_model_path]" + exit $ret +else + echo "succeed to convert caffe model[$cf_model_path] to fluid model[$pd_model_path]" +fi + +if [[ -z $only_convert ]];then + PYTHON=`which pdpython` + if [[ -z $PYTHON ]];then + PYTHON=`which python` + fi + net_name=`grep "name" $proto_file | head -n1 | perl -ne 'if(/\"([^\"]+)\"/){ print $1."\n";}'` + if [[ $net_name != "LeNet" ]];then + echo "only support LeNet" + exit 1 + fi + $PYTHON ./evaluate.py $net_file $weight_file + ret=$? +fi +exit $ret diff --git a/fluid/image_classification/caffe2fluid/kaffe/__init__.py b/fluid/image_classification/caffe2fluid/kaffe/__init__.py new file mode 100644 index 0000000000..c11ce45c63 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/__init__.py @@ -0,0 +1,5 @@ +from .graph import GraphBuilder, NodeMapper +from .errors import KaffeError, print_stderr + +import os +from . import paddle diff --git a/fluid/image_classification/caffe2fluid/kaffe/caffe/__init__.py b/fluid/image_classification/caffe2fluid/kaffe/caffe/__init__.py new file mode 100644 index 0000000000..8d53dee29d --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/caffe/__init__.py @@ -0,0 +1 @@ +from .resolver import get_caffe_resolver, has_pycaffe diff --git a/fluid/image_classification/caffe2fluid/kaffe/caffe/resolver.py b/fluid/image_classification/caffe2fluid/kaffe/caffe/resolver.py new file mode 100644 index 0000000000..6ad7767ed8 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/caffe/resolver.py @@ -0,0 +1,60 @@ +import os +import sys + +SHARED_CAFFE_RESOLVER = None + + +def import_caffepb(): + p = os.path.realpath(__file__) + p = os.path.dirname(p) + p = os.path.join(p, '../../proto') + sys.path.insert(0, p) + import caffepb + return caffepb + + +class CaffeResolver(object): + def __init__(self): + self.import_caffe() + + def import_caffe(self): + self.caffe = None + try: + # Try to import PyCaffe first + import caffe + self.caffe = caffe + except ImportError: + # Fall back to the protobuf implementation + self.caffepb = import_caffepb() + show_fallback_warning() + if self.caffe: + # Use the protobuf code from the imported distribution. + # This way, Caffe variants with custom layers will work. + self.caffepb = self.caffe.proto.caffe_pb2 + self.NetParameter = self.caffepb.NetParameter + + def has_pycaffe(self): + return self.caffe is not None + + +def get_caffe_resolver(): + global SHARED_CAFFE_RESOLVER + if SHARED_CAFFE_RESOLVER is None: + SHARED_CAFFE_RESOLVER = CaffeResolver() + return SHARED_CAFFE_RESOLVER + + +def has_pycaffe(): + return get_caffe_resolver().has_pycaffe() + + +def show_fallback_warning(): + msg = ''' +------------------------------------------------------------ + WARNING: PyCaffe not found! + Falling back to a pure protocol buffer implementation. + * Conversions will be drastically slower. 
+------------------------------------------------------------ + +''' + sys.stderr.write(msg) diff --git a/fluid/image_classification/caffe2fluid/kaffe/errors.py b/fluid/image_classification/caffe2fluid/kaffe/errors.py new file mode 100644 index 0000000000..75eced5778 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/errors.py @@ -0,0 +1,34 @@ +import sys + +#debug level, can be 'warn', 'verbose' +log_level = 'warn' + + +class KaffeError(Exception): + pass + + +def print_stderr(msg): + sys.stderr.write('%s\n' % msg) + + +def debug(msg): + if log_level == 'verbose': + print_stderr('[DEBUG]' + msg) + + +def notice(msg): + print_stderr('[NOTICE]' + msg) + + +def warn(msg): + print_stderr('[WARNING]' + msg) + + +def set_loglevel(level): + global log_level + + if 'warn' != level and 'verbose' != level: + raise Exception('not supported log level[%s]' % (level)) + + log_level = level diff --git a/fluid/image_classification/caffe2fluid/kaffe/graph.py b/fluid/image_classification/caffe2fluid/kaffe/graph.py new file mode 100644 index 0000000000..5387f44185 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/graph.py @@ -0,0 +1,304 @@ +from google.protobuf import text_format + +from .caffe import get_caffe_resolver +from .errors import KaffeError, print_stderr +from .layers import LayerAdapter, LayerType, NodeKind, NodeDispatch +from .shapes import TensorShape + + +class Node(object): + def __init__(self, name, kind, layer=None): + self.name = name + self.kind = kind + self.layer = LayerAdapter(layer, kind) if layer else None + self.parents = [] + self.children = [] + self.data = None + self.output_shape = None + self.metadata = {} + + def add_parent(self, parent_node): + assert parent_node not in self.parents + self.parents.append(parent_node) + if self not in parent_node.children: + parent_node.children.append(self) + + def add_child(self, child_node): + assert child_node not in self.children + self.children.append(child_node) + if self not in child_node.parents: + child_node.parents.append(self) + + def get_only_parent(self): + if len(self.parents) != 1: + raise KaffeError('Node (%s) expected to have 1 parent. Found %s.' 
% + (self, len(self.parents))) + return self.parents[0] + + @property + def parameters(self): + if self.layer is not None: + return self.layer.parameters + return None + + def __str__(self): + return '[%s] %s' % (self.kind, self.name) + + def __repr__(self): + return '%s (0x%x)' % (self.name, id(self)) + + +class Graph(object): + def __init__(self, nodes=None, name=None): + self.nodes = nodes or [] + self.node_lut = {node.name: node for node in self.nodes} + self.name = name + + def add_node(self, node): + self.nodes.append(node) + self.node_lut[node.name] = node + + def get_node(self, name): + try: + return self.node_lut[name] + except KeyError: + raise KaffeError('Layer not found: %s' % name) + + def get_input_nodes(self): + return [node for node in self.nodes if len(node.parents) == 0] + + def get_output_nodes(self): + return [node for node in self.nodes if len(node.children) == 0] + + def topologically_sorted(self): + sorted_nodes = [] + unsorted_nodes = list(self.nodes) + temp_marked = set() + perm_marked = set() + + def visit(node): + if node in temp_marked: + raise KaffeError('Graph is not a DAG.') + if node in perm_marked: + return + temp_marked.add(node) + for child in node.children: + visit(child) + perm_marked.add(node) + temp_marked.remove(node) + sorted_nodes.insert(0, node) + + while len(unsorted_nodes): + visit(unsorted_nodes.pop()) + return sorted_nodes + + def compute_output_shapes(self): + sorted_nodes = self.topologically_sorted() + for node in sorted_nodes: + node.output_shape = TensorShape( + *NodeKind.compute_output_shape(node)) + + def replaced(self, new_nodes): + return Graph(nodes=new_nodes, name=self.name) + + def transformed(self, transformers): + graph = self + for transformer in transformers: + graph = transformer(graph) + if graph is None: + raise KaffeError('Transformer failed: {}'.format(transformer)) + assert isinstance(graph, Graph) + return graph + + def __contains__(self, key): + return key in self.node_lut + + def __str__(self): + hdr = '{:<20} {:<30} {:>20} {:>20}'.format('Type', 'Name', 'Param', + 'Output') + s = [hdr, '-' * 94] + for node in self.topologically_sorted(): + # If the node has learned parameters, display the first one's shape. + # In case of convolutions, this corresponds to the weights. + data_shape = node.data[0].shape if node.data else '--' + out_shape = node.output_shape or '--' + s.append('{:<20} {:<30} {:>20} {:>20}'.format( + node.kind, node.name, data_shape, tuple(out_shape))) + return '\n'.join(s) + + +class GraphBuilder(object): + '''Constructs a model graph from a Caffe protocol buffer definition.''' + + def __init__(self, def_path, phase='test'): + ''' + def_path: Path to the model definition (.prototxt) + data_path: Path to the model data (.caffemodel) + phase: Either 'test' or 'train'. Used for filtering phase-specific nodes. 
+        '''
+        self.def_path = def_path
+        self.phase = phase
+        self.load()
+
+    def load(self):
+        '''Load the layer definitions from the prototxt.'''
+        self.params = get_caffe_resolver().NetParameter()
+        with open(self.def_path, 'rb') as def_file:
+            text_format.Merge(def_file.read(), self.params)
+
+    def filter_layers(self, layers):
+        '''Filter out layers based on the current phase.'''
+        phase_map = {0: 'train', 1: 'test'}
+        filtered_layer_names = set()
+        filtered_layers = []
+        for layer in layers:
+            phase = self.phase
+            if len(layer.include):
+                phase = phase_map[layer.include[0].phase]
+            if len(layer.exclude):
+                phase = phase_map[1 - layer.exclude[0].phase]
+            exclude = (phase != self.phase)
+            # Dropout layers appear in a fair number of Caffe
+            # test-time networks. These are just ignored. We'll
+            # filter them out here.
+            if (not exclude) and (phase == 'test'):
+                exclude = (layer.type == LayerType.Dropout)
+            if not exclude:
+                filtered_layers.append(layer)
+                # Guard against dupes.
+                assert layer.name not in filtered_layer_names
+                filtered_layer_names.add(layer.name)
+        return filtered_layers
+
+    def make_node(self, layer):
+        '''Create a graph node for the given layer.'''
+        kind = NodeKind.map_raw_kind(layer.type)
+        if kind is None:
+            raise KaffeError('Unknown layer type encountered: %s' % layer.type)
+
+        # We want to use the layer's top names (the "output" names), rather than the
+        # name attribute, which is more of a readability thing than a functional one.
+        # Other layers will refer to a node by its "top name".
+        return Node(layer.name, kind, layer=layer)
+
+    def make_input_nodes(self):
+        '''
+        Create data input nodes.
+
+        This method is for old-style inputs, where the input specification
+        was not treated as a first-class layer in the prototxt.
+        Newer models use the "Input layer" type.
+        '''
+        nodes = [Node(name, NodeKind.Data) for name in self.params.input]
+        if len(nodes):
+            input_dim = map(int, self.params.input_dim)
+            if not input_dim:
+                if len(self.params.input_shape) > 0:
+                    input_dim = map(int, self.params.input_shape[0].dim)
+                else:
+                    raise KaffeError('Dimensions for input not specified.')
+            for node in nodes:
+                node.output_shape = tuple(input_dim)
+        return nodes
+
+    def build(self):
+        '''
+        Builds the graph from the Caffe layer definitions.
+        '''
+        # Get the layers
+        layers = self.params.layers or self.params.layer
+        # Filter out phase-excluded layers
+        layers = self.filter_layers(layers)
+        # Get any separately-specified input layers
+        nodes = self.make_input_nodes()
+        nodes += [self.make_node(layer) for layer in layers]
+        # Initialize the graph
+        graph = Graph(nodes=nodes, name=self.params.name)
+        # Connect the nodes
+        #
+        # A note on layers and outputs:
+        # In Caffe, each layer can produce multiple outputs ("tops") from a set of inputs
+        # ("bottoms"). The bottoms refer to other layers' tops. The top can rewrite a bottom
+        # (in case of in-place operations). Note that the layer's name is not used for establishing
+        # any connectivity. It's only used for data association. By convention, a layer with a
+        # single top will often use the same name (although this is not required).
+        #
+        # The current implementation only supports single-output nodes (note that a node can still
+        # have multiple children, since multiple child nodes can refer to the single top's name).
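+        #
+        # For example, Caffe's common in-place ReLU pattern looks like this
+        # (hypothetical prototxt):
+        #     layer { name: "relu1" type: "ReLU" bottom: "conv1" top: "conv1" }
+        # Here the top rewrites its bottom; the loop below undoes the in-placing.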
+        node_outputs = {}
+        for layer in layers:
+            node = graph.get_node(layer.name)
+            for input_name in layer.bottom:
+                assert input_name != layer.name
+                parent_node = node_outputs.get(input_name)
+                if (parent_node is None) or (parent_node == node):
+                    parent_node = graph.get_node(input_name)
+                node.add_parent(parent_node)
+            if len(layer.top) > 1:
+                raise KaffeError('Multiple top nodes are not supported.')
+
+            for output_name in layer.top:
+                if output_name == layer.name:
+                    # Output is named the same as the node. No further action required.
+                    continue
+                # There are two possibilities here:
+                #
+                # Case 1: output_name refers to another node in the graph.
+                #         This is an "in-place operation" that overwrites an existing node.
+                #         This would create a cycle in the graph. We'll undo the in-placing
+                #         by substituting this node wherever the overwritten node is referenced.
+                #
+                # Case 2: output_name violates the convention layer.name == output_name.
+                #         Since we are working in the single-output regime, we can simply
+                #         rename it to match the layer name.
+                #
+                # In both cases, future references to this top are re-routed to this node.
+                node_outputs[output_name] = node
+
+        graph.compute_output_shapes()
+        return graph
+
+
+class NodeMapper(NodeDispatch):
+    def __init__(self, graph):
+        self.graph = graph
+
+    def map(self):
+        nodes = self.graph.topologically_sorted()
+        # Remove input nodes - we'll handle them separately.
+        input_nodes = self.graph.get_input_nodes()
+        nodes = [t for t in nodes if t not in input_nodes]
+        # Decompose DAG into chains.
+        chains = []
+        for node in nodes:
+            attach_to_chain = None
+            if len(node.parents) == 1:
+                parent = node.get_only_parent()
+                for chain in chains:
+                    if chain[-1] == parent:
+                        # Node is part of an existing chain.
+                        attach_to_chain = chain
+                        break
+            if attach_to_chain is None:
+                # Start a new chain for this node.
+                attach_to_chain = []
+                chains.append(attach_to_chain)
+            attach_to_chain.append(node)
+        # Map each chain.
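+        # For example, a DAG  a -> b -> c  with an extra branch  b -> d
+        # decomposes into two chains, e.g. [a, b, c] and [d], with the exact
+        # split depending on traversal order.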
+ mapped_chains = [] + for chain in chains: + mapped_chains.append(self.map_chain(chain)) + return self.commit(mapped_chains) + + def map_chain(self, chain): + return [self.map_node(node) for node in chain] + + def map_node(self, node): + map_func = self.get_handler(node.kind, 'map') + mapped_node = map_func(node) + assert mapped_node is not None + mapped_node.node = node + return mapped_node + + def commit(self, mapped_chains): + raise NotImplementedError('Must be implemented by subclass.') diff --git a/fluid/image_classification/caffe2fluid/kaffe/layers.py b/fluid/image_classification/caffe2fluid/kaffe/layers.py new file mode 100644 index 0000000000..f263407ab4 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/layers.py @@ -0,0 +1,214 @@ +import re +import numbers +from collections import namedtuple + +from .shapes import * + +LAYER_DESCRIPTORS = { + + # Caffe Types + 'AbsVal': shape_identity, + 'Accuracy': shape_scalar, + 'ArgMax': shape_not_implemented, + 'BatchNorm': shape_identity, + 'BNLL': shape_not_implemented, + 'Concat': shape_concat, + 'ContrastiveLoss': shape_scalar, + 'Convolution': shape_convolution, + 'Deconvolution': shape_not_implemented, + 'Data': shape_data, + 'Dropout': shape_identity, + 'DummyData': shape_data, + 'EuclideanLoss': shape_scalar, + 'Eltwise': shape_identity, + 'Exp': shape_identity, + 'Flatten': shape_not_implemented, + 'HDF5Data': shape_data, + 'HDF5Output': shape_identity, + 'HingeLoss': shape_scalar, + 'Im2col': shape_not_implemented, + 'ImageData': shape_data, + 'InfogainLoss': shape_scalar, + 'InnerProduct': shape_inner_product, + 'Input': shape_data, + 'LRN': shape_identity, + 'MemoryData': shape_mem_data, + 'MultinomialLogisticLoss': shape_scalar, + 'MVN': shape_not_implemented, + 'Pooling': shape_pool, + 'Power': shape_identity, + 'ReLU': shape_identity, + 'Scale': shape_identity, + 'Sigmoid': shape_identity, + 'SigmoidCrossEntropyLoss': shape_scalar, + 'Silence': shape_not_implemented, + 'Softmax': shape_identity, + 'SoftmaxWithLoss': shape_scalar, + 'Split': shape_not_implemented, + 'Slice': shape_not_implemented, + 'TanH': shape_identity, + 'WindowData': shape_not_implemented, + 'Threshold': shape_identity, +} + +# layer types in 'V1LayerParameter' +# (v1layertype name, enum value, mapped to layer type) +v1_layertypes = [ + ('ABSVAL', 35), + ('ACCURACY', 1), + ('ARGMAX', 30), + ('BNLL', 2), + ('CONCAT', 3), + ('CONVOLUTION', 4), + ('DATA', 5), + ('DECONVOLUTION', 39), + ('DROPOUT', 6), + ('ELTWISE', 25), + ('EXP', 38), + ('FLATTEN', 8), + ('IM2COL', 11), + ('INNERPRODUCT', 14), + ('LRN', 15), + ('MEMORYDATA', 29), + ('MULTINOMIALLOGISTICLOSS', 16), + ('MVN', 34), + ('POOLING', 17), + ('POWER', 26), + ('RELU', 18), + ('SIGMOID', 19), + ('SIGMOIDCROSSENTROPYLOSS', 27), + ('SILENCE', 36), + ('SOFTMAX', 20), + ('SPLIT', 22), + ('SLICE', 33), + ('TANH', 23), + ('WINDOWDATA', 24), + ('THRESHOLD', 31), +] + +LAYER_TYPES = LAYER_DESCRIPTORS.keys() +LayerType = type('LayerType', (), {t: t for t in LAYER_TYPES}) + +#map the layer name in V1 to standard name +V1_LAYER_MAP = {'_not_init_': True} + + +def get_v1_layer_map(): + global V1_LAYER_MAP + if '_not_init_' not in V1_LAYER_MAP: + return V1_LAYER_MAP + else: + del V1_LAYER_MAP['_not_init_'] + + name2layer = {} + for n in LAYER_TYPES: + name2layer[n.upper()] = n + + for l in v1_layertypes: + n, v = l + if n in name2layer and v not in V1_LAYER_MAP: + V1_LAYER_MAP[v] = name2layer[n] + else: + raise KaffeError('not found v1 layer type %s' % n) + return V1_LAYER_MAP + + +class 
NodeKind(LayerType):
+    @staticmethod
+    def map_raw_kind(kind):
+        if kind in LAYER_TYPES:
+            return kind
+
+        v1_layers = get_v1_layer_map()
+        if kind in v1_layers:
+            return v1_layers[kind]
+        else:
+            return None
+
+    @staticmethod
+    def compute_output_shape(node):
+        try:
+            val = LAYER_DESCRIPTORS[node.kind](node)
+            return val
+        except NotImplementedError:
+            raise KaffeError(
+                'Output shape computation not implemented for type: %s' %
+                node.kind)
+
+
+class NodeDispatchError(KaffeError):
+
+    pass
+
+
+class NodeDispatch(object):
+    @staticmethod
+    def get_handler_name(node_kind):
+        if len(node_kind) <= 4:
+            # A catch-all for things like ReLU and tanh
+            return node_kind.lower()
+        # Convert from CamelCase to under_scored
+        name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', node_kind)
+        return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
+
+    def get_handler(self, node_kind, prefix):
+        name = self.get_handler_name(node_kind)
+        name = '_'.join((prefix, name))
+        try:
+            return getattr(self, name)
+        except AttributeError:
+            raise NodeDispatchError(
+                'No handler found for node kind: %s (expected: %s)' %
+                (node_kind, name))
+
+
+class LayerAdapter(object):
+    def __init__(self, layer, kind):
+        self.layer = layer
+        self.kind = kind
+
+    @property
+    def parameters(self):
+        name = NodeDispatch.get_handler_name(self.kind)
+        name = '_'.join((name, 'param'))
+        try:
+            return getattr(self.layer, name)
+        except AttributeError:
+            raise NodeDispatchError(
+                'Caffe parameters not found for layer kind: %s' % (self.kind))
+
+    @staticmethod
+    def get_kernel_value(scalar, repeated, idx, default=None):
+        if scalar:
+            return scalar
+        if repeated:
+            if isinstance(repeated, numbers.Number):
+                return repeated
+            if len(repeated) == 1:
+                # Same value applies to all spatial dimensions
+                return int(repeated[0])
+            assert idx < len(repeated)
+            # Extract the value for the given spatial dimension
+            return repeated[idx]
+        if default is None:
+            raise ValueError('Unable to determine kernel parameter!')
+        return default
+
+    @property
+    def kernel_parameters(self):
+        assert self.kind in (NodeKind.Convolution, NodeKind.Pooling)
+        params = self.parameters
+        k_h = self.get_kernel_value(params.kernel_h, params.kernel_size, 0)
+        k_w = self.get_kernel_value(params.kernel_w, params.kernel_size, 1)
+        s_h = self.get_kernel_value(
+            params.stride_h, params.stride, 0, default=1)
+        s_w = self.get_kernel_value(
+            params.stride_w, params.stride, 1, default=1)
+        p_h = self.get_kernel_value(params.pad_h, params.pad, 0, default=0)
+        p_w = self.get_kernel_value(params.pad_w, params.pad, 1, default=0)
+        return KernelParameters(k_h, k_w, s_h, s_w, p_h, p_w)
+
+
+KernelParameters = namedtuple('KernelParameters', [
+    'kernel_h', 'kernel_w', 'stride_h', 'stride_w', 'pad_h', 'pad_w'
+])
diff --git a/fluid/image_classification/caffe2fluid/kaffe/paddle/__init__.py b/fluid/image_classification/caffe2fluid/kaffe/paddle/__init__.py
new file mode 100644
index 0000000000..685b653c39
--- /dev/null
+++ b/fluid/image_classification/caffe2fluid/kaffe/paddle/__init__.py
@@ -0,0 +1,2 @@
+from .transformer import Transformer
+from .network import Network
diff --git a/fluid/image_classification/caffe2fluid/kaffe/paddle/network.py b/fluid/image_classification/caffe2fluid/kaffe/paddle/network.py
new file mode 100644
index 0000000000..fd6a71cb6a
--- /dev/null
+++ b/fluid/image_classification/caffe2fluid/kaffe/paddle/network.py
@@ -0,0 +1,289 @@
+import math
+import os
+import numpy as np
+
+
+def import_fluid():
+    import paddle.v2.fluid as fluid
+    return fluid
+
+
+def layer(op):
+    '''Decorator for composable network layers.'''
+
+    def layer_decorated(self, *args, **kwargs):
+        # Automatically set a name if not provided.
+        name = kwargs.setdefault('name', self.get_unique_name(op.__name__))
+        # Figure out the layer inputs.
+        if len(self.terminals) == 0:
+            raise RuntimeError('No input variables found for layer %s.' % name)
+        elif len(self.terminals) == 1:
+            layer_input = self.terminals[0]
+        else:
+            layer_input = list(self.terminals)
+        # Perform the operation and get the output.
+        layer_output = op(self, layer_input, *args, **kwargs)
+        # Add to layer LUT.
+        self.layers[name] = layer_output
+        # This output is now the input for the next layer.
+        self.feed(layer_output)
+        #print('output shape of %s:' % (name))
+        #print layer_output.shape
+
+        # Return self for chained calls.
+        return self
+
+    return layer_decorated
+
+
+class Network(object):
+    def __init__(self, inputs, trainable=True):
+        # The input nodes for this network
+        self.inputs = inputs
+        # The current list of terminal nodes
+        self.terminals = []
+        # Mapping from layer names to layers
+        self.layers = dict(inputs)
+        # If true, the resulting variables are set as trainable
+        self.trainable = trainable
+        # Switch variable for dropout
+        self.paddle_env = None
+        self.setup()
+
+    def setup(self):
+        '''Construct the network.'''
+        raise NotImplementedError('Must be implemented by the subclass.')
+
+    def load(self, data_path, exe=None, place=None, ignore_missing=False):
+        '''Load network weights.
+        data_path: The path to the numpy-serialized network weights
+        ignore_missing: If true, serialized weights for missing layers are ignored.
+        '''
+        fluid = import_fluid()
+        # load a fluid model directly
+        if os.path.isdir(data_path):
+            assert (exe is not None), \
+                'must provide an executor to load fluid model'
+            fluid.io.load_persistables_if_exist(executor=exe, dirname=data_path)
+            return True
+
+        # load model from an npy file
+        if exe is None or place is None:
+            if self.paddle_env is None:
+                place = fluid.CPUPlace()
+                exe = fluid.Executor(place)
+                self.paddle_env = {'place': place, 'exe': exe}
+                exe.run(fluid.default_startup_program())
+            else:
+                place = self.paddle_env['place']
+                exe = self.paddle_env['exe']
+
+        data_dict = np.load(data_path).item()
+        for op_name in data_dict:
+            layer = self.layers[op_name]
+            for param_name, data in data_dict[op_name].iteritems():
+                try:
+                    name = '%s_%s' % (op_name, param_name)
+                    v = fluid.global_scope().find_var(name)
+                    w = v.get_tensor()
+                    w.set(data, place)
+                except ValueError:
+                    if not ignore_missing:
+                        raise
+        return True
+
+    def feed(self, *args):
+        '''Set the input(s) for the next operation by replacing the terminal nodes.
+        The arguments can be either layer names or the actual layers.
+        '''
+        assert len(args) != 0
+        self.terminals = []
+        for fed_layer in args:
+            if isinstance(fed_layer, basestring):
+                try:
+                    fed_layer = self.layers[fed_layer]
+                except KeyError:
+                    raise KeyError('Unknown layer name fed: %s' % fed_layer)
+            self.terminals.append(fed_layer)
+        return self
+
+    def get_output(self):
+        '''Returns the current network output.'''
+        return self.terminals[-1]
+
+    def get_unique_name(self, prefix):
+        '''Returns an index-suffixed unique name for the given prefix.
+        This is used for auto-generating layer names based on the type-prefix.
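+        For example, with 'conv_1' and 'conv_2' already registered,
+        get_unique_name('conv') returns 'conv_3'.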
+ ''' + ident = sum(t.startswith(prefix) for t, _ in self.layers.items()) + 1 + return '%s_%d' % (prefix, ident) + + @layer + def conv(self, + input, + k_h, + k_w, + c_o, + s_h, + s_w, + name, + relu=True, + padding=None, + group=1, + biased=True): + if padding is None: + padding = [0, 0] + + # Get the number of channels in the input + c_i, h_i, w_i = input.shape[1:] + + # Verify that the grouping parameter is valid + assert c_i % group == 0 + assert c_o % group == 0 + + fluid = import_fluid() + prefix = name + '_' + output = fluid.layers.conv2d( + input=input, + filter_size=[k_h, k_w], + num_filters=c_o, + stride=[s_h, s_w], + padding=padding, + groups=group, + param_attr=fluid.ParamAttr(name=prefix + "weights"), + bias_attr=fluid.ParamAttr(name=prefix + "biases"), + act="relu" if relu is True else None) + return output + + @layer + def relu(self, input, name): + fluid = import_fluid() + output = fluid.layers.relu(x=input) + return output + + def _adjust_pad_if_needed(self, i_hw, k_hw, s_hw, p_hw): + #adjust the padding if needed + i_h, i_w = i_hw + k_h, k_w = k_hw + s_h, s_w = s_hw + p_h, p_w = p_hw + + def is_consistent(i, k, s, p): + o = i + 2 * p - k + if o % s == 0: + return True + else: + return False + + real_p_h = 0 + real_p_w = 0 + if is_consistent(i_h, k_h, s_h, p_h) is False: + real_p_h = int(k_h / 2) + + if is_consistent(i_w, k_w, s_w, p_w) is False: + real_p_w = int(k_w / 2) + + return [real_p_h, real_p_w] + + def pool(self, pool_type, input, k_h, k_w, s_h, s_w, name, padding): + # Get the number of channels in the input + in_hw = input.shape[2:] + k_hw = [k_h, k_w] + s_hw = [s_h, s_w] + + if padding is None: + #fix bug about the difference between conv and pool + #more info: https://github.com/BVLC/caffe/issues/1318 + padding = self._adjust_pad_if_needed(in_hw, k_hw, s_hw, [0, 0]) + + fluid = import_fluid() + output = fluid.layers.pool2d( + input=input, + pool_size=k_hw, + pool_stride=s_hw, + pool_padding=padding, + pool_type=pool_type) + return output + + @layer + def max_pool(self, input, k_h, k_w, s_h, s_w, name, padding=None): + return self.pool('max', input, k_h, k_w, s_h, s_w, name, padding) + + @layer + def avg_pool(self, input, k_h, k_w, s_h, s_w, name, padding=None): + return self.pool('avg', input, k_h, k_w, s_h, s_w, name, padding) + + @layer + def lrn(self, input, radius, alpha, beta, name, bias=1.0): + fluid = import_fluid() + output = fluid.layers.lrn(input=input, \ + n=radius, k=bias, alpha=alpha, beta=beta, name=name) + return output + + @layer + def concat(self, inputs, axis, name): + fluid = import_fluid() + output = fluid.layers.concat(input=inputs, axis=axis) + return output + + @layer + def add(self, inputs, name): + fluid = import_fluid() + output = inputs[0] + for i in inputs[1:]: + output = fluid.layers.elementwise_add(x=output, y=i) + return output + + @layer + def fc(self, input, num_out, name, relu=True, act=None): + fluid = import_fluid() + + if act is None: + act = 'relu' if relu is True else None + + prefix = name + '_' + output = fluid.layers.fc( + name=name, + input=input, + size=num_out, + act=act, + param_attr=fluid.ParamAttr(name=prefix + 'weights'), + bias_attr=fluid.ParamAttr(name=prefix + 'biases')) + return output + + @layer + def softmax(self, input, name): + fluid = import_fluid() + output = fluid.layers.softmax(input) + return output + + @layer + def batch_normalization(self, input, name, scale_offset=True, relu=False): + # NOTE: Currently, only inference is supported + fluid = import_fluid() + prefix = name + '_' + param_attr = None 
if scale_offset is False else fluid.ParamAttr( + name=prefix + 'scale') + bias_attr = None if scale_offset is False else fluid.ParamAttr( + name=prefix + 'offset') + mean_name = prefix + 'mean' + variance_name = prefix + 'variance' + output = fluid.layers.batch_norm( + name=name, + input=input, + is_test=True, + param_attr=param_attr, + bias_attr=bias_attr, + moving_mean_name=mean_name, + moving_variance_name=variance_name, + epsilon=1e-5, + act='relu' if relu is True else None) + + return output + + @layer + def dropout(self, input, drop_prob, name, is_test=True): + fluid = import_fluid() + output = fluid.layers.dropout( + input, dropout_prob=drop_prob, is_test=is_test, name=name) + return output diff --git a/fluid/image_classification/caffe2fluid/kaffe/paddle/transformer.py b/fluid/image_classification/caffe2fluid/kaffe/paddle/transformer.py new file mode 100644 index 0000000000..4d7ec49a39 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/paddle/transformer.py @@ -0,0 +1,364 @@ +import numpy as np + +from ..errors import KaffeError, print_stderr +from ..graph import GraphBuilder, NodeMapper +from ..layers import NodeKind +from ..transformers import (DataInjector, DataReshaper, NodeRenamer, ReLUFuser, + BatchNormScaleBiasFuser, BatchNormPreprocessor, + ParameterNamer) +from . import network + + +def get_padding_type(kernel_params, input_shape, output_shape): + '''Translates Caffe's numeric padding to one of ('SAME', 'VALID'). + Caffe supports arbitrary padding values, while TensorFlow only + supports 'SAME' and 'VALID' modes. So, not all Caffe paddings + can be translated to TensorFlow. There are some subtleties to + how the padding edge-cases are handled. These are described here: + https://github.com/Yangqing/caffe2/blob/master/caffe2/proto/caffe2_legacy.proto + ''' + k_h, k_w, s_h, s_w, p_h, p_w = kernel_params + if p_h * p_w > 0: + return [p_h, p_w] + else: + return None + + +class TensorFlowNode(object): + '''An intermediate representation for TensorFlow operations.''' + + def __init__(self, op, *args, **kwargs): + # A string corresponding to the TensorFlow operation + self.op = op + # Positional arguments for the operation + self.args = args + # Keyword arguments for the operation + self.kwargs = list(kwargs.items()) + # The source Caffe node + self.node = None + + def format(self, arg): + '''Returns a string representation for the given value.''' + return "'%s'" % arg if isinstance(arg, basestring) else str(arg) + + def pair(self, key, value): + '''Returns key=formatted(value).''' + return '%s=%s' % (key, self.format(value)) + + def emit(self): + '''Emits the Python source for this node.''' + # Format positional arguments + args = map(self.format, self.args) + # Format any keyword arguments + if self.kwargs: + args += [self.pair(k, v) for k, v in self.kwargs] + # Set the node name + args.append(self.pair('name', self.node.name)) + args = ', '.join(args) + return '%s(%s)' % (self.op, args) + + +class MaybeActivated(object): + def __init__(self, node, default=True): + self.inject_kwargs = {} + if node.metadata.get('relu', False) != default: + self.inject_kwargs['relu'] = not default + + def __call__(self, *args, **kwargs): + kwargs.update(self.inject_kwargs) + return TensorFlowNode(*args, **kwargs) + + +class TensorFlowMapper(NodeMapper): + def get_kernel_params(self, node): + kernel_params = node.layer.kernel_parameters + input_shape = node.get_only_parent().output_shape + padding = get_padding_type(kernel_params, input_shape, + node.output_shape) + # Only emit the 
padding if it's not the default value. + padding = {'padding': padding} if padding is not None else {} + return (kernel_params, padding) + + def map_convolution(self, node): + (kernel_params, kwargs) = self.get_kernel_params(node) + h = kernel_params.kernel_h + w = kernel_params.kernel_w + c_o = node.output_shape[1] + c_i = node.parents[0].output_shape[1] + group = node.parameters.group + if group != 1: + kwargs['group'] = group + if not node.parameters.bias_term: + kwargs['biased'] = False + assert kernel_params.kernel_h == h + assert kernel_params.kernel_w == w + return MaybeActivated(node)( + 'conv', kernel_params.kernel_h, kernel_params.kernel_w, c_o, + kernel_params.stride_h, kernel_params.stride_w, **kwargs) + + def map_relu(self, node): + return TensorFlowNode('relu') + + def map_pooling(self, node): + pool_type = node.parameters.pool + if pool_type == 0: + pool_op = 'max_pool' + elif pool_type == 1: + pool_op = 'avg_pool' + else: + # Stochastic pooling, for instance. + raise KaffeError('Unsupported pooling type.') + (kernel_params, padding) = self.get_kernel_params(node) + return TensorFlowNode(pool_op, kernel_params.kernel_h, + kernel_params.kernel_w, kernel_params.stride_h, + kernel_params.stride_w, **padding) + + def map_inner_product(self, node): + #TODO: Axis + assert node.parameters.axis == 1 + #TODO: Unbiased + assert node.parameters.bias_term == True + return MaybeActivated(node)('fc', node.parameters.num_output) + + def map_softmax(self, node): + return TensorFlowNode('softmax') + + def map_lrn(self, node): + params = node.parameters + # The window size must be an odd value. For a window + # size of (2*n+1), TensorFlow defines depth_radius = n. + assert params.local_size % 2 == 1 + # Caffe scales by (alpha/(2*n+1)), whereas TensorFlow + # just scales by alpha (as does Krizhevsky's paper). + # We'll account for that here. 
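+        # For example, Caffe's local_size=5 with alpha=1e-4 becomes
+        # alpha=2e-5 here.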
+ alpha = params.alpha / float(params.local_size) + return TensorFlowNode('lrn', params.local_size, alpha, params.beta) + + def map_concat(self, node): + return TensorFlowNode('concat', node.parameters.axis) + + def map_dropout(self, node): + return TensorFlowNode('dropout', node.parameters.dropout_ratio) + + def map_batch_norm(self, node): + scale_offset = len(node.data) == 4 + kwargs = {} if scale_offset else {'scale_offset': False} + return MaybeActivated( + node, default=False)('batch_normalization', **kwargs) + + def map_eltwise(self, node): + operations = {0: 'multiply', 1: 'add', 2: 'max'} + op_code = node.parameters.operation + try: + return TensorFlowNode(operations[op_code]) + except KeyError: + raise KaffeError('Unknown elementwise operation: {}'.format( + op_code)) + + def commit(self, chains): + return chains + + +class TensorFlowEmitter(object): + def __init__(self, tab=None): + self.tab = tab or ' ' * 4 + self.prefix = '' + self.net_name = '' + + def indent(self): + self.prefix += self.tab + + def outdent(self): + self.prefix = self.prefix[:-len(self.tab)] + + def statement(self, s): + return self.prefix + s + '\n' + + def emit_imports(self): + import inspect + codes = [] + codes.append( + '### generated by caffe2fluid, your net is in class "%s" ###\n' % + (self.net_name)) + network_source = inspect.getsource(network) + codes.append(network_source + '\n') + return self.statement('\n'.join(codes)) + + def emit_class_def(self, name): + return self.statement('class %s(Network):' % (name)) + + def emit_setup_def(self): + return self.statement('def setup(self):') + + def emit_shape_def(self, input_nodes): + self.outdent() + func_def = self.statement('@classmethod') + func_def += self.statement('def input_shapes(cls):') + self.indent() + + input_shapes = {} + for n in input_nodes: + name = n.name + output_shape = n.output_shape + shape = [str(s) for s in output_shape[1:]] + input_shapes[name] = ', '.join(shape) + input_shapes = ['"%s": [%s]' % (n, l) for n, l in input_shapes.items()] + shape_str = ','.join(input_shapes) + func_def += self.statement('return {%s}' % (shape_str)) + return '\n\n' + func_def + + def emit_convert_def(self, input_nodes): + codes = [] + inputs = {} + codes.append('shapes = cls.input_shapes()') + for n in input_nodes: + name = n.name + layer_var = name + '_layer' + layer_def = '%s = fluid.layers.data(name="%s", shape=shapes["%s"],'\ + ' dtype="float32")' % (layer_var, name, name) + #layer_var, layer_def = data_layer_def(n.name, n.output_shape) + codes.append(layer_def) + inputs[name] = layer_var + + input_dict = ','.join(['"%s": %s' % (n, l) for n, l in inputs.items()]) + + codes.append('feed_data = {' + input_dict + '}') + codes.append('net = cls(feed_data)') + + codes.append("place = fluid.CPUPlace()") + codes.append("exe = fluid.Executor(place)") + codes.append("exe.run(fluid.default_startup_program())") + codes.append("net.load(data_path=npy_model, exe=exe, place=place)") + codes.append( + "fluid.io.save_persistables(executor=exe, dirname=fluid_path)") + + self.outdent() + func_def = self.statement('@classmethod') + func_def += self.statement('def convert(cls, npy_model, fluid_path):') + self.indent() + func_def += self.statement('import paddle.v2.fluid as fluid') + for l in codes: + func_def += self.statement(l) + return '\n' + func_def + + def emit_main_def(self, name): + if name is None: + return '' + + self.prefix = '' + main_def = self.statement('if __name__ == "__main__":') + self.indent() + main_def += self.statement("#usage: python xxxnet.py 
xxx.npy ./model\n") + main_def += self.statement("import sys") + main_def += self.statement("npy_weight = sys.argv[1]") + main_def += self.statement("fluid_model = sys.argv[2]") + main_def += self.statement("%s.convert(npy_weight, fluid_model)" % + (name)) + main_def += self.statement("exit(0)") + return '\n\n' + main_def + + def emit_parents(self, chain): + assert len(chain) + s = 'self.feed(' + sep = ', \n' + self.prefix + (' ' * len(s)) + s += sep.join( + ["'%s'" % parent.name for parent in chain[0].node.parents]) + return self.statement(s + ')') + + def emit_node(self, node): + return self.statement('self.' + node.emit()) + + def emit(self, name, chains, input_nodes=None): + self.net_name = name + s = self.emit_imports() + s += self.emit_class_def(name) + self.indent() + s += self.emit_setup_def() + self.indent() + blocks = [] + for chain in chains: + b = '' + b += self.emit_parents(chain) + for node in chain: + b += self.emit_node(node) + blocks.append(b[:-1]) + s = s + '\n\n'.join(blocks) + s += self.emit_shape_def(input_nodes) + s += self.emit_convert_def(input_nodes) + s += self.emit_main_def(name) + return s + + +class Transformer(object): + def __init__(self, def_path, data_path, verbose=True, phase='test'): + self.verbose = verbose + self.phase = phase + self.load(def_path, data_path, phase) + self.params = None + self.source = None + + def load(self, def_path, data_path, phase): + # Build the graph + graph = GraphBuilder(def_path, phase).build() + + if data_path is not None: + # Load and associate learned parameters + graph = DataInjector(def_path, data_path)(graph) + + # Transform the graph + transformers = [ + # Fuse split batch normalization layers + BatchNormScaleBiasFuser(), + + # Fuse ReLUs + # TODO: Move non-linearity application to layer wrapper, allowing + # any arbitrary operation to be optionally activated. + ReLUFuser(allowed_parent_types=[ + NodeKind.Convolution, NodeKind.InnerProduct, NodeKind.BatchNorm + ]), + + # Rename nodes + # Slashes are used for scoping in TensorFlow. Replace slashes + # in node names with underscores. 
+ # (Caffe's GoogLeNet implementation uses slashes) + NodeRenamer(lambda node: node.name.replace('/', '_')) + ] + self.graph = graph.transformed(transformers) + + # Display the graph + if self.verbose: + print_stderr(self.graph) + + def transform_data(self): + if self.params is None: + transformers = [ + # Reshape the parameters to TensorFlow's ordering + DataReshaper({ + # (c_o, c_i, h, w) -> (h, w, c_i, c_o) for TF + NodeKind.Convolution: (0, 1, 2, 3), + + # (c_o, c_i) -> (c_i, c_o) + NodeKind.InnerProduct: (1, 0) + }), + + # Pre-process batch normalization data + BatchNormPreprocessor(), + + # Convert parameters to dictionaries + ParameterNamer(), + ] + self.graph = self.graph.transformed(transformers) + self.params = { + node.name: node.data + for node in self.graph.nodes if node.data + } + return self.params + + def transform_source(self): + if self.source is None: + mapper = TensorFlowMapper(self.graph) + chains = mapper.map() + emitter = TensorFlowEmitter() + input_nodes = self.graph.get_input_nodes() + self.source = emitter.emit(self.graph.name, chains, input_nodes) + return self.source diff --git a/fluid/image_classification/caffe2fluid/kaffe/shapes.py b/fluid/image_classification/caffe2fluid/kaffe/shapes.py new file mode 100644 index 0000000000..e8124730c6 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/shapes.py @@ -0,0 +1,88 @@ +import math +from collections import namedtuple + +from .errors import KaffeError + +TensorShape = namedtuple('TensorShape', + ['batch_size', 'channels', 'height', 'width']) + + +def get_filter_output_shape(i_h, i_w, params, round_func): + o_h = (i_h + 2 * params.pad_h - params.kernel_h + ) / float(params.stride_h) + 1 + o_w = (i_w + 2 * params.pad_w - params.kernel_w + ) / float(params.stride_w) + 1 + return (int(round_func(o_h)), int(round_func(o_w))) + + +def get_strided_kernel_output_shape(node, round_func): + assert node.layer is not None + input_shape = node.get_only_parent().output_shape + o_h, o_w = get_filter_output_shape(input_shape.height, input_shape.width, + node.layer.kernel_parameters, round_func) + params = node.layer.parameters + has_c_o = hasattr(params, 'num_output') + c = params.num_output if has_c_o else input_shape.channels + return TensorShape(input_shape.batch_size, c, o_h, o_w) + + +def shape_not_implemented(node): + raise NotImplementedError + + +def shape_identity(node): + assert len(node.parents) > 0 + return node.parents[0].output_shape + + +def shape_scalar(node): + return TensorShape(1, 1, 1, 1) + + +def shape_data(node): + if node.output_shape: + # Old-style input specification + return node.output_shape + try: + # New-style input specification + return map(int, node.parameters.shape[0].dim) + except: + # We most likely have a data layer on our hands. The problem is, + # Caffe infers the dimensions of the data from the source (eg: LMDB). + # We want to avoid reading datasets here. Fail for now. + # This can be temporarily fixed by transforming the data layer to + # Caffe's "input" layer (as is usually used in the "deploy" version). + # TODO: Find a better solution for this. 
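+        # For example, rewriting the data layer as an Input layer
+        # (hypothetical prototxt) makes the dimensions explicit:
+        #     layer { name: "data" type: "Input"
+        #             input_param { shape { dim: 1 dim: 3 dim: 224 dim: 224 } } }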
+ raise KaffeError('Cannot determine dimensions of data layer.\n' + 'See comments in function shape_data for more info.') + + +def shape_mem_data(node): + params = node.parameters + return TensorShape(params.batch_size, params.channels, params.height, + params.width) + + +def shape_concat(node): + axis = node.layer.parameters.axis + output_shape = None + for parent in node.parents: + if output_shape is None: + output_shape = list(parent.output_shape) + else: + output_shape[axis] += parent.output_shape[axis] + return tuple(output_shape) + + +def shape_convolution(node): + return get_strided_kernel_output_shape(node, math.floor) + + +def shape_pool(node): + return get_strided_kernel_output_shape(node, math.ceil) + + +def shape_inner_product(node): + input_shape = node.get_only_parent().output_shape + return TensorShape(input_shape.batch_size, node.layer.parameters.num_output, + 1, 1) diff --git a/fluid/image_classification/caffe2fluid/kaffe/transformers.py b/fluid/image_classification/caffe2fluid/kaffe/transformers.py new file mode 100644 index 0000000000..9d300ca9c9 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/kaffe/transformers.py @@ -0,0 +1,303 @@ +''' +A collection of graph transforms. + +A transformer is a callable that accepts a graph and returns a transformed version. +''' +import os +import numpy as np + +from .caffe import get_caffe_resolver, has_pycaffe +from .errors import KaffeError, debug, notice, warn +from .layers import NodeKind + + +class DataInjector(object): + ''' + Associates parameters loaded from a .caffemodel file with their corresponding nodes. + ''' + + def __init__(self, def_path, data_path): + # The .prototxt file defining the graph + self.def_path = def_path + # The .caffemodel file containing the learned parameters + self.data_path = data_path + # Set to true if the fallback protocol-buffer based backend was used + self.did_use_pb = False + # A list containing (layer name, parameters) tuples + self.params = None + # Load the parameters + self.load() + + def load(self): + if has_pycaffe(): + self.load_using_caffe() + else: + self.load_using_pb() + + def load_using_caffe(self): + caffe = get_caffe_resolver().caffe + net = caffe.Net(self.def_path, self.data_path, caffe.TEST) + data = lambda blob: blob.data + self.params = [(k, map(data, v)) for k, v in net.params.items()] + + def load_using_pb(self): + data = get_caffe_resolver().NetParameter() + data.MergeFromString(open(self.data_path, 'rb').read()) + pair = lambda layer: (layer.name, self.normalize_pb_data(layer)) + layers = data.layers or data.layer + self.params = [pair(layer) for layer in layers if layer.blobs] + self.did_use_pb = True + + def normalize_pb_data(self, layer): + transformed = [] + for blob in layer.blobs: + if len(blob.shape.dim): + dims = blob.shape.dim + c_o, c_i, h, w = map(int, [1] * (4 - len(dims)) + list(dims)) + else: + c_o = blob.num + c_i = blob.channels + h = blob.height + w = blob.width + data = np.array(blob.data, dtype=np.float32).reshape(c_o, c_i, h, w) + transformed.append(data) + return transformed + + def adjust_parameters(self, node, data): + if not self.did_use_pb: + return data + # When using the protobuf-backend, each parameter initially has four dimensions. + # In certain cases (like FC layers), we want to eliminate the singleton dimensions. + # This implementation takes care of the common cases. However, it does leave the + # potential for future issues. + # The Caffe-backend does not suffer from this problem. 
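+        # For example, an InnerProduct weight blob stored as (1, 1, 4096, 1024)
+        # is squeezed to (4096, 1024), and its bias (1, 1, 1, 1024) to (1024,).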
+        data = list(data)
+        squeeze_indices = [1]  # Squeeze biases.
+        if node.kind == NodeKind.InnerProduct:
+            squeeze_indices.append(0)  # Squeeze FC.
+
+        for idx in squeeze_indices:
+            if idx >= len(data):
+                continue
+
+            shape_old = data[idx].shape
+            data[idx] = np.squeeze(data[idx])
+            shape_new = data[idx].shape
+            if len(shape_old) != len(shape_new):
+                debug('squeeze idx:%d, with kind:%s,name:%s' % \
+                    (idx, node.kind, node.name))
+        return data
+
+    def __call__(self, graph):
+        for layer_name, data in self.params:
+            if layer_name in graph:
+                node = graph.get_node(layer_name)
+                node.data = self.adjust_parameters(node, data)
+            else:
+                notice('Ignoring parameters for non-existent layer: %s' % \
+                    layer_name)
+        return graph
+
+
+class DataReshaper(object):
+    def __init__(self, mapping, replace=True):
+        # A dictionary mapping NodeKind to the transposed order.
+        self.mapping = mapping
+        # The node kinds eligible for reshaping
+        self.reshaped_node_types = self.mapping.keys()
+        # If true, the reshaped data will replace the old one.
+        # Otherwise, it's set to the reshaped_data attribute.
+        self.replace = replace
+
+    def has_spatial_parent(self, node):
+        try:
+            parent = node.get_only_parent()
+            s = parent.output_shape
+            return s.height > 1 or s.width > 1
+        except KaffeError:
+            return False
+
+    def map(self, node_kind):
+        try:
+            return self.mapping[node_kind]
+        except KeyError:
+            raise
+            #raise KaffeError('Ordering not found for node kind: {}'.format(node_kind))
+
+    def __call__(self, graph):
+        for node in graph.nodes:
+            if node.data is None:
+                continue
+            if node.kind not in self.reshaped_node_types:
+                # Check for 2+ dimensional data
+                if any(len(tensor.shape) > 1 for tensor in node.data):
+                    notice('parameters not reshaped for node: {}'.format(node))
+                continue
+            transpose_order = self.map(node.kind)
+            weights = node.data[0]
+            if (node.kind == NodeKind.InnerProduct
+                ) and self.has_spatial_parent(node):
+                # The FC layer connected to the spatial layer needs to be
+                # re-wired to match the new spatial ordering.
+                in_shape = node.get_only_parent().output_shape
+                fc_shape = weights.shape
+                output_channels = fc_shape[0]
+                weights = weights.reshape((output_channels, -1))
+                weights = weights.transpose(transpose_order)
+                node.reshaped_data = weights
+            else:
+                node.reshaped_data = weights.transpose(transpose_order)
+
+        if self.replace:
+            for node in graph.nodes:
+                if hasattr(node, 'reshaped_data'):
+                    # Set the weights
+                    node.data[0] = node.reshaped_data
+                    del node.reshaped_data
+        return graph
+
+
+class SubNodeFuser(object):
+    '''
+    An abstract helper for merging a single-child with its single-parent.
+    '''
+
+    def __call__(self, graph):
+        nodes = graph.nodes
+        fused_nodes = []
+        for node in nodes:
+            if len(node.parents) != 1:
+                # We're only fusing nodes with single parents
+                continue
+            parent = node.get_only_parent()
+            if len(parent.children) != 1:
+                # We can only fuse a node if its parent's
+                # value isn't used by any other node.
+                continue
+            if not self.is_eligible_pair(parent, node):
+                continue
+            # Rewrite the fused node's children to its parent.
+            for child in node.children:
+                child.parents.remove(node)
+                parent.add_child(child)
+            # Disconnect the fused node from the graph.
+            parent.children.remove(node)
+            fused_nodes.append(node)
+            # Let the sub-class merge the fused node in any arbitrary way.
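+            # For example, ReLUFuser below merges a ReLU into its parent by
+            # simply setting parent.metadata['relu'] = True.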
+            self.merge(parent, node)
+        transformed_nodes = [node for node in nodes if node not in fused_nodes]
+        return graph.replaced(transformed_nodes)
+
+    def is_eligible_pair(self, parent, child):
+        '''Returns true if this parent/child pair is eligible for fusion.'''
+        raise NotImplementedError('Must be implemented by subclass.')
+
+    def merge(self, parent, child):
+        '''Merge the child node into the parent.'''
+        raise NotImplementedError('Must be implemented by subclass.')
+
+
+class ReLUFuser(SubNodeFuser):
+    '''
+    Fuses rectified linear units with their parent nodes.
+    '''
+
+    def __init__(self, allowed_parent_types=None):
+        # Fuse ReLUs when the parent node is one of the given types.
+        # If None, all node types are eligible.
+        self.allowed_parent_types = allowed_parent_types
+
+    def is_eligible_pair(self, parent, child):
+        return ((self.allowed_parent_types is None or \
+                 parent.kind in self.allowed_parent_types) and \
+                child.kind == NodeKind.ReLU)
+
+    def merge(self, parent, _):
+        parent.metadata['relu'] = True
+
+
+class BatchNormScaleBiasFuser(SubNodeFuser):
+    '''
+    The original batch normalization paper includes two learned
+    parameters: a scaling factor gamma and a bias beta.
+    Caffe's implementation does not include these two. However, it is commonly
+    replicated by adding a scaling+bias layer immediately after the batch norm.
+
+    This fuser merges the scaling+bias layer with the batch norm.
+    '''
+
+    def is_eligible_pair(self, parent, child):
+        return (parent.kind == NodeKind.BatchNorm and \
+                child.kind == NodeKind.Scale and \
+                child.parameters.axis == 1 and \
+                child.parameters.bias_term == True)
+
+    def merge(self, parent, child):
+        parent.scale_bias_node = child
+
+
+class BatchNormPreprocessor(object):
+    '''
+    Prescale batch normalization parameters.
+    Concatenate gamma (scale) and beta (bias) terms if set.
+    '''
+
+    def __call__(self, graph):
+        for node in graph.nodes:
+            if node.kind != NodeKind.BatchNorm:
+                continue
+            assert node.data is not None
+            assert len(node.data) == 3
+            node.data = [np.squeeze(i) for i in node.data]
+            mean, variance, scale = node.data
+            # Prescale the stats
+            scaling_factor = 1.0 / scale if scale != 0 else 0
+            mean *= scaling_factor
+            variance *= scaling_factor
+            # Replace with the updated values
+            node.data = [mean, variance]
+            if hasattr(node, 'scale_bias_node'):
+                # Include the scale and bias terms
+                gamma, beta = node.scale_bias_node.data
+                node.data += [np.squeeze(i) for i in [gamma, beta]]
+        return graph
+
+
+class NodeRenamer(object):
+    '''
+    Renames nodes in the graph using a given unary function that
+    accepts a node and returns its new name.
+    '''
+
+    def __init__(self, renamer):
+        self.renamer = renamer
+
+    def __call__(self, graph):
+        for node in graph.nodes:
+            node.name = self.renamer(node)
+        return graph
+
+
+class ParameterNamer(object):
+    '''
+    Convert layer data arrays to a dictionary mapping parameter names to their values.
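+
+    For example, a Convolution node's data list [w, b] becomes
+    {'weights': w, 'biases': b}.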
+ ''' + + def __call__(self, graph): + for node in graph.nodes: + if node.data is None: + continue + if node.kind in (NodeKind.Convolution, NodeKind.InnerProduct): + names = ('weights', ) + if node.parameters.bias_term: + names += ('biases', ) + elif node.kind == NodeKind.BatchNorm: + names = ('mean', 'variance') + if len(node.data) == 4: + names += ('scale', 'offset') + else: + warn('Unhandled parameters: {}'.format(node.kind)) + continue + assert len(names) == len(node.data) + node.data = dict(zip(names, node.data)) + return graph diff --git a/fluid/image_classification/caffe2fluid/proto/caffe.proto b/fluid/image_classification/caffe2fluid/proto/caffe.proto new file mode 100644 index 0000000000..18eb5ca649 --- /dev/null +++ b/fluid/image_classification/caffe2fluid/proto/caffe.proto @@ -0,0 +1,1411 @@ +syntax = "proto2"; + +package caffe; + +// Specifies the shape (dimensions) of a Blob. +message BlobShape { repeated int64 dim = 1 [ packed = true ]; } + +message BlobProto { + optional BlobShape shape = 7; + repeated float data = 5 [ packed = true ]; + repeated float diff = 6 [ packed = true ]; + repeated double double_data = 8 [ packed = true ]; + repeated double double_diff = 9 [ packed = true ]; + + // 4D dimensions -- deprecated. Use "shape" instead. + optional int32 num = 1 [ default = 0 ]; + optional int32 channels = 2 [ default = 0 ]; + optional int32 height = 3 [ default = 0 ]; + optional int32 width = 4 [ default = 0 ]; +} + +// The BlobProtoVector is simply a way to pass multiple blobproto instances +// around. +message BlobProtoVector { repeated BlobProto blobs = 1; } + +message Datum { + optional int32 channels = 1; + optional int32 height = 2; + optional int32 width = 3; + // the actual image data, in bytes + optional bytes data = 4; + optional int32 label = 5; + // Optionally, the datum could also hold float data. + repeated float float_data = 6; + // If true data contains an encoded image that need to be decoded + optional bool encoded = 7 [ default = false ]; +} + +message FillerParameter { + // The filler type. + optional string type = 1 [ default = 'constant' ]; + optional float value = 2 [ default = 0 ]; // the value in constant filler + optional float min = 3 [ default = 0 ]; // the min value in uniform filler + optional float max = 4 [ default = 1 ]; // the max value in uniform filler + optional float mean = 5 [ default = 0 ]; // the mean value in Gaussian filler + optional float std = 6 [ default = 1 ]; // the std value in Gaussian filler + // The expected number of non-zero output weights for a given input in + // Gaussian filler -- the default -1 means don't perform sparsification. + optional int32 sparse = 7 [ default = -1 ]; + // Normalize the filler variance by fan_in, fan_out, or their average. + // Applies to 'xavier' and 'msra' fillers. + enum VarianceNorm { + FAN_IN = 0; + FAN_OUT = 1; + AVERAGE = 2; + } + optional VarianceNorm variance_norm = 8 [ default = FAN_IN ]; +} + +message NetParameter { + optional string name = 1; // consider giving the network a name + // DEPRECATED. See InputParameter. The input blobs to the network. + repeated string input = 3; + // DEPRECATED. See InputParameter. The shape of the input blobs. + repeated BlobShape input_shape = 8; + + // 4D input dimensions -- deprecated. Use "input_shape" instead. + // If specified, for each input blob there should be four + // values specifying the num, channels, height and width of the input blob. + // Thus, there should be a total of (4 * #input) numbers. 
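+  // For example, a single 3-channel 224x224 image input would be specified as:
+  //   input_dim: 1
+  //   input_dim: 3
+  //   input_dim: 224
+  //   input_dim: 224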
+ repeated int32 input_dim = 4; + + // Whether the network will force every layer to carry out backward operation. + // If set False, then whether to carry out backward is determined + // automatically according to the net structure and learning rates. + optional bool force_backward = 5 [ default = false ]; + // The current "state" of the network, including the phase, level, and stage. + // Some layers may be included/excluded depending on this state and the states + // specified in the layers' include and exclude fields. + optional NetState state = 6; + + // Print debugging information about results while running Net::Forward, + // Net::Backward, and Net::Update. + optional bool debug_info = 7 [ default = false ]; + + // The layers that make up the net. Each of their configurations, including + // connectivity and behavior, is specified as a LayerParameter. + repeated LayerParameter layer = 100; // ID 100 so layers are printed last. + + // DEPRECATED: use 'layer' instead. + repeated V1LayerParameter layers = 2; +} + +// NOTE +// Update the next available ID when you add a new SolverParameter field. +// +// SolverParameter next available ID: 42 (last added: layer_wise_reduce) +message SolverParameter { + ////////////////////////////////////////////////////////////////////////////// + // Specifying the train and test networks + // + // Exactly one train net must be specified using one of the following fields: + // train_net_param, train_net, net_param, net + // One or more test nets may be specified using any of the following fields: + // test_net_param, test_net, net_param, net + // If more than one test net field is specified (e.g., both net and + // test_net are specified), they will be evaluated in the field order given + // above: (1) test_net_param, (2) test_net, (3) net_param/net. + // A test_iter must be specified for each test_net. + // A test_level and/or a test_stage may also be specified for each test_net. + ////////////////////////////////////////////////////////////////////////////// + + // Proto filename for the train net, possibly combined with one or more + // test nets. + optional string net = 24; + // Inline train net param, possibly combined with one or more test nets. + optional NetParameter net_param = 25; + + optional string train_net = 1; // Proto filename for the train net. + repeated string test_net = 2; // Proto filenames for the test nets. + optional NetParameter train_net_param = 21; // Inline train net params. + repeated NetParameter test_net_param = 22; // Inline test net params. + + // The states for the train/test nets. Must be unspecified or + // specified once per net. + // + // By default, train_state will have phase = TRAIN, + // and all test_state's will have phase = TEST. + // Other defaults are set according to the NetState defaults. + optional NetState train_state = 26; + repeated NetState test_state = 27; + + // The number of iterations for each test net. + repeated int32 test_iter = 3; + + // The number of iterations between two testing phases. + optional int32 test_interval = 4 [ default = 0 ]; + optional bool test_compute_loss = 19 [ default = false ]; + // If true, run an initial test pass before the first iteration, + // ensuring memory availability and printing the starting value of the loss. + optional bool test_initialization = 32 [ default = true ]; + optional float base_lr = 5; // The base learning rate + // the number of iterations between displaying info. If display = 0, no info + // will be displayed. 
+ optional int32 display = 6; + // Display the loss averaged over the last average_loss iterations + optional int32 average_loss = 33 [ default = 1 ]; + optional int32 max_iter = 7; // the maximum number of iterations + // accumulate gradients over `iter_size` x `batch_size` instances + optional int32 iter_size = 36 [ default = 1 ]; + + // The learning rate decay policy. The currently implemented learning rate + // policies are as follows: + // - fixed: always return base_lr. + // - step: return base_lr * gamma ^ (floor(iter / step)) + // - exp: return base_lr * gamma ^ iter + // - inv: return base_lr * (1 + gamma * iter) ^ (- power) + // - multistep: similar to step but it allows non uniform steps defined by + // stepvalue + // - poly: the effective learning rate follows a polynomial decay, to be + // zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power) + // - sigmoid: the effective learning rate follows a sigmod decay + // return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize)))) + // + // where base_lr, max_iter, gamma, step, stepvalue and power are defined + // in the solver parameter protocol buffer, and iter is the current iteration. + optional string lr_policy = 8; + optional float gamma = 9; // The parameter to compute the learning rate. + optional float power = 10; // The parameter to compute the learning rate. + optional float momentum = 11; // The momentum value. + optional float weight_decay = 12; // The weight decay. + // regularization types supported: L1 and L2 + // controlled by weight_decay + optional string regularization_type = 29 [ default = "L2" ]; + // the stepsize for learning rate policy "step" + optional int32 stepsize = 13; + // the stepsize for learning rate policy "multistep" + repeated int32 stepvalue = 34; + + // Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm, + // whenever their actual L2 norm is larger. + optional float clip_gradients = 35 [ default = -1 ]; + + optional int32 snapshot = 14 [ default = 0 ]; // The snapshot interval + optional string snapshot_prefix = 15; // The prefix for the snapshot. + // whether to snapshot diff in the results or not. Snapshotting diff will help + // debugging but the final protocol buffer size will be much larger. + optional bool snapshot_diff = 16 [ default = false ]; + enum SnapshotFormat { + HDF5 = 0; + BINARYPROTO = 1; + } + optional SnapshotFormat snapshot_format = 37 [ default = BINARYPROTO ]; + // the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default. + enum SolverMode { + CPU = 0; + GPU = 1; + } + optional SolverMode solver_mode = 17 [ default = GPU ]; + // the device_id will that be used in GPU mode. Use device_id = 0 in default. + optional int32 device_id = 18 [ default = 0 ]; + // If non-negative, the seed with which the Solver will initialize the Caffe + // random number generator -- useful for reproducible results. Otherwise, + // (and by default) initialize using a seed derived from the system clock. 
+ optional int64 random_seed = 20 [ default = -1 ]; + + // type of the solver + optional string type = 40 [ default = "SGD" ]; + + // numerical stability for RMSProp, AdaGrad and AdaDelta and Adam + optional float delta = 31 [ default = 1e-8 ]; + // parameters for the Adam solver + optional float momentum2 = 39 [ default = 0.999 ]; + + // RMSProp decay value + // MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t) + optional float rms_decay = 38 [ default = 0.99 ]; + + // If true, print information about the state of the net that may help with + // debugging learning problems. + optional bool debug_info = 23 [ default = false ]; + + // If false, don't save a snapshot after training finishes. + optional bool snapshot_after_train = 28 [ default = true ]; + + // DEPRECATED: old solver enum types, use string instead + enum SolverType { + SGD = 0; + NESTEROV = 1; + ADAGRAD = 2; + RMSPROP = 3; + ADADELTA = 4; + ADAM = 5; + } + // DEPRECATED: use type instead of solver_type + optional SolverType solver_type = 30 [ default = SGD ]; + + // Overlap compute and communication for data parallel training + optional bool layer_wise_reduce = 41 [ default = true ]; +} + +// A message that stores the solver snapshots +message SolverState { + optional int32 iter = 1; // The current iteration + optional string learned_net = 2; // The file that stores the learned net. + repeated BlobProto history = 3; // The history for sgd solvers + optional int32 current_step = 4 + [ default = 0 ]; // The current step for learning rate +} + +enum Phase { + TRAIN = 0; + TEST = 1; +} + +message NetState { + optional Phase phase = 1 [ default = TEST ]; + optional int32 level = 2 [ default = 0 ]; + repeated string stage = 3; +} + +message NetStateRule { + // Set phase to require the NetState have a particular phase (TRAIN or TEST) + // to meet this rule. + optional Phase phase = 1; + + // Set the minimum and/or maximum levels in which the layer should be used. + // Leave undefined to meet the rule regardless of level. + optional int32 min_level = 2; + optional int32 max_level = 3; + + // Customizable sets of stages to include or exclude. + // The net must have ALL of the specified stages and NONE of the specified + // "not_stage"s to meet the rule. + // (Use multiple NetStateRules to specify conjunctions of stages.) + repeated string stage = 4; + repeated string not_stage = 5; +} + +// Specifies training parameters (multipliers on global learning constants, +// and the name and other settings used for weight sharing). +message ParamSpec { + // The names of the parameter blobs -- useful for sharing parameters among + // layers, but never required otherwise. To share a parameter between two + // layers, give it a (non-empty) name. + optional string name = 1; + + // Whether to require shared weights to have the same shape, or just the same + // count -- defaults to STRICT if unspecified. + optional DimCheckMode share_mode = 2; + enum DimCheckMode { + // STRICT (default) requires that num, channels, height, width each match. + STRICT = 0; + // PERMISSIVE requires only the count (num*channels*height*width) to match. + PERMISSIVE = 1; + } + + // The multiplier on the global learning rate for this parameter. + optional float lr_mult = 3 [ default = 1.0 ]; + + // The multiplier on the global weight decay for this parameter. + optional float decay_mult = 4 [ default = 1.0 ]; +} + +// NOTE +// Update the next available ID when you add a new LayerParameter field. 
+// +// LayerParameter next available layer-specific ID: 147 (last added: +// recurrent_param) +message LayerParameter { + optional string name = 1; // the layer name + optional string type = 2; // the layer type + repeated string bottom = 3; // the name of each bottom blob + repeated string top = 4; // the name of each top blob + + // The train / test phase for computation. + optional Phase phase = 10; + + // The amount of weight to assign each top blob in the objective. + // Each layer assigns a default value, usually of either 0 or 1, + // to each top blob. + repeated float loss_weight = 5; + + // Specifies training parameters (multipliers on global learning constants, + // and the name and other settings used for weight sharing). + repeated ParamSpec param = 6; + + // The blobs containing the numeric parameters of the layer. + repeated BlobProto blobs = 7; + + // Specifies whether to backpropagate to each bottom. If unspecified, + // Caffe will automatically infer whether each input needs backpropagation + // to compute parameter gradients. If set to true for some inputs, + // backpropagation to those inputs is forced; if set false for some inputs, + // backpropagation to those inputs is skipped. + // + // The size must be either 0 or equal to the number of bottoms. + repeated bool propagate_down = 11; + + // Rules controlling whether and when a layer is included in the network, + // based on the current NetState. You may specify a non-zero number of rules + // to include OR exclude, but not both. If no include or exclude rules are + // specified, the layer is always included. If the current NetState meets + // ANY (i.e., one or more) of the specified rules, the layer is + // included/excluded. + repeated NetStateRule include = 8; + repeated NetStateRule exclude = 9; + + // Parameters for data pre-processing. + optional TransformationParameter transform_param = 100; + + // Parameters shared by loss layers. + optional LossParameter loss_param = 101; + + // Layer type-specific parameters. + // + // Note: certain layers may have more than one computational engine + // for their implementation. These layers include an Engine type and + // engine parameter for selecting the implementation. + // The default for the engine is set by the ENGINE switch at compile-time. 
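+  //
+  // For example, a convolution layer selects its type-specific message
+  // like this (hypothetical prototxt values):
+  //   layer {
+  //     name: "conv1"  type: "Convolution"  bottom: "data"  top: "conv1"
+  //     convolution_param { num_output: 64  kernel_size: 3 }
+  //   }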
+ optional AccuracyParameter accuracy_param = 102; + optional ArgMaxParameter argmax_param = 103; + optional BatchNormParameter batch_norm_param = 139; + optional BiasParameter bias_param = 141; + optional ConcatParameter concat_param = 104; + optional ContrastiveLossParameter contrastive_loss_param = 105; + optional ConvolutionParameter convolution_param = 106; + optional CropParameter crop_param = 144; + optional DataParameter data_param = 107; + optional DropoutParameter dropout_param = 108; + optional DummyDataParameter dummy_data_param = 109; + optional EltwiseParameter eltwise_param = 110; + optional ELUParameter elu_param = 140; + optional EmbedParameter embed_param = 137; + optional ExpParameter exp_param = 111; + optional FlattenParameter flatten_param = 135; + optional HDF5DataParameter hdf5_data_param = 112; + optional HDF5OutputParameter hdf5_output_param = 113; + optional HingeLossParameter hinge_loss_param = 114; + optional ImageDataParameter image_data_param = 115; + optional InfogainLossParameter infogain_loss_param = 116; + optional InnerProductParameter inner_product_param = 117; + optional InputParameter input_param = 143; + optional LogParameter log_param = 134; + optional LRNParameter lrn_param = 118; + optional MemoryDataParameter memory_data_param = 119; + optional MVNParameter mvn_param = 120; + optional ParameterParameter parameter_param = 145; + optional PoolingParameter pooling_param = 121; + optional PowerParameter power_param = 122; + optional PReLUParameter prelu_param = 131; + optional PythonParameter python_param = 130; + optional RecurrentParameter recurrent_param = 146; + optional ReductionParameter reduction_param = 136; + optional ReLUParameter relu_param = 123; + optional ReshapeParameter reshape_param = 133; + optional ScaleParameter scale_param = 142; + optional SigmoidParameter sigmoid_param = 124; + optional SoftmaxParameter softmax_param = 125; + optional SPPParameter spp_param = 132; + optional SliceParameter slice_param = 126; + optional TanHParameter tanh_param = 127; + optional ThresholdParameter threshold_param = 128; + optional TileParameter tile_param = 138; + optional WindowDataParameter window_data_param = 129; +} + +// Message that stores parameters used to apply transformation +// to the data layer's data +message TransformationParameter { + // For data pre-processing, we can do simple scaling and subtracting the + // data mean, if provided. Note that the mean subtraction is always carried + // out before scaling. + optional float scale = 1 [ default = 1 ]; + // Specify if we want to randomly mirror data. + optional bool mirror = 2 [ default = false ]; + // Specify if we would like to randomly crop an image. + optional uint32 crop_size = 3 [ default = 0 ]; + // mean_file and mean_value cannot be specified at the same time + optional string mean_file = 4; + // if specified can be repeated once (would subtract it from all the channels) + // or can be repeated the same number of times as channels + // (would subtract them from the corresponding channel) + repeated float mean_value = 5; + // Force the decoded image to have 3 color channels. + optional bool force_color = 6 [ default = false ]; + // Force the decoded image to have 1 color channels. + optional bool force_gray = 7 [ default = false ]; +} + +// Message that stores parameters shared by loss layers +message LossParameter { + // If specified, ignore instances with the given label. 
+ optional int32 ignore_label = 1; + // How to normalize the loss for loss layers that aggregate across batches, + // spatial dimensions, or other dimensions. Currently only implemented in + // SoftmaxWithLoss and SigmoidCrossEntropyLoss layers. + enum NormalizationMode { + // Divide by the number of examples in the batch times spatial dimensions. + // Outputs that receive the ignore label will NOT be ignored in computing + // the normalization factor. + FULL = 0; + // Divide by the total number of output locations that do not take the + // ignore_label. If ignore_label is not set, this behaves like FULL. + VALID = 1; + // Divide by the batch size. + BATCH_SIZE = 2; + // Do not normalize the loss. + NONE = 3; + } + // For historical reasons, the default normalization for + // SigmoidCrossEntropyLoss is BATCH_SIZE and *not* VALID. + optional NormalizationMode normalization = 3 [ default = VALID ]; + // Deprecated. Ignored if normalization is specified. If normalization + // is not specified, then setting this to false will be equivalent to + // normalization = BATCH_SIZE to be consistent with previous behavior. + optional bool normalize = 2; +} + +// Messages that store parameters used by individual layer types follow, in +// alphabetical order. + +message AccuracyParameter { + // When computing accuracy, count as correct by comparing the true label to + // the top k scoring classes. By default, only compare to the top scoring + // class (i.e. argmax). + optional uint32 top_k = 1 [ default = 1 ]; + + // The "label" axis of the prediction blob, whose argmax corresponds to the + // predicted label -- may be negative to index from the end (e.g., -1 for the + // last axis). For example, if axis == 1 and the predictions are + // (N x C x H x W), the label blob is expected to contain N*H*W ground truth + // labels with integer values in {0, 1, ..., C-1}. + optional int32 axis = 2 [ default = 1 ]; + + // If specified, ignore instances with the given label. + optional int32 ignore_label = 3; +} + +message ArgMaxParameter { + // If true produce pairs (argmax, maxval) + optional bool out_max_val = 1 [ default = false ]; + optional uint32 top_k = 2 [ default = 1 ]; + // The axis along which to maximise -- may be negative to index from the + // end (e.g., -1 for the last axis). + // By default ArgMaxLayer maximizes over the flattened trailing dimensions + // for each index of the first / num dimension. + optional int32 axis = 3; +} + +message ConcatParameter { + // The axis along which to concatenate -- may be negative to index from the + // end (e.g., -1 for the last axis). Other axes must have the + // same dimension for all the bottom blobs. + // By default, ConcatLayer concatenates blobs along the "channels" axis (1). + optional int32 axis = 2 [ default = 1 ]; + + // DEPRECATED: alias for "axis" -- does not support negative indexing. + optional uint32 concat_dim = 1 [ default = 1 ]; +} + +message BatchNormParameter { + // If false, normalization is performed over the current mini-batch + // and global statistics are accumulated (but not yet used) by a moving + // average. + // If true, those accumulated mean and variance values are used for the + // normalization. + // By default, it is set to false when the network is in the training + // phase and true when the network is in the testing phase. + optional bool use_global_stats = 1; + // What fraction of the moving average remains each iteration? + // Smaller values make the moving average decay faster, giving more + // weight to the recent values. 
+ // Each iteration updates the moving average @f$S_{t-1}@f$ with the + // current mean @f$ Y_t @f$ by + // @f$ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} @f$, where @f$ \beta @f$ + // is the moving_average_fraction parameter. + optional float moving_average_fraction = 2 [ default = .999 ]; + // Small value to add to the variance estimate so that we don't divide by + // zero. + optional float eps = 3 [ default = 1e-5 ]; +} + +message BiasParameter { + // The first axis of bottom[0] (the first input Blob) along which to apply + // bottom[1] (the second input Blob). May be negative to index from the end + // (e.g., -1 for the last axis). + // + // For example, if bottom[0] is 4D with shape 100x3x40x60, the output + // top[0] will have the same shape, and bottom[1] may have any of the + // following shapes (for the given value of axis): + // (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60 + // (axis == 1 == -3) 3; 3x40; 3x40x60 + // (axis == 2 == -2) 40; 40x60 + // (axis == 3 == -1) 60 + // Furthermore, bottom[1] may have the empty shape (regardless of the value of + // "axis") -- a scalar bias. + optional int32 axis = 1 [ default = 1 ]; + + // (num_axes is ignored unless just one bottom is given and the bias is + // a learned parameter of the layer. Otherwise, num_axes is determined by the + // number of axes by the second bottom.) + // The number of axes of the input (bottom[0]) covered by the bias + // parameter, or -1 to cover all axes of bottom[0] starting from `axis`. + // Set num_axes := 0, to add a zero-axis Blob: a scalar. + optional int32 num_axes = 2 [ default = 1 ]; + + // (filler is ignored unless just one bottom is given and the bias is + // a learned parameter of the layer.) + // The initialization for the learned bias parameter. + // Default is the zero (0) initialization, resulting in the BiasLayer + // initially performing the identity operation. + optional FillerParameter filler = 3; +} + +message ContrastiveLossParameter { + // margin for dissimilar pair + optional float margin = 1 [ default = 1.0 ]; + // The first implementation of this cost did not exactly match the cost of + // Hadsell et al 2006 -- using (margin - d^2) instead of (margin - d)^2. + // legacy_version = false (the default) uses (margin - d)^2 as proposed in the + // Hadsell paper. New models should probably use this version. + // legacy_version = true uses (margin - d^2). This is kept to support / + // reproduce existing models and results + optional bool legacy_version = 2 [ default = false ]; +} + +message ConvolutionParameter { + optional uint32 num_output = 1; // The number of outputs for the layer + optional bool bias_term = 2 [ default = true ]; // whether to have bias terms + + // Pad, kernel size, and stride are all given as a single value for equal + // dimensions in all spatial dimensions, or once per spatial dimension. + repeated uint32 pad = 3; // The padding size; defaults to 0 + repeated uint32 kernel_size = 4; // The kernel size + repeated uint32 stride = 6; // The stride; defaults to 1 + // Factor used to dilate the kernel, (implicitly) zero-filling the resulting + // holes. (Kernel dilation is sometimes referred to by its use in the + // algorithme à trous from Holschneider et al. 1987.) + repeated uint32 dilation = 18; // The dilation; defaults to 1 + + // For 2D convolution only, the *_h and *_w versions may also be used to + // specify both spatial dimensions. 
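+  // For example (illustrative values), a rectangular 3x5 kernel:
+  //   convolution_param { num_output: 32 kernel_h: 3 kernel_w: 5 }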
+ optional uint32 pad_h = 9 [ default = 0 ]; // The padding height (2D only) + optional uint32 pad_w = 10 [ default = 0 ]; // The padding width (2D only) + optional uint32 kernel_h = 11; // The kernel height (2D only) + optional uint32 kernel_w = 12; // The kernel width (2D only) + optional uint32 stride_h = 13; // The stride height (2D only) + optional uint32 stride_w = 14; // The stride width (2D only) + + optional uint32 group = 5 [ default = 1 ]; // The group size for group conv + + optional FillerParameter weight_filler = 7; // The filler for the weight + optional FillerParameter bias_filler = 8; // The filler for the bias + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 15 [ default = DEFAULT ]; + + // The axis to interpret as "channels" when performing convolution. + // Preceding dimensions are treated as independent inputs; + // succeeding dimensions are treated as "spatial". + // With (N, C, H, W) inputs, and axis == 1 (the default), we perform + // N independent 2D convolutions, sliding C-channel (or (C/g)-channels, for + // groups g>1) filters across the spatial axes (H, W) of the input. + // With (N, C, D, H, W) inputs, and axis == 1, we perform + // N independent 3D convolutions, sliding (C/g)-channels + // filters across the spatial axes (D, H, W) of the input. + optional int32 axis = 16 [ default = 1 ]; + + // Whether to force use of the general ND convolution, even if a specific + // implementation for blobs of the appropriate number of spatial dimensions + // is available. (Currently, there is only a 2D-specific convolution + // implementation; for input blobs with num_axes != 2, this option is + // ignored and the ND implementation will be used.) + optional bool force_nd_im2col = 17 [ default = false ]; +} + +message CropParameter { + // To crop, elements of the first bottom are selected to fit the dimensions + // of the second, reference bottom. The crop is configured by + // - the crop `axis` to pick the dimensions for cropping + // - the crop `offset` to set the shift for all/each dimension + // to align the cropped bottom with the reference bottom. + // All dimensions up to but excluding `axis` are preserved, while + // the dimensions including and trailing `axis` are cropped. + // If only one `offset` is set, then all dimensions are offset by this amount. + // Otherwise, the number of offsets must equal the number of cropped axes to + // shift the crop in each dimension accordingly. + // Note: standard dimensions are N,C,H,W so the default is a spatial crop, + // and `axis` may be negative to index from the end (e.g., -1 for the last + // axis). + optional int32 axis = 1 [ default = 2 ]; + repeated uint32 offset = 2; +} + +message DataParameter { + enum DB { + LEVELDB = 0; + LMDB = 1; + } + // Specify the data source. + optional string source = 1; + // Specify the batch size. + optional uint32 batch_size = 4; + // The rand_skip variable is for the data layer to skip a few data points + // to avoid all asynchronous sgd clients to start at the same point. The skip + // point would be set as rand_skip * rand(0,1). Note that rand_skip should not + // be larger than the number of keys in the database. + // DEPRECATED. Each solver accesses a different subset of the database. + optional uint32 rand_skip = 7 [ default = 0 ]; + optional DB backend = 8 [ default = LEVELDB ]; + // DEPRECATED. See TransformationParameter. For data pre-processing, we can do + // simple scaling and subtracting the data mean, if provided. 
Note that the + // mean subtraction is always carried out before scaling. + optional float scale = 2 [ default = 1 ]; + optional string mean_file = 3; + // DEPRECATED. See TransformationParameter. Specify if we would like to + // randomly + // crop an image. + optional uint32 crop_size = 5 [ default = 0 ]; + // DEPRECATED. See TransformationParameter. Specify if we want to randomly + // mirror + // data. + optional bool mirror = 6 [ default = false ]; + // Force the encoded image to have 3 color channels + optional bool force_encoded_color = 9 [ default = false ]; + // Prefetch queue (Increase if data feeding bandwidth varies, within the + // limit of device memory for GPU training) + optional uint32 prefetch = 10 [ default = 4 ]; +} + +message DropoutParameter { + optional float dropout_ratio = 1 [ default = 0.5 ]; // dropout ratio +} + +// DummyDataLayer fills any number of arbitrarily shaped blobs with random +// (or constant) data generated by "Fillers" (see "message FillerParameter"). +message DummyDataParameter { + // This layer produces N >= 1 top blobs. DummyDataParameter must specify 1 or + // N + // shape fields, and 0, 1 or N data_fillers. + // + // If 0 data_fillers are specified, ConstantFiller with a value of 0 is used. + // If 1 data_filler is specified, it is applied to all top blobs. If N are + // specified, the ith is applied to the ith top blob. + repeated FillerParameter data_filler = 1; + repeated BlobShape shape = 6; + + // 4D dimensions -- deprecated. Use "shape" instead. + repeated uint32 num = 2; + repeated uint32 channels = 3; + repeated uint32 height = 4; + repeated uint32 width = 5; +} + +message EltwiseParameter { + enum EltwiseOp { + PROD = 0; + SUM = 1; + MAX = 2; + } + optional EltwiseOp operation = 1 [ default = SUM ]; // element-wise operation + repeated float coeff = 2; // blob-wise coefficient for SUM operation + + // Whether to use an asymptotically slower (for >2 inputs) but stabler method + // of computing the gradient for the PROD operation. (No effect for SUM op.) + optional bool stable_prod_grad = 3 [ default = true ]; +} + +// Message that stores parameters used by ELULayer +message ELUParameter { + // Described in: + // Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate + // Deep Network Learning by Exponential Linear Units (ELUs). arXiv + optional float alpha = 1 [ default = 1 ]; +} + +// Message that stores parameters used by EmbedLayer +message EmbedParameter { + optional uint32 num_output = 1; // The number of outputs for the layer + // The input is given as integers to be interpreted as one-hot + // vector indices with dimension num_input. Hence num_input should be + // 1 greater than the maximum possible input value. + optional uint32 input_dim = 2; + + optional bool bias_term = 3 [ default = true ]; // Whether to use a bias term + optional FillerParameter weight_filler = 4; // The filler for the weight + optional FillerParameter bias_filler = 5; // The filler for the bias +} + +// Message that stores parameters used by ExpLayer +message ExpParameter { + // ExpLayer computes outputs y = base ^ (shift + scale * x), for base > 0. + // Or if base is set to the default (-1), base is set to e, + // so y = exp(shift + scale * x). 
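+  // For example (illustrative), exp_param { scale: 2.0 } with the default
+  // base computes y = e^(2x).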
+ optional float base = 1 [ default = -1.0 ]; + optional float scale = 2 [ default = 1.0 ]; + optional float shift = 3 [ default = 0.0 ]; +} + +/// Message that stores parameters used by FlattenLayer +message FlattenParameter { + // The first axis to flatten: all preceding axes are retained in the output. + // May be negative to index from the end (e.g., -1 for the last axis). + optional int32 axis = 1 [ default = 1 ]; + + // The last axis to flatten: all following axes are retained in the output. + // May be negative to index from the end (e.g., the default -1 for the last + // axis). + optional int32 end_axis = 2 [ default = -1 ]; +} + +// Message that stores parameters used by HDF5DataLayer +message HDF5DataParameter { + // Specify the data source. + optional string source = 1; + // Specify the batch size. + optional uint32 batch_size = 2; + + // Specify whether to shuffle the data. + // If shuffle == true, the ordering of the HDF5 files is shuffled, + // and the ordering of data within any given HDF5 file is shuffled, + // but data between different files are not interleaved; all of a file's + // data are output (in a random order) before moving onto another file. + optional bool shuffle = 3 [ default = false ]; +} + +message HDF5OutputParameter { optional string file_name = 1; } + +message HingeLossParameter { + enum Norm { + L1 = 1; + L2 = 2; + } + // Specify the Norm to use L1 or L2 + optional Norm norm = 1 [ default = L1 ]; +} + +message ImageDataParameter { + // Specify the data source. + optional string source = 1; + // Specify the batch size. + optional uint32 batch_size = 4 [ default = 1 ]; + // The rand_skip variable is for the data layer to skip a few data points + // to avoid all asynchronous sgd clients to start at the same point. The skip + // point would be set as rand_skip * rand(0,1). Note that rand_skip should not + // be larger than the number of keys in the database. + optional uint32 rand_skip = 7 [ default = 0 ]; + // Whether or not ImageLayer should shuffle the list of files at every epoch. + optional bool shuffle = 8 [ default = false ]; + // It will also resize images if new_height or new_width are not zero. + optional uint32 new_height = 9 [ default = 0 ]; + optional uint32 new_width = 10 [ default = 0 ]; + // Specify if the images are color or gray + optional bool is_color = 11 [ default = true ]; + // DEPRECATED. See TransformationParameter. For data pre-processing, we can do + // simple scaling and subtracting the data mean, if provided. Note that the + // mean subtraction is always carried out before scaling. + optional float scale = 2 [ default = 1 ]; + optional string mean_file = 3; + // DEPRECATED. See TransformationParameter. Specify if we would like to + // randomly + // crop an image. + optional uint32 crop_size = 5 [ default = 0 ]; + // DEPRECATED. See TransformationParameter. Specify if we want to randomly + // mirror + // data. + optional bool mirror = 6 [ default = false ]; + optional string root_folder = 12 [ default = "" ]; +} + +message InfogainLossParameter { + // Specify the infogain matrix source. 
+ optional string source = 1; + optional int32 axis = 2 [ default = 1 ]; // axis of prob +} + +message InnerProductParameter { + optional uint32 num_output = 1; // The number of outputs for the layer + optional bool bias_term = 2 [ default = true ]; // whether to have bias terms + optional FillerParameter weight_filler = 3; // The filler for the weight + optional FillerParameter bias_filler = 4; // The filler for the bias + + // The first axis to be lumped into a single inner product computation; + // all preceding axes are retained in the output. + // May be negative to index from the end (e.g., -1 for the last axis). + optional int32 axis = 5 [ default = 1 ]; + // Specify whether to transpose the weight matrix or not. + // If transpose == true, any operations will be performed on the transpose + // of the weight matrix. The weight matrix itself is not going to be + // transposed + // but rather the transfer flag of operations will be toggled accordingly. + optional bool transpose = 6 [ default = false ]; +} + +message InputParameter { + // This layer produces N >= 1 top blob(s) to be assigned manually. + // Define N shapes to set a shape for each top. + // Define 1 shape to set the same shape for every top. + // Define no shape to defer to reshaping manually. + repeated BlobShape shape = 1; +} + +// Message that stores parameters used by LogLayer +message LogParameter { + // LogLayer computes outputs y = log_base(shift + scale * x), for base > 0. + // Or if base is set to the default (-1), base is set to e, + // so y = ln(shift + scale * x) = log_e(shift + scale * x) + optional float base = 1 [ default = -1.0 ]; + optional float scale = 2 [ default = 1.0 ]; + optional float shift = 3 [ default = 0.0 ]; +} + +// Message that stores parameters used by LRNLayer +message LRNParameter { + optional uint32 local_size = 1 [ default = 5 ]; + optional float alpha = 2 [ default = 1. ]; + optional float beta = 3 [ default = 0.75 ]; + enum NormRegion { + ACROSS_CHANNELS = 0; + WITHIN_CHANNEL = 1; + } + optional NormRegion norm_region = 4 [ default = ACROSS_CHANNELS ]; + optional float k = 5 [ default = 1. ]; + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 6 [ default = DEFAULT ]; +} + +message MemoryDataParameter { + optional uint32 batch_size = 1; + optional uint32 channels = 2; + optional uint32 height = 3; + optional uint32 width = 4; +} + +message MVNParameter { + // This parameter can be set to false to normalize mean only + optional bool normalize_variance = 1 [ default = true ]; + + // This parameter can be set to true to perform DNN-like MVN + optional bool across_channels = 2 [ default = false ]; + + // Epsilon for not dividing by zero while normalizing variance + optional float eps = 3 [ default = 1e-9 ]; +} + +message ParameterParameter { optional BlobShape shape = 1; } + +message PoolingParameter { + enum PoolMethod { + MAX = 0; + AVE = 1; + STOCHASTIC = 2; + } + optional PoolMethod pool = 1 [ default = MAX ]; // The pooling method + // Pad, kernel size, and stride are all given as a single value for equal + // dimensions in height and width or as Y, X pairs. 
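+  // For example (illustrative values), 3x3 max pooling with stride 2:
+  //   pooling_param { pool: MAX kernel_size: 3 stride: 2 }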
+ optional uint32 pad = 4 [ default = 0 ]; // The padding size (equal in Y, X) + optional uint32 pad_h = 9 [ default = 0 ]; // The padding height + optional uint32 pad_w = 10 [ default = 0 ]; // The padding width + optional uint32 kernel_size = 2; // The kernel size (square) + optional uint32 kernel_h = 5; // The kernel height + optional uint32 kernel_w = 6; // The kernel width + optional uint32 stride = 3 [ default = 1 ]; // The stride (equal in Y, X) + optional uint32 stride_h = 7; // The stride height + optional uint32 stride_w = 8; // The stride width + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 11 [ default = DEFAULT ]; + // If global_pooling then it will pool over the size of the bottom by doing + // kernel_h = bottom->height and kernel_w = bottom->width + optional bool global_pooling = 12 [ default = false ]; +} + +message PowerParameter { + // PowerLayer computes outputs y = (shift + scale * x) ^ power. + optional float power = 1 [ default = 1.0 ]; + optional float scale = 2 [ default = 1.0 ]; + optional float shift = 3 [ default = 0.0 ]; +} + +message PythonParameter { + optional string module = 1; + optional string layer = 2; + // This value is set to the attribute `param_str` of the `PythonLayer` object + // in Python before calling the `setup()` method. This could be a number, + // string, dictionary in Python dict format, JSON, etc. You may parse this + // string in `setup` method and use it in `forward` and `backward`. + optional string param_str = 3 [ default = '']; + // DEPRECATED + optional bool share_in_parallel = 4 [ default = false ]; +} + +// Message that stores parameters used by RecurrentLayer +message RecurrentParameter { + // The dimension of the output (and usually hidden state) representation -- + // must be explicitly set to non-zero. + optional uint32 num_output = 1 [ default = 0 ]; + + optional FillerParameter weight_filler = 2; // The filler for the weight + optional FillerParameter bias_filler = 3; // The filler for the bias + + // Whether to enable displaying debug_info in the unrolled recurrent net. + optional bool debug_info = 4 [ default = false ]; + + // Whether to add as additional inputs (bottoms) the initial hidden state + // blobs, and add as additional outputs (tops) the final timestep hidden state + // blobs. The number of additional bottom/top blobs required depends on the + // recurrent architecture -- e.g., 1 for RNNs, 2 for LSTMs. + optional bool expose_hidden = 5 [ default = false ]; +} + +// Message that stores parameters used by ReductionLayer +message ReductionParameter { + enum ReductionOp { + SUM = 1; + ASUM = 2; + SUMSQ = 3; + MEAN = 4; + } + + optional ReductionOp operation = 1 [ default = SUM ]; // reduction operation + + // The first axis to reduce to a scalar -- may be negative to index from the + // end (e.g., -1 for the last axis). + // (Currently, only reduction along ALL "tail" axes is supported; reduction + // of axis M through N, where N < num_axes - 1, is unsupported.) + // Suppose we have an n-axis bottom Blob with shape: + // (d0, d1, d2, ..., d(m-1), dm, d(m+1), ..., d(n-1)). + // If axis == m, the output Blob will have shape + // (d0, d1, d2, ..., d(m-1)), + // and the ReductionOp operation is performed (d0 * d1 * d2 * ... * d(m-1)) + // times, each including (dm * d(m+1) * ... * d(n-1)) individual data. 
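+  // For example (illustrative), with a 2 x 3 x 4 bottom and axis == 1, the
+  // output Blob has shape (2), and each output element reduces
+  // 3 * 4 = 12 values.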
+ // If axis == 0 (the default), the output Blob always has the empty shape + // (count 1), performing reduction across the entire input -- + // often useful for creating new loss functions. + optional int32 axis = 2 [ default = 0 ]; + + optional float coeff = 3 [ default = 1.0 ]; // coefficient for output +} + +// Message that stores parameters used by ReLULayer +message ReLUParameter { + // Allow non-zero slope for negative inputs to speed up optimization + // Described in: + // Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities + // improve neural network acoustic models. In ICML Workshop on Deep Learning + // for Audio, Speech, and Language Processing. + optional float negative_slope = 1 [ default = 0 ]; + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 2 [ default = DEFAULT ]; +} + +message ReshapeParameter { + // Specify the output dimensions. If some of the dimensions are set to 0, + // the corresponding dimension from the bottom layer is used (unchanged). + // Exactly one dimension may be set to -1, in which case its value is + // inferred from the count of the bottom blob and the remaining dimensions. + // For example, suppose we want to reshape a 2D blob "input" with shape 2 x 8: + // + // layer { + // type: "Reshape" bottom: "input" top: "output" + // reshape_param { ... } + // } + // + // If "input" is 2D with shape 2 x 8, then the following reshape_param + // specifications are all equivalent, producing a 3D blob "output" with shape + // 2 x 2 x 4: + // + // reshape_param { shape { dim: 2 dim: 2 dim: 4 } } + // reshape_param { shape { dim: 0 dim: 2 dim: 4 } } + // reshape_param { shape { dim: 0 dim: 2 dim: -1 } } + // reshape_param { shape { dim: 0 dim:-1 dim: 4 } } + // + optional BlobShape shape = 1; + + // axis and num_axes control the portion of the bottom blob's shape that are + // replaced by (included in) the reshape. By default (axis == 0 and + // num_axes == -1), the entire bottom blob shape is included in the reshape, + // and hence the shape field must specify the entire output shape. + // + // axis may be non-zero to retain some portion of the beginning of the input + // shape (and may be negative to index from the end; e.g., -1 to begin the + // reshape after the last axis, including nothing in the reshape, + // -2 to include only the last axis, etc.). + // + // For example, suppose "input" is a 2D blob with shape 2 x 8. + // Then the following ReshapeLayer specifications are all equivalent, + // producing a blob "output" with shape 2 x 2 x 4: + // + // reshape_param { shape { dim: 2 dim: 2 dim: 4 } } + // reshape_param { shape { dim: 2 dim: 4 } axis: 1 } + // reshape_param { shape { dim: 2 dim: 4 } axis: -3 } + // + // num_axes specifies the extent of the reshape. + // If num_axes >= 0 (and axis >= 0), the reshape will be performed only on + // input axes in the range [axis, axis+num_axes]. + // num_axes may also be -1, the default, to include all remaining axes + // (starting from axis). + // + // For example, suppose "input" is a 2D blob with shape 2 x 8. + // Then the following ReshapeLayer specifications are equivalent, + // producing a blob "output" with shape 1 x 2 x 8. 
+ // + // reshape_param { shape { dim: 1 dim: 2 dim: 8 } } + // reshape_param { shape { dim: 1 dim: 2 } num_axes: 1 } + // reshape_param { shape { dim: 1 } num_axes: 0 } + // + // On the other hand, these would produce output blob shape 2 x 1 x 8: + // + // reshape_param { shape { dim: 2 dim: 1 dim: 8 } } + // reshape_param { shape { dim: 1 } axis: 1 num_axes: 0 } + // + optional int32 axis = 2 [ default = 0 ]; + optional int32 num_axes = 3 [ default = -1 ]; +} + +message ScaleParameter { + // The first axis of bottom[0] (the first input Blob) along which to apply + // bottom[1] (the second input Blob). May be negative to index from the end + // (e.g., -1 for the last axis). + // + // For example, if bottom[0] is 4D with shape 100x3x40x60, the output + // top[0] will have the same shape, and bottom[1] may have any of the + // following shapes (for the given value of axis): + // (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60 + // (axis == 1 == -3) 3; 3x40; 3x40x60 + // (axis == 2 == -2) 40; 40x60 + // (axis == 3 == -1) 60 + // Furthermore, bottom[1] may have the empty shape (regardless of the value of + // "axis") -- a scalar multiplier. + optional int32 axis = 1 [ default = 1 ]; + + // (num_axes is ignored unless just one bottom is given and the scale is + // a learned parameter of the layer. Otherwise, num_axes is determined by the + // number of axes by the second bottom.) + // The number of axes of the input (bottom[0]) covered by the scale + // parameter, or -1 to cover all axes of bottom[0] starting from `axis`. + // Set num_axes := 0, to multiply with a zero-axis Blob: a scalar. + optional int32 num_axes = 2 [ default = 1 ]; + + // (filler is ignored unless just one bottom is given and the scale is + // a learned parameter of the layer.) + // The initialization for the learned scale parameter. + // Default is the unit (1) initialization, resulting in the ScaleLayer + // initially performing the identity operation. + optional FillerParameter filler = 3; + + // Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but + // may be more efficient). Initialized with bias_filler (defaults to 0). + optional bool bias_term = 4 [ default = false ]; + optional FillerParameter bias_filler = 5; +} + +message SigmoidParameter { + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 1 [ default = DEFAULT ]; +} + +message SliceParameter { + // The axis along which to slice -- may be negative to index from the end + // (e.g., -1 for the last axis). + // By default, SliceLayer concatenates blobs along the "channels" axis (1). + optional int32 axis = 3 [ default = 1 ]; + repeated uint32 slice_point = 2; + + // DEPRECATED: alias for "axis" -- does not support negative indexing. + optional uint32 slice_dim = 1 [ default = 1 ]; +} + +// Message that stores parameters used by SoftmaxLayer, SoftmaxWithLossLayer +message SoftmaxParameter { + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 1 [ default = DEFAULT ]; + + // The axis along which to perform the softmax -- may be negative to index + // from the end (e.g., -1 for the last axis). + // Any other axes will be evaluated as independent softmaxes. + optional int32 axis = 2 [ default = 1 ]; +} + +message TanHParameter { + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 1 [ default = DEFAULT ]; +} + +// Message that stores parameters used by TileLayer +message TileParameter { + // The index of the axis to tile. 
+ optional int32 axis = 1 [ default = 1 ]; + + // The number of copies (tiles) of the blob to output. + optional int32 tiles = 2; +} + +// Message that stores parameters used by ThresholdLayer +message ThresholdParameter { + optional float threshold = 1 [ default = 0 ]; // Strictly positive values +} + +message WindowDataParameter { + // Specify the data source. + optional string source = 1; + // For data pre-processing, we can do simple scaling and subtracting the + // data mean, if provided. Note that the mean subtraction is always carried + // out before scaling. + optional float scale = 2 [ default = 1 ]; + optional string mean_file = 3; + // Specify the batch size. + optional uint32 batch_size = 4; + // Specify if we would like to randomly crop an image. + optional uint32 crop_size = 5 [ default = 0 ]; + // Specify if we want to randomly mirror data. + optional bool mirror = 6 [ default = false ]; + // Foreground (object) overlap threshold + optional float fg_threshold = 7 [ default = 0.5 ]; + // Background (non-object) overlap threshold + optional float bg_threshold = 8 [ default = 0.5 ]; + // Fraction of batch that should be foreground objects + optional float fg_fraction = 9 [ default = 0.25 ]; + // Amount of contextual padding to add around a window + // (used only by the window_data_layer) + optional uint32 context_pad = 10 [ default = 0 ]; + // Mode for cropping out a detection window + // warp: cropped window is warped to a fixed size and aspect ratio + // square: the tightest square around the window is cropped + optional string crop_mode = 11 [ default = "warp" ]; + // cache_images: will load all images in memory for faster access + optional bool cache_images = 12 [ default = false ]; + // append root_folder to locate images + optional string root_folder = 13 [ default = "" ]; +} + +message SPPParameter { + enum PoolMethod { + MAX = 0; + AVE = 1; + STOCHASTIC = 2; + } + optional uint32 pyramid_height = 1; + optional PoolMethod pool = 2 [ default = MAX ]; // The pooling method + enum Engine { + DEFAULT = 0; + CAFFE = 1; + CUDNN = 2; + } + optional Engine engine = 6 [ default = DEFAULT ]; +} + +// DEPRECATED: use LayerParameter. 
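+// (Kept, like V0LayerParameter below, for legacy support: nets serialized by
+// older versions of Caffe can still be parsed and upgraded.)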
+message V1LayerParameter { + repeated string bottom = 2; + repeated string top = 3; + optional string name = 4; + repeated NetStateRule include = 32; + repeated NetStateRule exclude = 33; + enum LayerType { + NONE = 0; + ABSVAL = 35; + ACCURACY = 1; + ARGMAX = 30; + BNLL = 2; + CONCAT = 3; + CONTRASTIVE_LOSS = 37; + CONVOLUTION = 4; + DATA = 5; + DECONVOLUTION = 39; + DROPOUT = 6; + DUMMY_DATA = 32; + EUCLIDEAN_LOSS = 7; + ELTWISE = 25; + EXP = 38; + FLATTEN = 8; + HDF5_DATA = 9; + HDF5_OUTPUT = 10; + HINGE_LOSS = 28; + IM2COL = 11; + IMAGE_DATA = 12; + INFOGAIN_LOSS = 13; + INNER_PRODUCT = 14; + LRN = 15; + MEMORY_DATA = 29; + MULTINOMIAL_LOGISTIC_LOSS = 16; + MVN = 34; + POOLING = 17; + POWER = 26; + RELU = 18; + SIGMOID = 19; + SIGMOID_CROSS_ENTROPY_LOSS = 27; + SILENCE = 36; + SOFTMAX = 20; + SOFTMAX_LOSS = 21; + SPLIT = 22; + SLICE = 33; + TANH = 23; + WINDOW_DATA = 24; + THRESHOLD = 31; + } + optional LayerType type = 5; + repeated BlobProto blobs = 6; + repeated string param = 1001; + repeated DimCheckMode blob_share_mode = 1002; + enum DimCheckMode { + STRICT = 0; + PERMISSIVE = 1; + } + repeated float blobs_lr = 7; + repeated float weight_decay = 8; + repeated float loss_weight = 35; + optional AccuracyParameter accuracy_param = 27; + optional ArgMaxParameter argmax_param = 23; + optional ConcatParameter concat_param = 9; + optional ContrastiveLossParameter contrastive_loss_param = 40; + optional ConvolutionParameter convolution_param = 10; + optional DataParameter data_param = 11; + optional DropoutParameter dropout_param = 12; + optional DummyDataParameter dummy_data_param = 26; + optional EltwiseParameter eltwise_param = 24; + optional ExpParameter exp_param = 41; + optional HDF5DataParameter hdf5_data_param = 13; + optional HDF5OutputParameter hdf5_output_param = 14; + optional HingeLossParameter hinge_loss_param = 29; + optional ImageDataParameter image_data_param = 15; + optional InfogainLossParameter infogain_loss_param = 16; + optional InnerProductParameter inner_product_param = 17; + optional LRNParameter lrn_param = 18; + optional MemoryDataParameter memory_data_param = 22; + optional MVNParameter mvn_param = 34; + optional PoolingParameter pooling_param = 19; + optional PowerParameter power_param = 21; + optional ReLUParameter relu_param = 30; + optional SigmoidParameter sigmoid_param = 38; + optional SoftmaxParameter softmax_param = 39; + optional SliceParameter slice_param = 31; + optional TanHParameter tanh_param = 37; + optional ThresholdParameter threshold_param = 25; + optional WindowDataParameter window_data_param = 20; + optional TransformationParameter transform_param = 36; + optional LossParameter loss_param = 42; + optional V0LayerParameter layer = 1; +} + +// DEPRECATED: V0LayerParameter is the old way of specifying layer parameters +// in Caffe. We keep this message type around for legacy support. +message V0LayerParameter { + optional string name = 1; // the layer name + optional string type = 2; // the string to specify the layer type + + // Parameters to specify layers with inner products. 
+ optional uint32 num_output = 3; // The number of outputs for the layer + optional bool biasterm = 4 [ default = true ]; // whether to have bias terms + optional FillerParameter weight_filler = 5; // The filler for the weight + optional FillerParameter bias_filler = 6; // The filler for the bias + + optional uint32 pad = 7 [ default = 0 ]; // The padding size + optional uint32 kernelsize = 8; // The kernel size + optional uint32 group = 9 [ default = 1 ]; // The group size for group conv + optional uint32 stride = 10 [ default = 1 ]; // The stride + enum PoolMethod { + MAX = 0; + AVE = 1; + STOCHASTIC = 2; + } + optional PoolMethod pool = 11 [ default = MAX ]; // The pooling method + optional float dropout_ratio = 12 [ default = 0.5 ]; // dropout ratio + + optional uint32 local_size = 13 [ default = 5 ]; // for local response norm + optional float alpha = 14 [ default = 1. ]; // for local response norm + optional float beta = 15 [ default = 0.75 ]; // for local response norm + optional float k = 22 [ default = 1. ]; + + // For data layers, specify the data source + optional string source = 16; + // For data pre-processing, we can do simple scaling and subtracting the + // data mean, if provided. Note that the mean subtraction is always carried + // out before scaling. + optional float scale = 17 [ default = 1 ]; + optional string meanfile = 18; + // For data layers, specify the batch size. + optional uint32 batchsize = 19; + // For data layers, specify if we would like to randomly crop an image. + optional uint32 cropsize = 20 [ default = 0 ]; + // For data layers, specify if we want to randomly mirror data. + optional bool mirror = 21 [ default = false ]; + + // The blobs containing the numeric parameters of the layer + repeated BlobProto blobs = 50; + // The ratio that is multiplied on the global learning rate. If you want to + // set the learning ratio for one blob, you need to set it for all blobs. + repeated float blobs_lr = 51; + // The weight decay that is multiplied on the global weight decay. + repeated float weight_decay = 52; + + // The rand_skip variable is for the data layer to skip a few data points + // to avoid all asynchronous sgd clients to start at the same point. The skip + // point would be set as rand_skip * rand(0,1). Note that rand_skip should not + // be larger than the number of keys in the database. + optional uint32 rand_skip = 53 [ default = 0 ]; + + // Fields related to detection (det_*) + // foreground (object) overlap threshold + optional float det_fg_threshold = 54 [ default = 0.5 ]; + // background (non-object) overlap threshold + optional float det_bg_threshold = 55 [ default = 0.5 ]; + // Fraction of batch that should be foreground objects + optional float det_fg_fraction = 56 [ default = 0.25 ]; + + // optional bool OBSOLETE_can_clobber = 57 [default = true]; + + // Amount of contextual padding to add around a window + // (used only by the window_data_layer) + optional uint32 det_context_pad = 58 [ default = 0 ]; + + // Mode for cropping out a detection window + // warp: cropped window is warped to a fixed size and aspect ratio + // square: the tightest square around the window is cropped + optional string det_crop_mode = 59 [ default = "warp" ]; + + // For ReshapeLayer, one needs to specify the new dimensions. 
+  optional int32 new_num = 60 [ default = 0 ];
+  optional int32 new_channels = 61 [ default = 0 ];
+  optional int32 new_height = 62 [ default = 0 ];
+  optional int32 new_width = 63 [ default = 0 ];
+
+  // Whether or not ImageLayer should shuffle the list of files at every epoch.
+  // It will also resize images if new_height or new_width are not zero.
+  optional bool shuffle_images = 64 [ default = false ];
+
+  // For ConcatLayer, one needs to specify the dimension for concatenation, and
+  // the other dimensions must be the same for all the bottom blobs.
+  // By default it will concatenate blobs along the channels dimension.
+  optional uint32 concat_dim = 65 [ default = 1 ];
+
+  optional HDF5OutputParameter hdf5_output_param = 1001;
+}
+
+message PReLUParameter {
+  // Parametric ReLU described in K. He et al, Delving Deep into Rectifiers:
+  // Surpassing Human-Level Performance on ImageNet Classification, 2015.
+
+  // Initial value of a_i. Default is a_i=0.25 for all i.
+  optional FillerParameter filler = 1;
+  // Whether or not slope parameters are shared across channels.
+  optional bool channel_shared = 2 [ default = false ];
+}
diff --git a/fluid/image_classification/caffe2fluid/proto/compile.sh b/fluid/image_classification/caffe2fluid/proto/compile.sh
new file mode 100644
index 0000000000..f621e0066d
--- /dev/null
+++ b/fluid/image_classification/caffe2fluid/proto/compile.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+#function:
+#   script used to generate caffepb.py from caffe.proto using protoc
+#
+
+PROTOC=`which protoc`
+if [[ -z $PROTOC ]];then
+    echo "protoc not found; please install it first (see https://github.com/google/protobuf/releases)"
+    exit 1
+fi
+
+WORK_ROOT=$(dirname `readlink -f "${BASH_SOURCE[0]}"`)
+PY_NAME="$WORK_ROOT/caffepb.py"
+$PROTOC --proto_path=$WORK_ROOT --python_out=$WORK_ROOT $WORK_ROOT/caffe.proto
+ret=$?
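+
+# protoc names its Python output after the .proto file (caffe_pb2.py), so
+# rename it to caffepb.py, the module this script is meant to produce.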
+
+if [ $ret -eq 0 ];then
+    mv $WORK_ROOT/caffe_pb2.py $PY_NAME
+fi
+
+if [ -e "$PY_NAME" ];then
+    echo "succeeded in generating [$PY_NAME]"
+    exit 0
+else
+    echo "failed to generate [$PY_NAME]"
+fi
+exit $ret
diff --git a/fluid/image_classification/mobilenet.py b/fluid/image_classification/mobilenet.py
new file mode 100644
index 0000000000..adfd6868f4
--- /dev/null
+++ b/fluid/image_classification/mobilenet.py
@@ -0,0 +1,226 @@
+import os
+
+import paddle.v2 as paddle
+import paddle.fluid as fluid
+from paddle.fluid.initializer import MSRA
+from paddle.fluid.param_attr import ParamAttr
+
+parameter_attr = ParamAttr(initializer=MSRA())
+
+
+def conv_bn_layer(input,
+                  filter_size,
+                  num_filters,
+                  stride,
+                  padding,
+                  channels=None,
+                  num_groups=1,
+                  act='relu',
+                  use_cudnn=True):
+    conv = fluid.layers.conv2d(
+        input=input,
+        num_filters=num_filters,
+        filter_size=filter_size,
+        stride=stride,
+        padding=padding,
+        groups=num_groups,
+        act=None,
+        use_cudnn=use_cudnn,
+        param_attr=parameter_attr,
+        bias_attr=False)
+    return fluid.layers.batch_norm(input=conv, act=act)
+
+
+def depthwise_separable(input, num_filters1, num_filters2, num_groups, stride,
+                        scale):
+    """
+    A depthwise separable block: a 3x3 depthwise convolution (one group per
+    input channel) followed by a 1x1 pointwise convolution.
+    """
+    depthwise_conv = conv_bn_layer(
+        input=input,
+        filter_size=3,
+        num_filters=int(num_filters1 * scale),
+        stride=stride,
+        padding=1,
+        num_groups=int(num_groups * scale),
+        use_cudnn=False)
+
+    pointwise_conv = conv_bn_layer(
+        input=depthwise_conv,
+        filter_size=1,
+        num_filters=int(num_filters2 * scale),
+        stride=1,
+        padding=0)
+    return pointwise_conv
+
+
+def mobile_net(img, class_dim, scale=1.0):
+
+    # conv1: 112x112
+    tmp = conv_bn_layer(
+        img,
+        filter_size=3,
+        channels=3,
+        num_filters=int(32 * scale),
+        stride=2,
+        padding=1)
+
+    # 56x56
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=32,
+        num_filters2=64,
+        num_groups=32,
+        stride=1,
+        scale=scale)
+
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=64,
+        num_filters2=128,
+        num_groups=64,
+        stride=2,
+        scale=scale)
+
+    # 28x28
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=128,
+        num_filters2=128,
+        num_groups=128,
+        stride=1,
+        scale=scale)
+
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=128,
+        num_filters2=256,
+        num_groups=128,
+        stride=2,
+        scale=scale)
+
+    # 14x14
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=256,
+        num_filters2=256,
+        num_groups=256,
+        stride=1,
+        scale=scale)
+
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=256,
+        num_filters2=512,
+        num_groups=256,
+        stride=2,
+        scale=scale)
+
+    # 14x14
+    for i in range(5):
+        tmp = depthwise_separable(
+            tmp,
+            num_filters1=512,
+            num_filters2=512,
+            num_groups=512,
+            stride=1,
+            scale=scale)
+    # 7x7
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=512,
+        num_filters2=1024,
+        num_groups=512,
+        stride=2,
+        scale=scale)
+
+    tmp = depthwise_separable(
+        tmp,
+        num_filters1=1024,
+        num_filters2=1024,
+        num_groups=1024,
+        stride=1,
+        scale=scale)
+
+    tmp = fluid.layers.pool2d(
+        input=tmp,
+        pool_size=0,
+        pool_stride=1,
+        pool_type='avg',
+        global_pooling=True)
+
+    tmp = fluid.layers.fc(input=tmp,
+                          size=class_dim,
+                          act='softmax',
+                          param_attr=parameter_attr)
+    return tmp
+
+
+def train(learning_rate, batch_size, num_passes, model_save_dir='model'):
+    class_dim = 102
+    image_shape = [3, 224, 224]
+
+    image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
+    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+
+    out = mobile_net(image, class_dim=class_dim)
+
+    cost = fluid.layers.cross_entropy(input=out, label=label)
+    avg_cost = 
fluid.layers.mean(x=cost)
+
+    optimizer = fluid.optimizer.Momentum(
+        learning_rate=learning_rate,
+        momentum=0.9,
+        regularization=fluid.regularizer.L2Decay(5 * 1e-5))
+    opts = optimizer.minimize(avg_cost)
+
+    b_size_var = fluid.layers.create_tensor(dtype='int64')
+    b_acc_var = fluid.layers.accuracy(input=out, label=label, total=b_size_var)
+
+    inference_program = fluid.default_main_program().clone()
+    with fluid.program_guard(inference_program):
+        inference_program = fluid.io.get_inference_program(
+            target_vars=[b_acc_var, b_size_var])
+
+    place = fluid.CPUPlace()
+    exe = fluid.Executor(place)
+    exe.run(fluid.default_startup_program())
+
+    train_reader = paddle.batch(
+        paddle.dataset.flowers.train(), batch_size=batch_size)
+    test_reader = paddle.batch(
+        paddle.dataset.flowers.test(), batch_size=batch_size)
+    feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
+
+    train_pass_acc_evaluator = fluid.average.WeightedAverage()
+    test_pass_acc_evaluator = fluid.average.WeightedAverage()
+    for pass_id in range(num_passes):
+        train_pass_acc_evaluator.reset()
+        for batch_id, data in enumerate(train_reader()):
+            loss, acc, size = exe.run(
+                fluid.default_main_program(),
+                feed=feeder.feed(data),
+                fetch_list=[avg_cost, b_acc_var, b_size_var])
+            train_pass_acc_evaluator.add(value=acc, weight=size)
+            print("Pass {0}, batch {1}, loss {2}, acc {3}".format(
+                pass_id, batch_id, loss[0], acc[0]))
+
+        test_pass_acc_evaluator.reset()
+        for data in test_reader():
+            loss, acc, size = exe.run(
+                inference_program,
+                feed=feeder.feed(data),
+                fetch_list=[avg_cost, b_acc_var, b_size_var])
+            test_pass_acc_evaluator.add(value=acc, weight=size)
+        print("End pass {0}, train_acc {1}, test_acc {2}".format(
+            pass_id,
+            train_pass_acc_evaluator.eval(), test_pass_acc_evaluator.eval()))
+        if pass_id % 10 == 0:
+            model_path = os.path.join(model_save_dir, str(pass_id))
+            print("save models to %s" % model_path)
+            fluid.io.save_inference_model(model_path, ['image'], [out], exe)
+
+
+if __name__ == '__main__':
+    train(learning_rate=0.005, batch_size=40, num_passes=300)
diff --git a/fluid/image_classification/se_resnext.py b/fluid/image_classification/se_resnext.py
index 99a62347da..c2b2d680fc 100644
--- a/fluid/image_classification/se_resnext.py
+++ b/fluid/image_classification/se_resnext.py
@@ -1,6 +1,6 @@
 import os
 import paddle.v2 as paddle
-import paddle.v2.fluid as fluid
+import paddle.fluid as fluid
 import reader
 
 
@@ -103,66 +103,87 @@ def train(learning_rate,
           batch_size,
           num_passes,
           init_model=None,
-          model_save_dir='model'):
+          model_save_dir='model',
+          parallel=True):
     class_dim = 1000
     image_shape = [3, 224, 224]
 
     image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
     label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-    out = SE_ResNeXt(input=image, class_dim=class_dim)
-
-    cost = fluid.layers.cross_entropy(input=out, label=label)
-    avg_cost = fluid.layers.mean(x=cost)
+    if parallel:
+        places = fluid.layers.get_places()
+        pd = fluid.layers.ParallelDo(places)
+
+        with pd.do():
+            image_ = pd.read_input(image)
+            label_ = pd.read_input(label)
+            out = SE_ResNeXt(input=image_, class_dim=class_dim)
+            cost = fluid.layers.cross_entropy(input=out, label=label_)
+            avg_cost = fluid.layers.mean(x=cost)
+            accuracy = fluid.layers.accuracy(input=out, label=label_)
+            pd.write_output(avg_cost)
+            pd.write_output(accuracy)
+
+        avg_cost, accuracy = pd()
+        avg_cost = fluid.layers.mean(x=avg_cost)
+        accuracy = fluid.layers.mean(x=accuracy)
+    else:
+        out = SE_ResNeXt(input=image, 
class_dim=class_dim)
+        cost = fluid.layers.cross_entropy(input=out, label=label)
+        avg_cost = fluid.layers.mean(x=cost)
+        accuracy = fluid.layers.accuracy(input=out, label=label)
 
     optimizer = fluid.optimizer.Momentum(
         learning_rate=learning_rate,
         momentum=0.9,
         regularization=fluid.regularizer.L2Decay(1e-4))
     opts = optimizer.minimize(avg_cost)
 
-    accuracy = fluid.evaluator.Accuracy(input=out, label=label)
 
     inference_program = fluid.default_main_program().clone()
     with fluid.program_guard(inference_program):
-        test_accuracy = fluid.evaluator.Accuracy(input=out, label=label)
-        test_target = [avg_cost] + test_accuracy.metrics + test_accuracy.states
-        inference_program = fluid.io.get_inference_program(test_target)
+        inference_program = fluid.io.get_inference_program([avg_cost, accuracy])
 
     place = fluid.CUDAPlace(0)
     exe = fluid.Executor(place)
     exe.run(fluid.default_startup_program())
 
     if init_model is not None:
-        fluid.io.load_persistables_if_exist(exe, init_model)
+        fluid.io.load_persistables(exe, init_model)
 
     train_reader = paddle.batch(reader.train(), batch_size=batch_size)
    test_reader = paddle.batch(reader.test(), batch_size=batch_size)
     feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
 
     for pass_id in range(num_passes):
-        accuracy.reset(exe)
         for batch_id, data in enumerate(train_reader()):
-            loss, acc = exe.run(fluid.default_main_program(),
-                                feed=feeder.feed(data),
-                                fetch_list=[avg_cost] + accuracy.metrics)
-            print("Pass {0}, batch {1}, loss {2}, acc {3}".format(
-                pass_id, batch_id, loss[0], acc[0]))
-        pass_acc = accuracy.eval(exe)
-
-        test_accuracy.reset(exe)
+            loss = exe.run(fluid.default_main_program(),
+                           feed=feeder.feed(data),
+                           fetch_list=[avg_cost])
+            print("Pass {0}, batch {1}, loss {2}".format(pass_id, batch_id,
+                                                         float(loss[0])))
+
+        total_loss = 0.0
+        total_acc = 0.0
+        total_batch = 0
         for data in test_reader():
             loss, acc = exe.run(inference_program,
                                 feed=feeder.feed(data),
-                                fetch_list=[avg_cost] + test_accuracy.metrics)
-            test_pass_acc = test_accuracy.eval(exe)
-            print("End pass {0}, train_acc {1}, test_acc {2}".format(
-                pass_id, pass_acc, test_pass_acc))
+                                fetch_list=[avg_cost, accuracy])
+            total_loss += float(loss)
+            total_acc += float(acc)
+            total_batch += 1
+        print("End pass {0}, test_loss {1}, test_acc {2}".format(
+            pass_id, total_loss / total_batch, total_acc / total_batch))
 
         model_path = os.path.join(model_save_dir, str(pass_id))
-        if not os.path.isdir(model_path):
-            os.makedirs(model_path)
-        fluid.io.save_persistables(exe, model_path)
+        fluid.io.save_inference_model(model_path, ['image'], [out], exe)
 
 
 if __name__ == '__main__':
-    train(learning_rate=0.1, batch_size=8, num_passes=100, init_model=None)
+    train(
+        learning_rate=0.1,
+        batch_size=8,
+        num_passes=100,
+        init_model=None,
+        parallel=False)
diff --git a/fluid/neural_machine_translation/README.md b/fluid/neural_machine_translation/README.md
new file mode 100644
index 0000000000..a0271ad42e
--- /dev/null
+++ b/fluid/neural_machine_translation/README.md
@@ -0,0 +1,9 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is the latest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
+This is a collection of example models for neural machine translation and neural sequence modeling.
+
+### TODO
+
+This project is still under active development.
diff --git a/fluid/neural_machine_translation/transformer/.gitignore b/fluid/neural_machine_translation/transformer/.gitignore
new file mode 100644
index 0000000000..0d20b6487c
--- /dev/null
+++ b/fluid/neural_machine_translation/transformer/.gitignore
@@ -0,0 +1 @@
+*.pyc
diff --git a/fluid/neural_machine_translation/transformer/README.md b/fluid/neural_machine_translation/transformer/README.md
new file mode 100644
index 0000000000..6fea167b5e
--- /dev/null
+++ b/fluid/neural_machine_translation/transformer/README.md
@@ -0,0 +1,23 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is the latest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
+# Attention is All You Need: A Paddle Fluid implementation
+
+This is a Paddle Fluid implementation of the Transformer model in [Attention is All You Need](https://arxiv.org/abs/1706.03762) (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017).
+
+If you use the dataset/code in your research, please cite the paper:
+
+```text
+@inproceedings{vaswani2017attention,
+  title={Attention is all you need},
+  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
+  booktitle={Advances in Neural Information Processing Systems},
+  pages={6000--6010},
+  year={2017}
+}
+```
+
+### TODO
+
+This project is still under active development.
diff --git a/fluid/neural_machine_translation/transformer/config.py b/fluid/neural_machine_translation/transformer/config.py
new file mode 100644
index 0000000000..71e4314953
--- /dev/null
+++ b/fluid/neural_machine_translation/transformer/config.py
@@ -0,0 +1,108 @@
+class TrainTaskConfig(object):
+    use_gpu = False
+    # the epoch number to train.
+    pass_num = 2
+
+    # the number of sequences contained in a mini-batch.
+    batch_size = 64
+
+    # the hyper parameters for Adam optimizer.
+    learning_rate = 0.001
+    beta1 = 0.9
+    beta2 = 0.98
+    eps = 1e-9
+
+    # the parameters for learning rate scheduling.
+    warmup_steps = 4000
+
+    # the directory for saving trained models.
+    model_dir = "trained_models"
+
+
+class InferTaskConfig(object):
+    use_gpu = False
+    # the number of examples in one run for sequence generation.
+    # currently the batch size can only be set to 1.
+    batch_size = 1
+
+    # the parameters for beam search.
+    beam_size = 5
+    max_length = 30
+    # the number of decoded sentences to output.
+    n_best = 1
+
+    # the directory for loading the trained model.
+    model_path = "trained_models/pass_1.infer.model"
+
+
+class ModelHyperParams(object):
+    # Dictionary size for source and target language. This model directly uses
+    # paddle.dataset.wmt16 in which <bos>, <eos> and <unk> tokens have already
+    # been added, but the <pad> token is not. Transformer requires sequences in
+    # a mini-batch to be padded to the same length, so a <pad> token is added
+    # into the original dictionary in paddle.dataset.wmt16.
+
+    # size of source word dictionary.
+    src_vocab_size = 10000
+    # index for <pad> token in source language.
+    src_pad_idx = src_vocab_size
+
+    # size of target word dictionary.
+    trg_vocab_size = 10000
+    # index for <pad> token in target language.
+    trg_pad_idx = trg_vocab_size
+
+    # index for <bos> token
+    bos_idx = 0
+    # index for <eos> token
+    eos_idx = 1
+
+    # position value corresponding to the <pad> token.
+    pos_pad_idx = 0
+
+    # max length of sequences. It should be increased by 1 to count in the
+    # position padding token used by position encoding.
+    max_length = 50
+
+    # the dimension for word embeddings, which is also the last dimension of
+    # the input and output of multi-head attention, position-wise feed-forward
+    # networks, encoder and decoder.
+    d_model = 512
+    # size of the hidden layer in position-wise feed-forward networks.
+    d_inner_hid = 1024
+    # the dimension that keys are projected to for dot-product attention.
+    d_key = 64
+    # the dimension that values are projected to for dot-product attention.
+    d_value = 64
+    # number of heads used in multi-head attention.
+    n_head = 8
+    # number of sub-layers to be stacked in the encoder and decoder.
+    n_layer = 6
+    # dropout rate used by all dropout layers.
+    dropout = 0.1
+
+
+# Names of position encoding tables which will be initialized externally.
+pos_enc_param_names = (
+    "src_pos_enc_table",
+    "trg_pos_enc_table", )
+
+# Names of all data layers in encoder listed in order.
+encoder_input_data_names = (
+    "src_word",
+    "src_pos",
+    "src_slf_attn_bias", )
+
+# Names of all data layers in decoder listed in order.
+decoder_input_data_names = (
+    "trg_word",
+    "trg_pos",
+    "trg_slf_attn_bias",
+    "trg_src_attn_bias",
+    "enc_output", )
+
+# Names of label related data layers listed in order.
+label_data_names = (
+    "lbl_word",
+    "lbl_weight", )
diff --git a/fluid/neural_machine_translation/transformer/infer.py b/fluid/neural_machine_translation/transformer/infer.py
new file mode 100644
index 0000000000..e4dee220ce
--- /dev/null
+++ b/fluid/neural_machine_translation/transformer/infer.py
@@ -0,0 +1,234 @@
+import numpy as np
+
+import paddle.v2 as paddle
+import paddle.fluid as fluid
+
+import model
+from model import wrap_encoder as encoder
+from model import wrap_decoder as decoder
+from config import InferTaskConfig, ModelHyperParams, \
+    encoder_input_data_names, decoder_input_data_names
+from train import pad_batch_data
+
+
+def translate_batch(exe, src_words, encoder, enc_in_names, enc_out_names,
+                    decoder, dec_in_names, dec_out_names, beam_size, max_length,
+                    n_best, batch_size, n_head, src_pad_idx, trg_pad_idx,
+                    bos_idx, eos_idx):
+    """
+    Run the encoder program once and run the decoder program multiple times to
+    implement beam search externally.
+    """
+    # Prepare data for encoder and run the encoder.
+    enc_in_data = pad_batch_data(
+        src_words,
+        src_pad_idx,
+        n_head,
+        is_target=False,
+        return_pos=True,
+        return_attn_bias=True,
+        return_max_len=True)
+    enc_output = exe.run(encoder,
+                         feed=dict(zip(enc_in_names, enc_in_data)),
+                         fetch_list=enc_out_names)[0]
+
+    # Beam Search.
+    # To store the beam info. Use a separate list per instance to avoid
+    # aliasing between instances.
+    scores = np.zeros((batch_size, beam_size), dtype="float32")
+    prev_branchs = [[] for _ in range(batch_size)]
+    next_ids = [[] for _ in range(batch_size)]
+    # Use beam_map to map the instance idx in batch to beam idx, since the
+    # size of the fed batch is changing.
+    beam_map = range(batch_size)
+
+    def beam_backtrace(prev_branchs, next_ids, n_best=beam_size, add_bos=True):
+        """
+        Decode and select n_best sequences for one instance by backtrace.
+        """
+        seqs = []
+        for i in range(n_best):
+            k = i
+            seq = []
+            for j in range(len(prev_branchs) - 1, -1, -1):
+                seq.append(next_ids[j][k])
+                k = prev_branchs[j][k]
+            seq = seq[::-1]
+            seq = [bos_idx] + seq if add_bos else seq
+            seqs.append(seq)
+        return seqs
+
+    def init_dec_in_data(batch_size, beam_size, enc_in_data, enc_output):
+        """
+        Initialize the input data for decoder.
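+
+        Each source instance is expanded into beam_size decoder instances:
+        the target words all start with <bos>, and the encoder output and the
+        source attention bias are tiled beam_size times along the batch axis.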
+ """ + trg_words = np.array( + [[bos_idx]] * batch_size * beam_size, dtype="int64") + trg_pos = np.array([[1]] * batch_size * beam_size, dtype="int64") + src_max_length, src_slf_attn_bias, trg_max_len = enc_in_data[ + -1], enc_in_data[-2], 1 + # This is used to remove attention on subsequent words. + trg_slf_attn_bias = np.ones((batch_size * beam_size, trg_max_len, + trg_max_len)) + trg_slf_attn_bias = np.triu(trg_slf_attn_bias, 1).reshape( + [-1, 1, trg_max_len, trg_max_len]) + trg_slf_attn_bias = (np.tile(trg_slf_attn_bias, [1, n_head, 1, 1]) * + [-1e9]).astype("float32") + # This is used to remove attention on the paddings of source sequences. + trg_src_attn_bias = np.tile( + src_slf_attn_bias[:, :, ::src_max_length, :], + [beam_size, 1, trg_max_len, 1]) + enc_output = np.tile(enc_output, [beam_size, 1, 1]) + return trg_words, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output + + def update_dec_in_data(dec_in_data, next_ids, active_beams): + """ + Update the input data of decoder mainly by slicing from the previous + input data and dropping the finished instance beams. + """ + trg_words, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = dec_in_data + trg_cur_len = len(next_ids[0]) + 1 # include the + trg_words = np.array( + [ + beam_backtrace( + prev_branchs[beam_idx], next_ids[beam_idx], add_bos=True) + for beam_idx in active_beams + ], + dtype="int64") + trg_words = trg_words.reshape([-1, 1]) + trg_pos = np.array( + [range(1, trg_cur_len + 1)] * len(active_beams) * beam_size, + dtype="int64").reshape([-1, 1]) + active_beams_indice = ( + (np.array(active_beams) * beam_size)[:, np.newaxis] + + np.array(range(beam_size))[np.newaxis, :]).flatten() + # This is used to remove attention on subsequent words. + trg_slf_attn_bias = np.ones((len(active_beams) * beam_size, trg_cur_len, + trg_cur_len)) + trg_slf_attn_bias = np.triu(trg_slf_attn_bias, 1).reshape( + [-1, 1, trg_cur_len, trg_cur_len]) + trg_slf_attn_bias = (np.tile(trg_slf_attn_bias, [1, n_head, 1, 1]) * + [-1e9]).astype("float32") + # This is used to remove attention on the paddings of source sequences. 
+        trg_src_attn_bias = np.tile(trg_src_attn_bias[
+            active_beams_indice, :, ::trg_src_attn_bias.shape[2], :],
+            [1, 1, trg_cur_len, 1])
+        enc_output = enc_output[active_beams_indice, :, :]
+        return trg_words, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output

+    dec_in_data = init_dec_in_data(batch_size, beam_size, enc_in_data,
+                                   enc_output)
+    for i in range(max_length):
+        predict_all = exe.run(decoder,
+                              feed=dict(zip(dec_in_names, dec_in_data)),
+                              fetch_list=dec_out_names)[0]
+        predict_all = np.log(
+            predict_all.reshape([len(beam_map) * beam_size, i + 1, -1])[:,
+                                                                        -1, :])
+        predict_all = (predict_all + scores[beam_map].reshape(
+            [len(beam_map) * beam_size, -1])).reshape(
+                [len(beam_map), beam_size, -1])
+        active_beams = []
+        for inst_idx, beam_idx in enumerate(beam_map):
+            predict = (predict_all[inst_idx, :, :]
+                       if i != 0 else predict_all[inst_idx, 0, :]).flatten()
+            top_k_indice = np.argpartition(predict, -beam_size)[-beam_size:]
+            top_scores_ids = top_k_indice[np.argsort(predict[top_k_indice])[::
+                                                                            -1]]
+            top_scores = predict[top_scores_ids]
+            scores[beam_idx] = top_scores
+            prev_branchs[beam_idx].append(top_scores_ids //
+                                          predict_all.shape[-1])
+            next_ids[beam_idx].append(top_scores_ids % predict_all.shape[-1])
+            if next_ids[beam_idx][-1][0] != eos_idx:
+                active_beams.append(beam_idx)
+        beam_map = active_beams
+        if len(beam_map) == 0:
+            break
+        dec_in_data = update_dec_in_data(dec_in_data, next_ids, active_beams)
+
+    # Decode beams and select n_best sequences for each instance by backtrace.
+    seqs = [
+        beam_backtrace(prev_branchs[beam_idx], next_ids[beam_idx], n_best)
+        for beam_idx in range(batch_size)
+    ]
+
+    return seqs, scores[:, :n_best].tolist()
+
+
+def main():
+    place = fluid.CUDAPlace(0) if InferTaskConfig.use_gpu else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+    # The current program desc is coupled with batch_size and the only
+    # supported batch size is 1 currently.
+    encoder_program = fluid.Program()
+    model.batch_size = InferTaskConfig.batch_size
+    with fluid.program_guard(main_program=encoder_program):
+        enc_output = encoder(
+            ModelHyperParams.src_vocab_size + 1,
+            ModelHyperParams.max_length + 1, ModelHyperParams.n_layer,
+            ModelHyperParams.n_head, ModelHyperParams.d_key,
+            ModelHyperParams.d_value, ModelHyperParams.d_model,
+            ModelHyperParams.d_inner_hid, ModelHyperParams.dropout,
+            ModelHyperParams.src_pad_idx, ModelHyperParams.pos_pad_idx)
+
+    model.batch_size = InferTaskConfig.batch_size * InferTaskConfig.beam_size
+    decoder_program = fluid.Program()
+    with fluid.program_guard(main_program=decoder_program):
+        predict = decoder(
+            ModelHyperParams.trg_vocab_size + 1,
+            ModelHyperParams.max_length + 1, ModelHyperParams.n_layer,
+            ModelHyperParams.n_head, ModelHyperParams.d_key,
+            ModelHyperParams.d_value, ModelHyperParams.d_model,
+            ModelHyperParams.d_inner_hid, ModelHyperParams.dropout,
+            ModelHyperParams.trg_pad_idx, ModelHyperParams.pos_pad_idx)
+
+    # Load model parameters of encoder and decoder separately from the saved
+    # transformer model.
+    encoder_var_names = []
+    for op in encoder_program.block(0).ops:
+        encoder_var_names += op.input_arg_names
+    encoder_param_names = filter(
+        lambda var_name: isinstance(encoder_program.block(0).var(var_name),
+                                    fluid.framework.Parameter),
+        encoder_var_names)
+    encoder_params = map(encoder_program.block(0).var, encoder_param_names)
+    decoder_var_names = []
+    for op in decoder_program.block(0).ops:
+        decoder_var_names += op.input_arg_names
+    decoder_param_names = filter(
+        lambda var_name: isinstance(decoder_program.block(0).var(var_name),
+                                    fluid.framework.Parameter),
+        decoder_var_names)
+    decoder_params = map(decoder_program.block(0).var, decoder_param_names)
+    fluid.io.load_vars(exe, InferTaskConfig.model_path, vars=encoder_params)
+    fluid.io.load_vars(exe, InferTaskConfig.model_path, vars=decoder_params)
+
+    # This is used here to set dropout to the test mode.
+    encoder_program = fluid.io.get_inference_program(
+        target_vars=[enc_output], main_program=encoder_program)
+    decoder_program = fluid.io.get_inference_program(
+        target_vars=[predict], main_program=decoder_program)
+
+    test_data = paddle.batch(
+        paddle.dataset.wmt16.test(ModelHyperParams.src_vocab_size,
+                                  ModelHyperParams.trg_vocab_size),
+        batch_size=InferTaskConfig.batch_size)
+
+    trg_idx2word = paddle.dataset.wmt16.get_dict(
+        "de", dict_size=ModelHyperParams.trg_vocab_size, reverse=True)
+
+    for batch_id, data in enumerate(test_data()):
+        batch_seqs, batch_scores = translate_batch(
+            exe, [item[0] for item in data], encoder_program,
+            encoder_input_data_names, [enc_output.name], decoder_program,
+            decoder_input_data_names, [predict.name], InferTaskConfig.beam_size,
+            InferTaskConfig.max_length, InferTaskConfig.n_best,
+            len(data), ModelHyperParams.n_head, ModelHyperParams.src_pad_idx,
+            ModelHyperParams.trg_pad_idx, ModelHyperParams.bos_idx,
+            ModelHyperParams.eos_idx)
+        for i in range(len(batch_seqs)):
+            seqs = batch_seqs[i]
+            scores = batch_scores[i]
+            for seq in seqs:
+                print(" ".join([trg_idx2word[idx] for idx in seq]))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/fluid/neural_machine_translation/transformer/model.py b/fluid/neural_machine_translation/transformer/model.py
new file mode 100644
index 0000000000..ba5ba44707
--- /dev/null
+++ b/fluid/neural_machine_translation/transformer/model.py
@@ -0,0 +1,595 @@
+from functools import partial
+import numpy as np
+
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+
+from config import TrainTaskConfig, pos_enc_param_names, \
+    encoder_input_data_names, decoder_input_data_names, label_data_names
+
+# FIXME(guosheng): Remove out the batch_size from the model.
+batch_size = TrainTaskConfig.batch_size
+
+
+def position_encoding_init(n_position, d_pos_vec):
+    """
+    Generate the initial values for the sinusoid position encoding table.
+    """
+    position_enc = np.array([[
+        pos / np.power(10000, 2 * (j // 2) / d_pos_vec)
+        for j in range(d_pos_vec)
+    ] if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
+    position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2])  # dim 2i
+    position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2])  # dim 2i+1
+    return position_enc.astype("float32")
+
+
+def multi_head_attention(queries,
+                         keys,
+                         values,
+                         attn_bias,
+                         d_key,
+                         d_value,
+                         d_model,
+                         n_head=1,
+                         dropout_rate=0.):
+    """
+    Multi-Head Attention. Note that attn_bias is added to the logits before
+    computing the softmax activation, to mask certain selected positions so
+    that they will not be considered in attention weights.
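+
+    For reference, the per-head computation implemented below is
+        Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_model) + attn_bias) * V
+    (note that this implementation scales by d_model rather than by d_key).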
+ """ + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + param_attr=fluid.initializer.Xavier( + uniform=False, + fan_in=d_model * d_key, + fan_out=n_head * d_key), + bias_attr=False, + num_flatten_dims=2) + k = layers.fc(input=keys, + size=d_key * n_head, + param_attr=fluid.initializer.Xavier( + uniform=False, + fan_in=d_model * d_key, + fan_out=n_head * d_key), + bias_attr=False, + num_flatten_dims=2) + v = layers.fc(input=values, + size=d_value * n_head, + param_attr=fluid.initializer.Xavier( + uniform=False, + fan_in=d_model * d_value, + fan_out=n_head * d_value), + bias_attr=False, + num_flatten_dims=2) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + if n_head == 1: + return x + + hidden_size = x.shape[-1] + # FIXME(guosheng): Decouple the program desc with batch_size. + reshaped = layers.reshape( + x=x, shape=[batch_size, -1, n_head, hidden_size // n_head]) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # FIXME(guosheng): Decouple the program desc with batch_size. + return layers.reshape( + x=trans_x, + shape=map(int, + [batch_size, -1, trans_x.shape[2] * trans_x.shape[3]])) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_model, dropout_rate): + """ + Scaled Dot-Product Attention + """ + + # FIXME(guosheng): Optimize the shape in reshape_op or softmax_op. + + # The current implementation of softmax_op only supports 2D tensor, + # consequently it cannot be directly used here. + # If to use the reshape_op, Besides, the shape of product inferred in + # compile-time is not the actual shape in run-time. It cann't be used + # to set the attribute of reshape_op. + # So, here define the softmax for temporary solution. 
+
+        def __softmax(x, eps=1e-9):
+            exp_out = layers.exp(x=x)
+            sum_out = layers.reduce_sum(exp_out, dim=-1, keep_dim=False)
+            return layers.elementwise_div(x=exp_out, y=sum_out, axis=0)
+
+        scaled_q = layers.scale(x=q, scale=d_model**-0.5)
+        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
+        weights = __softmax(
+            layers.elementwise_add(
+                x=product, y=attn_bias) if attn_bias else product)
+        if dropout_rate:
+            weights = layers.dropout(
+                weights, dropout_prob=dropout_rate, is_test=False)
+        out = layers.matmul(weights, v)
+        return out
+
+    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
+
+    q = __split_heads(q, n_head)
+    k = __split_heads(k, n_head)
+    v = __split_heads(v, n_head)
+
+    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_model,
+                                                  dropout_rate)
+
+    out = __combine_heads(ctx_multiheads)
+
+    # Project back to the model size.
+    proj_out = layers.fc(input=out,
+                         size=d_model,
+                         param_attr=fluid.initializer.Xavier(uniform=False),
+                         bias_attr=False,
+                         num_flatten_dims=2)
+    return proj_out
+
+
+def positionwise_feed_forward(x, d_inner_hid, d_hid):
+    """
+    Position-wise Feed-Forward Networks.
+    This module consists of two linear transformations with a ReLU activation
+    in between, which is applied to each position separately and identically.
+    """
+    hidden = layers.fc(input=x,
+                       size=d_inner_hid,
+                       num_flatten_dims=2,
+                       param_attr=fluid.initializer.Uniform(
+                           low=-(d_hid**-0.5), high=(d_hid**-0.5)),
+                       act="relu")
+    out = layers.fc(input=hidden,
+                    size=d_hid,
+                    num_flatten_dims=2,
+                    param_attr=fluid.initializer.Uniform(
+                        low=-(d_inner_hid**-0.5), high=(d_inner_hid**-0.5)))
+    return out
+
+
+def pre_post_process_layer(prev_out, out, process_cmd, dropout=0.):
+    """
+    Optionally add residual connection, layer normalization and dropout to the
+    out tensor, according to the value of process_cmd.
+
+    This will be used before or after multi-head attention and position-wise
+    feed-forward networks.
+    """
+    for cmd in process_cmd:
+        if cmd == "a":  # add residual connection
+            out = out + prev_out if prev_out else out
+        elif cmd == "n":  # add layer normalization
+            out = layers.layer_norm(
+                out,
+                begin_norm_axis=len(out.shape) - 1,
+                param_attr=fluid.initializer.Constant(1.),
+                bias_attr=fluid.initializer.Constant(0.))
+        elif cmd == "d":  # add dropout
+            if dropout:
+                out = layers.dropout(out, dropout_prob=dropout, is_test=False)
+    return out
+
+
+pre_process_layer = partial(pre_post_process_layer, None)
+post_process_layer = pre_post_process_layer
+
+
+def prepare_encoder(src_word,
+                    src_pos,
+                    src_vocab_size,
+                    src_emb_dim,
+                    src_pad_idx,
+                    src_max_len,
+                    dropout=0.,
+                    pos_pad_idx=0,
+                    pos_enc_param_name=None):
+    """Add word embeddings and position encodings.
+    The output tensor has a shape of:
+    [batch_size, max_src_length_in_batch, d_model].
+
+    This module is used at the bottom of the encoder stacks.
+    """
+    src_word_emb = layers.embedding(
+        src_word,
+        size=[src_vocab_size, src_emb_dim],
+        padding_idx=src_pad_idx,
+        param_attr=fluid.initializer.Normal(0., 1.))
+    src_pos_enc = layers.embedding(
+        src_pos,
+        size=[src_max_len, src_emb_dim],
+        padding_idx=pos_pad_idx,
+        param_attr=fluid.ParamAttr(
+            name=pos_enc_param_name, trainable=False))
+    enc_input = src_word_emb + src_pos_enc
+
+    # FIXME(guosheng): Decouple the program desc with batch_size.
+    enc_input = layers.reshape(x=enc_input, shape=[batch_size, -1, src_emb_dim])
+    return layers.dropout(
+        enc_input, dropout_prob=dropout,
+        is_test=False) if dropout else enc_input
+
+
+prepare_encoder = partial(
+    prepare_encoder, pos_enc_param_name=pos_enc_param_names[0])
+prepare_decoder = partial(
+    prepare_encoder, pos_enc_param_name=pos_enc_param_names[1])
+
+
+def encoder_layer(enc_input,
+                  attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  dropout_rate=0.):
+    """The encoder layers that can be stacked to form a deep encoder.
+
+    This module consists of a multi-head (self) attention sub-layer followed
+    by position-wise feed-forward networks; both components are accompanied
+    by post_process_layer to add residual connection, layer normalization and
+    dropout.
+    """
+    attn_output = multi_head_attention(enc_input, enc_input, enc_input,
+                                       attn_bias, d_key, d_value, d_model,
+                                       n_head, dropout_rate)
+    attn_output = post_process_layer(enc_input, attn_output, "dan",
+                                     dropout_rate)
+    ffd_output = positionwise_feed_forward(attn_output, d_inner_hid, d_model)
+    return post_process_layer(attn_output, ffd_output, "dan", dropout_rate)
+
+
+def encoder(enc_input,
+            attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            dropout_rate=0.):
+    """
+    The encoder is composed of a stack of identical layers returned by calling
+    encoder_layer.
+    """
+    for i in range(n_layer):
+        enc_output = encoder_layer(
+            enc_input,
+            attn_bias,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            dropout_rate, )
+        enc_input = enc_output
+    return enc_output
+
+
+def decoder_layer(dec_input,
+                  enc_output,
+                  slf_attn_bias,
+                  dec_enc_attn_bias,
+                  n_head,
+                  d_key,
+                  d_value,
+                  d_model,
+                  d_inner_hid,
+                  dropout_rate=0.):
+    """ The layer to be stacked in the decoder part.
+
+    The structure of this module is similar to that in the encoder part,
+    except that an extra multi-head attention is added to implement
+    encoder-decoder attention.
+    """
+    slf_attn_output = multi_head_attention(
+        dec_input,
+        dec_input,
+        dec_input,
+        slf_attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        dropout_rate, )
+    slf_attn_output = post_process_layer(
+        dec_input,
+        slf_attn_output,
+        "dan",  # residual connection + dropout + layer normalization
+        dropout_rate, )
+    enc_attn_output = multi_head_attention(
+        slf_attn_output,
+        enc_output,
+        enc_output,
+        dec_enc_attn_bias,
+        d_key,
+        d_value,
+        d_model,
+        n_head,
+        dropout_rate, )
+    enc_attn_output = post_process_layer(
+        slf_attn_output,
+        enc_attn_output,
+        "dan",  # residual connection + dropout + layer normalization
+        dropout_rate, )
+    ffd_output = positionwise_feed_forward(
+        enc_attn_output,
+        d_inner_hid,
+        d_model, )
+    dec_output = post_process_layer(
+        enc_attn_output,
+        ffd_output,
+        "dan",  # residual connection + dropout + layer normalization
+        dropout_rate, )
+    return dec_output
+
+
+def decoder(dec_input,
+            enc_output,
+            dec_slf_attn_bias,
+            dec_enc_attn_bias,
+            n_layer,
+            n_head,
+            d_key,
+            d_value,
+            d_model,
+            d_inner_hid,
+            dropout_rate=0.):
+    """
+    The decoder is composed of a stack of identical decoder_layer layers.
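+    Each layer applies masked self-attention, encoder-decoder attention and a
+    position-wise feed-forward network, each wrapped by post_process_layer
+    with the "dan" commands (dropout, residual addition, layer normalization).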
+ """ + for i in range(n_layer): + dec_output = decoder_layer( + dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, ) + dec_input = dec_output + return dec_output + + +def make_inputs(input_data_names, + n_head, + d_model, + batch_size, + max_length, + is_pos, + slf_attn_bias_flag, + src_attn_bias_flag, + enc_output_flag=False): + """ + Define the input data layers for the transformer model. + """ + input_layers = [] + # The shapes here act as placeholder. + # The shapes set here is to pass the infer-shape in compile time. + word = layers.data( + name=input_data_names[len(input_layers)], + shape=[batch_size * max_length, 1], + dtype="int64", + append_batch_size=False) + input_layers += [word] + # This is used for position data or label weight. + pos = layers.data( + name=input_data_names[len(input_layers)], + shape=[batch_size * max_length, 1], + dtype="int64" if is_pos else "float32", + append_batch_size=False) + input_layers += [pos] + if slf_attn_bias_flag: + # This input is used to remove attention weights on paddings for the + # encoder and to remove attention weights on subsequent words for the + # decoder. + slf_attn_bias = layers.data( + name=input_data_names[len(input_layers)], + shape=[batch_size, n_head, max_length, max_length], + dtype="float32", + append_batch_size=False) + input_layers += [slf_attn_bias] + if src_attn_bias_flag: + # This input is used to remove attention weights on paddings. + src_attn_bias = layers.data( + name=input_data_names[len(input_layers)], + shape=[batch_size, n_head, max_length, max_length], + dtype="float32", + append_batch_size=False) + input_layers += [src_attn_bias] + if enc_output_flag: + enc_output = layers.data( + name=input_data_names[len(input_layers)], + shape=[batch_size, max_length, d_model], + dtype="float32", + append_batch_size=False) + input_layers += [enc_output] + return input_layers + + +def transformer( + src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, + src_pad_idx, + trg_pad_idx, + pos_pad_idx, ): + enc_input_layers = make_inputs(encoder_input_data_names, n_head, d_model, + batch_size, max_length, True, True, False) + + enc_output = wrap_encoder( + src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, + src_pad_idx, + pos_pad_idx, + enc_input_layers, ) + + dec_input_layers = make_inputs(decoder_input_data_names, n_head, d_model, + batch_size, max_length, True, True, True) + + predict = wrap_decoder( + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, + trg_pad_idx, + pos_pad_idx, + dec_input_layers, + enc_output, ) + + # Padding index do not contribute to the total loss. The weights is used to + # cancel padding index in calculating the loss. + gold, weights = make_inputs(label_data_names, n_head, d_model, batch_size, + max_length, False, False, False) + cost = layers.cross_entropy(input=predict, label=gold) + weighted_cost = cost * weights + return layers.reduce_sum(weighted_cost), predict + + +def wrap_encoder(src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, + src_pad_idx, + pos_pad_idx, + enc_input_layers=None): + """ + The wrapper assembles together all needed layers for the encoder. 
+ """ + if enc_input_layers is None: + # This is used to implement independent encoder program in inference. + src_word, src_pos, src_slf_attn_bias = make_inputs( + encoder_input_data_names, n_head, d_model, batch_size, max_length, + True, True, False) + else: + src_word, src_pos, src_slf_attn_bias = enc_input_layers + enc_input = prepare_encoder( + src_word, + src_pos, + src_vocab_size, + d_model, + src_pad_idx, + max_length, + dropout_rate, ) + enc_output = encoder( + enc_input, + src_slf_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, ) + return enc_output + + +def wrap_decoder(trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, + trg_pad_idx, + pos_pad_idx, + dec_input_layers=None, + enc_output=None): + """ + The wrapper assembles together all needed layers for the decoder. + """ + if dec_input_layers is None: + # This is used to implement independent decoder program in inference. + trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = make_inputs( + decoder_input_data_names, n_head, d_model, batch_size, max_length, + True, True, True, True) + else: + trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_input_layers + + dec_input = prepare_decoder( + trg_word, + trg_pos, + trg_vocab_size, + d_model, + trg_pad_idx, + max_length, + dropout_rate, ) + dec_output = decoder( + dec_input, + enc_output, + trg_slf_attn_bias, + trg_src_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + dropout_rate, ) + + predict = layers.reshape( + x=layers.fc(input=dec_output, + size=trg_vocab_size, + bias_attr=False, + num_flatten_dims=2), + shape=[-1, trg_vocab_size], + act="softmax") + return predict diff --git a/fluid/neural_machine_translation/transformer/optim.py b/fluid/neural_machine_translation/transformer/optim.py new file mode 100644 index 0000000000..9905e6594a --- /dev/null +++ b/fluid/neural_machine_translation/transformer/optim.py @@ -0,0 +1,40 @@ +import numpy as np + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +class LearningRateScheduler(object): + """ + Wrapper for learning rate scheduling as described in the Transformer paper. + LearningRateScheduler adapts the learning rate externally and the adapted + learning rate will be feeded into the main_program as input data. 
+ """ + + def __init__(self, + d_model, + warmup_steps, + place, + learning_rate=0.001, + current_steps=0, + name="learning_rate"): + self.current_steps = current_steps + self.warmup_steps = warmup_steps + self.d_model = d_model + self.learning_rate = layers.create_global_var( + name=name, + shape=[1], + value=float(learning_rate), + dtype="float32", + persistable=True) + self.place = place + + def update_learning_rate(self, data_input): + self.current_steps += 1 + lr_value = np.power(self.d_model, -0.5) * np.min([ + np.power(self.current_steps, -0.5), + np.power(self.warmup_steps, -1.5) * self.current_steps + ]) + lr_tensor = fluid.LoDTensor() + lr_tensor.set(np.array([lr_value], dtype="float32"), self.place) + data_input[self.learning_rate.name] = lr_tensor diff --git a/fluid/neural_machine_translation/transformer/train.py b/fluid/neural_machine_translation/transformer/train.py new file mode 100644 index 0000000000..65de8ef7fa --- /dev/null +++ b/fluid/neural_machine_translation/transformer/train.py @@ -0,0 +1,174 @@ +import os +import numpy as np + +import paddle.v2 as paddle +import paddle.fluid as fluid + +from model import transformer, position_encoding_init +from optim import LearningRateScheduler +from config import TrainTaskConfig, ModelHyperParams, pos_enc_param_names, \ + encoder_input_data_names, decoder_input_data_names, label_data_names + + +def pad_batch_data(insts, + pad_idx, + n_head, + is_target=False, + return_pos=True, + return_attn_bias=True, + return_max_len=True): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + inst_data = np.array( + [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) + return_list += [inst_data.astype("int64").reshape([-1, 1])] + if return_pos: + inst_pos = np.array([[ + pos_i + 1 if w_i != pad_idx else 0 for pos_i, w_i in enumerate(inst) + ] for inst in inst_data]) + + return_list += [inst_pos.astype("int64").reshape([-1, 1])] + if return_attn_bias: + if is_target: + # This is used to avoid attention on paddings and subsequent + # words. + slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len)) + slf_attn_bias_data = np.triu(slf_attn_bias_data, 1).reshape( + [-1, 1, max_len, max_len]) + slf_attn_bias_data = np.tile(slf_attn_bias_data, + [1, n_head, 1, 1]) * [-1e9] + else: + # This is used to avoid attention on paddings. + slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * + (max_len - len(inst)) + for inst in insts]) + slf_attn_bias_data = np.tile( + slf_attn_bias_data.reshape([-1, 1, 1, max_len]), + [1, n_head, max_len, 1]) + return_list += [slf_attn_bias_data.astype("float32")] + if return_max_len: + return_list += [max_len] + return return_list if len(return_list) > 1 else return_list[0] + + +def prepare_batch_input(insts, input_data_names, src_pad_idx, trg_pad_idx, + max_length, n_head): + """ + Put all padded data needed by training into a dict. 
+ """ + src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( + [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) + trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data( + [inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True) + trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], + [1, 1, trg_max_len, 1]).astype("float32") + lbl_word = pad_batch_data([inst[2] for inst in insts], trg_pad_idx, n_head, + False, False, False, False) + lbl_weight = (lbl_word != trg_pad_idx).astype("float32").reshape([-1, 1]) + input_dict = dict( + zip(input_data_names, [ + src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, + trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight + ])) + return input_dict + + +def main(): + place = fluid.CUDAPlace(0) if TrainTaskConfig.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + + cost, predict = transformer( + ModelHyperParams.src_vocab_size + 1, + ModelHyperParams.trg_vocab_size + 1, ModelHyperParams.max_length + 1, + ModelHyperParams.n_layer, ModelHyperParams.n_head, + ModelHyperParams.d_key, ModelHyperParams.d_value, + ModelHyperParams.d_model, ModelHyperParams.d_inner_hid, + ModelHyperParams.dropout, ModelHyperParams.src_pad_idx, + ModelHyperParams.trg_pad_idx, ModelHyperParams.pos_pad_idx) + + lr_scheduler = LearningRateScheduler(ModelHyperParams.d_model, + TrainTaskConfig.warmup_steps, place, + TrainTaskConfig.learning_rate) + optimizer = fluid.optimizer.Adam( + learning_rate=lr_scheduler.learning_rate, + beta1=TrainTaskConfig.beta1, + beta2=TrainTaskConfig.beta2, + epsilon=TrainTaskConfig.eps) + optimizer.minimize(cost) + + train_data = paddle.batch( + paddle.reader.shuffle( + paddle.dataset.wmt16.train(ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size), + buf_size=100000), + batch_size=TrainTaskConfig.batch_size) + + # Program to do validation. + test_program = fluid.default_main_program().clone() + with fluid.program_guard(test_program): + test_program = fluid.io.get_inference_program([cost]) + val_data = paddle.batch( + paddle.dataset.wmt16.validation(ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size), + batch_size=TrainTaskConfig.batch_size) + + def test(exe): + test_costs = [] + for batch_id, data in enumerate(val_data()): + if len(data) != TrainTaskConfig.batch_size: + continue + data_input = prepare_batch_input( + data, encoder_input_data_names + decoder_input_data_names[:-1] + + label_data_names, ModelHyperParams.src_pad_idx, + ModelHyperParams.trg_pad_idx, ModelHyperParams.max_length, + ModelHyperParams.n_head) + test_cost = exe.run(test_program, + feed=data_input, + fetch_list=[cost])[0] + test_costs.append(test_cost) + return np.mean(test_costs) + + # Initialize the parameters. + exe.run(fluid.framework.default_startup_program()) + for pos_enc_param_name in pos_enc_param_names: + pos_enc_param = fluid.global_scope().find_var( + pos_enc_param_name).get_tensor() + pos_enc_param.set( + position_encoding_init(ModelHyperParams.max_length + 1, + ModelHyperParams.d_model), place) + + for pass_id in xrange(TrainTaskConfig.pass_num): + for batch_id, data in enumerate(train_data()): + # The current program desc is coupled with batch_size, thus all + # mini-batches must have the same number of instances currently. 
+            if len(data) != TrainTaskConfig.batch_size:
+                continue
+            data_input = prepare_batch_input(
+                data, encoder_input_data_names + decoder_input_data_names[:-1]
+                + label_data_names, ModelHyperParams.src_pad_idx,
+                ModelHyperParams.trg_pad_idx, ModelHyperParams.max_length,
+                ModelHyperParams.n_head)
+            lr_scheduler.update_learning_rate(data_input)
+            outs = exe.run(fluid.framework.default_main_program(),
+                           feed=data_input,
+                           fetch_list=[cost],
+                           use_program_cache=True)
+            cost_val = np.array(outs[0])
+            print("pass_id = " + str(pass_id) + " batch = " + str(batch_id) +
+                  " cost = " + str(cost_val))
+        # Validate and save the model for inference.
+        val_cost = test(exe)
+        print("pass_id = " + str(pass_id) + " val_cost = " + str(val_cost))
+        fluid.io.save_inference_model(
+            os.path.join(TrainTaskConfig.model_dir,
+                         "pass_" + str(pass_id) + ".infer.model"),
+            encoder_input_data_names + decoder_input_data_names[:-1],
+            [predict], exe)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/fluid/object_detection/README.md b/fluid/object_detection/README.md
new file mode 100644
index 0000000000..4aa2c32865
--- /dev/null
+++ b/fluid/object_detection/README.md
@@ -0,0 +1,8 @@
+The minimum PaddlePaddle version needed for the code sample in this directory is the latest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
+
+---
+
+# MobileNet-SSD
+
+This model, built with Paddle Fluid, is still under active development and is
+not the final version. We welcome feedback.
diff --git a/fluid/object_detection/data/label_list b/fluid/object_detection/data/label_list
new file mode 100644
index 0000000000..87df23ce0a
--- /dev/null
+++ b/fluid/object_detection/data/label_list
@@ -0,0 +1,21 @@
+background
+aeroplane
+bicycle
+bird
+boat
+bottle
+bus
+car
+cat
+chair
+cow
+diningtable
+dog
+horse
+motorbike
+person
+pottedplant
+sheep
+sofa
+train
+tvmonitor
diff --git a/fluid/object_detection/data/prepare_voc_data.py b/fluid/object_detection/data/prepare_voc_data.py
new file mode 100644
index 0000000000..a652956e91
--- /dev/null
+++ b/fluid/object_detection/data/prepare_voc_data.py
@@ -0,0 +1,63 @@
+import os
+import os.path as osp
+import re
+import random
+
+devkit_dir = './VOCdevkit'
+years = ['2007', '2012']
+
+
+def get_dir(devkit_dir, year, type):
+    return osp.join(devkit_dir, 'VOC' + year, type)
+
+
+def walk_dir(devkit_dir, year):
+    filelist_dir = get_dir(devkit_dir, year, 'ImageSets/Main')
+    annotation_dir = get_dir(devkit_dir, year, 'Annotations')
+    img_dir = get_dir(devkit_dir, year, 'JPEGImages')
+    trainval_list = []
+    test_list = []
+    added = set()
+
+    for _, _, files in os.walk(filelist_dir):
+        for fname in files:
+            img_ann_list = []
+            if re.match('[a-z]+_trainval\.txt', fname):
+                img_ann_list = trainval_list
+            elif re.match('[a-z]+_test\.txt', fname):
+                img_ann_list = test_list
+            else:
+                continue
+            fpath = osp.join(filelist_dir, fname)
+            for line in open(fpath):
+                name_prefix = line.strip().split()[0]
+                if name_prefix in added:
+                    continue
+                added.add(name_prefix)
+                ann_path = osp.join(annotation_dir, name_prefix + '.xml')
+                img_path = osp.join(img_dir, name_prefix + '.jpg')
+                assert os.path.isfile(ann_path), 'file %s not found.' % ann_path
+                assert os.path.isfile(img_path), 'file %s not found.' 
% img_path + img_ann_list.append((img_path, ann_path)) + + return trainval_list, test_list + + +def prepare_filelist(devkit_dir, years, output_dir): + trainval_list = [] + test_list = [] + for year in years: + trainval, test = walk_dir(devkit_dir, year) + trainval_list.extend(trainval) + test_list.extend(test) + random.shuffle(trainval_list) + with open(osp.join(output_dir, 'trainval.txt'), 'w') as ftrainval: + for item in trainval_list: + ftrainval.write(item[0] + ' ' + item[1] + '\n') + + with open(osp.join(output_dir, 'test.txt'), 'w') as ftest: + for item in test_list: + ftest.write(item[0] + ' ' + item[1] + '\n') + + +prepare_filelist(devkit_dir, years, '.') diff --git a/fluid/object_detection/image_util.py b/fluid/object_detection/image_util.py new file mode 100644 index 0000000000..781932293e --- /dev/null +++ b/fluid/object_detection/image_util.py @@ -0,0 +1,235 @@ +from PIL import Image, ImageEnhance +import numpy as np +import random +import math + + +class sampler(): + def __init__(self, max_sample, max_trial, min_scale, max_scale, + min_aspect_ratio, max_aspect_ratio, min_jaccard_overlap, + max_jaccard_overlap): + self.max_sample = max_sample + self.max_trial = max_trial + self.min_scale = min_scale + self.max_scale = max_scale + self.min_aspect_ratio = min_aspect_ratio + self.max_aspect_ratio = max_aspect_ratio + self.min_jaccard_overlap = min_jaccard_overlap + self.max_jaccard_overlap = max_jaccard_overlap + + +class bbox(): + def __init__(self, xmin, ymin, xmax, ymax): + self.xmin = xmin + self.ymin = ymin + self.xmax = xmax + self.ymax = ymax + + +def bbox_area(src_bbox): + width = src_bbox.xmax - src_bbox.xmin + height = src_bbox.ymax - src_bbox.ymin + return width * height + + +def generate_sample(sampler): + scale = random.uniform(sampler.min_scale, sampler.max_scale) + min_aspect_ratio = max(sampler.min_aspect_ratio, (scale**2.0)) + max_aspect_ratio = min(sampler.max_aspect_ratio, 1 / (scale**2.0)) + aspect_ratio = random.uniform(min_aspect_ratio, max_aspect_ratio) + bbox_width = scale * (aspect_ratio**0.5) + bbox_height = scale / (aspect_ratio**0.5) + xmin_bound = 1 - bbox_width + ymin_bound = 1 - bbox_height + xmin = random.uniform(0, xmin_bound) + ymin = random.uniform(0, ymin_bound) + xmax = xmin + bbox_width + ymax = ymin + bbox_height + sampled_bbox = bbox(xmin, ymin, xmax, ymax) + return sampled_bbox + + +def jaccard_overlap(sample_bbox, object_bbox): + if sample_bbox.xmin >= object_bbox.xmax or \ + sample_bbox.xmax <= object_bbox.xmin or \ + sample_bbox.ymin >= object_bbox.ymax or \ + sample_bbox.ymax <= object_bbox.ymin: + return 0 + intersect_xmin = max(sample_bbox.xmin, object_bbox.xmin) + intersect_ymin = max(sample_bbox.ymin, object_bbox.ymin) + intersect_xmax = min(sample_bbox.xmax, object_bbox.xmax) + intersect_ymax = min(sample_bbox.ymax, object_bbox.ymax) + intersect_size = (intersect_xmax - intersect_xmin) * ( + intersect_ymax - intersect_ymin) + sample_bbox_size = bbox_area(sample_bbox) + object_bbox_size = bbox_area(object_bbox) + overlap = intersect_size / ( + sample_bbox_size + object_bbox_size - intersect_size) + return overlap + + +def satisfy_sample_constraint(sampler, sample_bbox, bbox_labels): + if sampler.min_jaccard_overlap == 0 and sampler.max_jaccard_overlap == 0: + return True + for i in range(len(bbox_labels)): + object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2], + bbox_labels[i][3], bbox_labels[i][4]) + overlap = jaccard_overlap(sample_bbox, object_bbox) + if sampler.min_jaccard_overlap != 0 and \ + overlap < 
sampler.min_jaccard_overlap: + continue + if sampler.max_jaccard_overlap != 0 and \ + overlap > sampler.max_jaccard_overlap: + continue + return True + return False + + +def generate_batch_samples(batch_sampler, bbox_labels, image_width, + image_height): + sampled_bbox = [] + index = [] + c = 0 + for sampler in batch_sampler: + found = 0 + for i in range(sampler.max_trial): + if found >= sampler.max_sample: + break + sample_bbox = generate_sample(sampler) + if satisfy_sample_constraint(sampler, sample_bbox, bbox_labels): + sampled_bbox.append(sample_bbox) + found = found + 1 + index.append(c) + c = c + 1 + return sampled_bbox + + +def clip_bbox(src_bbox): + src_bbox.xmin = max(min(src_bbox.xmin, 1.0), 0.0) + src_bbox.ymin = max(min(src_bbox.ymin, 1.0), 0.0) + src_bbox.xmax = max(min(src_bbox.xmax, 1.0), 0.0) + src_bbox.ymax = max(min(src_bbox.ymax, 1.0), 0.0) + return src_bbox + + +def meet_emit_constraint(src_bbox, sample_bbox): + center_x = (src_bbox.xmax + src_bbox.xmin) / 2 + center_y = (src_bbox.ymax + src_bbox.ymin) / 2 + if center_x >= sample_bbox.xmin and \ + center_x <= sample_bbox.xmax and \ + center_y >= sample_bbox.ymin and \ + center_y <= sample_bbox.ymax: + return True + return False + + +def transform_labels(bbox_labels, sample_bbox): + proj_bbox = bbox(0, 0, 0, 0) + sample_labels = [] + for i in range(len(bbox_labels)): + sample_label = [] + object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2], + bbox_labels[i][3], bbox_labels[i][4]) + if not meet_emit_constraint(object_bbox, sample_bbox): + continue + sample_width = sample_bbox.xmax - sample_bbox.xmin + sample_height = sample_bbox.ymax - sample_bbox.ymin + proj_bbox.xmin = (object_bbox.xmin - sample_bbox.xmin) / sample_width + proj_bbox.ymin = (object_bbox.ymin - sample_bbox.ymin) / sample_height + proj_bbox.xmax = (object_bbox.xmax - sample_bbox.xmin) / sample_width + proj_bbox.ymax = (object_bbox.ymax - sample_bbox.ymin) / sample_height + proj_bbox = clip_bbox(proj_bbox) + if bbox_area(proj_bbox) > 0: + sample_label.append(bbox_labels[i][0]) + sample_label.append(float(proj_bbox.xmin)) + sample_label.append(float(proj_bbox.ymin)) + sample_label.append(float(proj_bbox.xmax)) + sample_label.append(float(proj_bbox.ymax)) + sample_label.append(bbox_labels[i][5]) + sample_labels.append(sample_label) + return sample_labels + + +def crop_image(img, bbox_labels, sample_bbox, image_width, image_height): + sample_bbox = clip_bbox(sample_bbox) + xmin = int(sample_bbox.xmin * image_width) + xmax = int(sample_bbox.xmax * image_width) + ymin = int(sample_bbox.ymin * image_height) + ymax = int(sample_bbox.ymax * image_height) + sample_img = img[ymin:ymax, xmin:xmax] + sample_labels = transform_labels(bbox_labels, sample_bbox) + return sample_img, sample_labels + + +def random_brightness(img, settings): + prob = random.uniform(0, 1) + if prob < settings._brightness_prob: + delta = random.uniform(-settings._brightness_delta, + settings._brightness_delta) + 1 + img = ImageEnhance.Brightness(img).enhance(delta) + return img + + +def random_contrast(img, settings): + prob = random.uniform(0, 1) + if prob < settings._contrast_prob: + delta = random.uniform(-settings._contrast_delta, + settings._contrast_delta) + 1 + img = ImageEnhance.Contrast(img).enhance(delta) + return img + + +def random_saturation(img, settings): + prob = random.uniform(0, 1) + if prob < settings._saturation_prob: + delta = random.uniform(-settings._saturation_delta, + settings._saturation_delta) + 1 + img = ImageEnhance.Color(img).enhance(delta) + return img + + 
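+# Note: the ImageEnhance-based distortions above use an enhancement factor of
+# 1.0 + delta, where a factor of exactly 1.0 leaves the image unchanged;
+# random_hue below instead shifts the H channel of the HSV representation
+# directly by delta.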
+def random_hue(img, settings):
+    prob = random.uniform(0, 1)
+    if prob < settings._hue_prob:
+        delta = random.uniform(-settings._hue_delta, settings._hue_delta)
+        img_hsv = np.array(img.convert('HSV'))
+        img_hsv[:, :, 0] = img_hsv[:, :, 0] + delta
+        img = Image.fromarray(img_hsv, mode='HSV').convert('RGB')
+    return img
+
+
+def distort_image(img, settings):
+    prob = random.uniform(0, 1)
+    # Apply the distortions in a different order half of the time.
+    if prob > 0.5:
+        img = random_brightness(img, settings)
+        img = random_contrast(img, settings)
+        img = random_saturation(img, settings)
+        img = random_hue(img, settings)
+    else:
+        img = random_brightness(img, settings)
+        img = random_saturation(img, settings)
+        img = random_hue(img, settings)
+        img = random_contrast(img, settings)
+    return img
+
+
+def expand_image(img, bbox_labels, img_width, img_height, settings):
+    prob = random.uniform(0, 1)
+    if prob < settings._expand_prob:
+        expand_ratio = random.uniform(1, settings._expand_max_ratio)
+        if expand_ratio - 1 >= 0.01:
+            height = int(img_height * expand_ratio)
+            width = int(img_width * expand_ratio)
+            h_off = math.floor(random.uniform(0, height - img_height))
+            w_off = math.floor(random.uniform(0, width - img_width))
+            expand_bbox = bbox(-w_off / img_width, -h_off / img_height,
+                               (width - w_off) / img_width,
+                               (height - h_off) / img_height)
+            expand_img = np.ones((height, width, 3))
+            expand_img = np.uint8(expand_img * np.squeeze(settings._img_mean))
+            expand_img = Image.fromarray(expand_img)
+            expand_img.paste(img, (int(w_off), int(h_off)))
+            bbox_labels = transform_labels(bbox_labels, expand_bbox)
+            return expand_img, bbox_labels
+    return img, bbox_labels
diff --git a/fluid/object_detection/load_model.py b/fluid/object_detection/load_model.py
new file mode 100644
index 0000000000..8c7389efea
--- /dev/null
+++ b/fluid/object_detection/load_model.py
@@ -0,0 +1,67 @@
+import paddle.v2 as paddle
+import paddle.fluid as fluid
+import numpy as np
+
+
+# From npy
+def load_vars():
+    vars = {}
+    name_map = {}
+    with open('./ssd_mobilenet_v1_coco/names.map', 'r') as map_file:
+        for param in map_file:
+            fd_name, tf_name = param.strip().split('\t')
+            name_map[fd_name] = tf_name
+
+    tf_vars = np.load(
+        './ssd_mobilenet_v1_coco/ssd_mobilenet_v1_coco_2017_11_17.npy').item()
+    for fd_name in name_map:
+        tf_name = name_map[fd_name]
+        tf_var = tf_vars[tf_name]
+        if len(tf_var.shape) == 4 and 'depthwise' in tf_name:
+            vars[fd_name] = np.transpose(tf_var, (2, 3, 0, 1))
+        elif len(tf_var.shape) == 4:
+            vars[fd_name] = np.transpose(tf_var, (3, 2, 0, 1))
+        else:
+            vars[fd_name] = tf_var
+
+    return vars
+
+
+def load_and_set_vars(place):
+    vars = load_vars()
+    for k, v in vars.items():
+        t = fluid.global_scope().find_var(k).get_tensor()
+        #print(np.array(t).shape, v.shape, k)
+        assert np.array(t).shape == v.shape
+        t.set(v, place)
+
+
+# From Paddle V1
+def load_paddlev1_vars(place):
+    vars = {}
+    name_map = {}
+    with open('./caffe2paddle/names.map', 'r') as map_file:
+        for param in map_file:
+            fd_name, tf_name = param.strip().split('\t')
+            name_map[fd_name] = tf_name
+
+    from operator import mul
+
+    def load(file_name, shape):
+        with open(file_name, 'rb') as f:
+            f.read(16)
+            arr = np.fromfile(f, dtype=np.float32)
+            #print(arr.size, reduce(mul, shape), file_name)
+            assert arr.size == reduce(mul, shape)
+            return arr.reshape(shape)
+
+    for fd_name in name_map:
+        v1_name = name_map[fd_name]
+        t = fluid.global_scope().find_var(fd_name).get_tensor()
+        shape = np.array(t).shape
+        v1_var = load('./caffe2paddle/' + v1_name, shape)
+        t.set(v1_var, 
place) + + +if __name__ == "__main__": + load_vars() diff --git a/fluid/object_detection/mobilenet_ssd.py b/fluid/object_detection/mobilenet_ssd.py new file mode 100644 index 0000000000..21869647aa --- /dev/null +++ b/fluid/object_detection/mobilenet_ssd.py @@ -0,0 +1,120 @@ +import paddle.v2 as paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr + + +def conv_bn(input, + filter_size, + num_filters, + stride, + padding, + channels=None, + num_groups=1, + act='relu', + use_cudnn=True): + parameter_attr = ParamAttr(learning_rate=0.1, initializer=MSRA()) + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=num_groups, + act=None, + use_cudnn=use_cudnn, + param_attr=parameter_attr, + bias_attr=False) + parameter_attr = ParamAttr(learning_rate=0.1, initializer=MSRA()) + bias_attr = ParamAttr(learning_rate=0.2) + return fluid.layers.batch_norm( + input=conv, + act=act, + epsilon=0.00001, + param_attr=parameter_attr, + bias_attr=bias_attr) + + +def depthwise_separable(input, num_filters1, num_filters2, num_groups, stride, + scale): + depthwise_conv = conv_bn( + input=input, + filter_size=3, + num_filters=int(num_filters1 * scale), + stride=stride, + padding=1, + num_groups=int(num_groups * scale), + use_cudnn=False) + + pointwise_conv = conv_bn( + input=depthwise_conv, + filter_size=1, + num_filters=int(num_filters2 * scale), + stride=1, + padding=0) + return pointwise_conv + + +def extra_block(input, num_filters1, num_filters2, num_groups, stride, scale): + # 1x1 conv + pointwise_conv = conv_bn( + input=input, + filter_size=1, + num_filters=int(num_filters1 * scale), + stride=1, + num_groups=int(num_groups * scale), + padding=0) + + # 3x3 conv + normal_conv = conv_bn( + input=pointwise_conv, + filter_size=3, + num_filters=int(num_filters2 * scale), + stride=2, + num_groups=int(num_groups * scale), + padding=1) + return normal_conv + + +def mobile_net(img, img_shape, scale=1.0): + # 300x300 + tmp = conv_bn(img, 3, int(32 * scale), 2, 1, 3) + # 150x150 + tmp = depthwise_separable(tmp, 32, 64, 32, 1, scale) + tmp = depthwise_separable(tmp, 64, 128, 64, 2, scale) + # 75x75 + tmp = depthwise_separable(tmp, 128, 128, 128, 1, scale) + tmp = depthwise_separable(tmp, 128, 256, 128, 2, scale) + # 38x38 + tmp = depthwise_separable(tmp, 256, 256, 256, 1, scale) + tmp = depthwise_separable(tmp, 256, 512, 256, 2, scale) + + # 19x19 + for i in range(5): + tmp = depthwise_separable(tmp, 512, 512, 512, 1, scale) + module11 = tmp + tmp = depthwise_separable(tmp, 512, 1024, 512, 2, scale) + + # 10x10 + module13 = depthwise_separable(tmp, 1024, 1024, 1024, 1, scale) + module14 = extra_block(module13, 256, 512, 1, 2, scale) + # 5x5 + module15 = extra_block(module14, 128, 256, 1, 2, scale) + # 3x3 + module16 = extra_block(module15, 128, 256, 1, 2, scale) + # 2x2 + module17 = extra_block(module16, 64, 128, 1, 2, scale) + mbox_locs, mbox_confs, box, box_var = fluid.layers.multi_box_head( + inputs=[module11, module13, module14, module15, module16, module17], + image=img, + num_classes=21, + min_ratio=20, + max_ratio=90, + min_sizes=[60.0, 105.0, 150.0, 195.0, 240.0, 285.0], + max_sizes=[[], 150.0, 195.0, 240.0, 285.0, 300.0], + aspect_ratios=[[2.], [2., 3.], [2., 3.], [2., 3.], [2., 3.], [2., 3.]], + base_size=img_shape[2], + offset=0.5, + flip=True) + + return mbox_locs, mbox_confs, box, box_var diff --git a/fluid/object_detection/reader.py 
b/fluid/object_detection/reader.py
new file mode 100644
index 0000000000..6a6beb6e50
--- /dev/null
+++ b/fluid/object_detection/reader.py
@@ -0,0 +1,208 @@
+# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import image_util
+from paddle.utils.image_util import *
+import random
+from PIL import Image
+import numpy as np
+import xml.etree.ElementTree
+import os
+
+
+class Settings(object):
+    def __init__(self, data_dir, label_file, resize_h, resize_w, mean_value,
+                 apply_distort, apply_expand):
+        self._data_dir = data_dir
+        self._label_list = []
+        label_fpath = os.path.join(data_dir, label_file)
+        for line in open(label_fpath):
+            self._label_list.append(line.strip())
+
+        self._apply_distort = apply_distort
+        self._apply_expand = apply_expand
+        self._resize_height = resize_h
+        self._resize_width = resize_w
+        self._img_mean = np.array(mean_value)[:, np.newaxis, np.newaxis].astype(
+            'float32')
+        self._expand_prob = 0.5
+        self._expand_max_ratio = 4
+        self._hue_prob = 0.5
+        self._hue_delta = 18
+        self._contrast_prob = 0.5
+        self._contrast_delta = 0.5
+        self._saturation_prob = 0.5
+        self._saturation_delta = 0.5
+        self._brightness_prob = 0.5
+        self._brightness_delta = 0.125
+
+    @property
+    def apply_expand(self):
+        return self._apply_expand
+
+    @property
+    def apply_distort(self):
+        return self._apply_distort
+
+    @property
+    def data_dir(self):
+        return self._data_dir
+
+    @property
+    def label_list(self):
+        return self._label_list
+
+    @property
+    def resize_h(self):
+        return self._resize_height
+
+    @property
+    def resize_w(self):
+        return self._resize_width
+
+    @property
+    def img_mean(self):
+        return self._img_mean
+
+
+def _reader_creator(settings, file_list, mode, shuffle):
+    def reader():
+        with open(file_list) as flist:
+            lines = [line.strip() for line in flist]
+            if shuffle:
+                random.shuffle(lines)
+            for line in lines:
+                if mode == 'train' or mode == 'test':
+                    img_path, label_path = line.split()
+                    img_path = os.path.join(settings.data_dir, img_path)
+                    label_path = os.path.join(settings.data_dir, label_path)
+                elif mode == 'infer':
+                    img_path = os.path.join(settings.data_dir, line)
+
+                img = Image.open(img_path)
+                img_width, img_height = img.size
+
+                # layout: label | xmin | ymin | xmax | ymax | difficult
+                if mode == 'train' or mode == 'test':
+                    bbox_labels = []
+                    root = xml.etree.ElementTree.parse(label_path).getroot()
+                    for object in root.findall('object'):
+                        bbox_sample = []
+                        # start from 1
+                        bbox_sample.append(
+                            float(
+                                settings.label_list.index(
+                                    object.find('name').text)))
+                        bbox = object.find('bndbox')
+                        difficult = float(object.find('difficult').text)
+                        bbox_sample.append(
+                            float(bbox.find('xmin').text) / img_width)
+                        bbox_sample.append(
+                            float(bbox.find('ymin').text) / img_height)
+                        bbox_sample.append(
+                            float(bbox.find('xmax').text) / img_width)
+                        bbox_sample.append(
+                            float(bbox.find('ymax').text) / img_height)
+                        bbox_sample.append(difficult)
+                        bbox_labels.append(bbox_sample)
+
+                    sample_labels = bbox_labels
+                    if mode == 'train':
+                        if settings._apply_distort:
+                            img = image_util.distort_image(img, settings)
+                        if settings._apply_expand:
+                            img, bbox_labels = image_util.expand_image(
+                                img, bbox_labels, img_width, img_height,
+                                settings)
+                        batch_sampler = []
+                        # hard-code here
+                        batch_sampler.append(
+                            image_util.sampler(1, 1, 1.0, 1.0, 1.0, 1.0, 0.0,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.1,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.3,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.5,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.7,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.9,
+                                               0.0))
+                        batch_sampler.append(
+                            image_util.sampler(1, 50, 0.3, 1.0, 0.5, 2.0, 0.0,
+                                               1.0))
+                        """ random crop """
+                        sampled_bbox = image_util.generate_batch_samples(
+                            batch_sampler, bbox_labels, img_width, img_height)
+
+                        img = np.array(img)
+                        if len(sampled_bbox) > 0:
+                            idx = int(random.uniform(0, len(sampled_bbox)))
+                            img, sample_labels = image_util.crop_image(
+                                img, bbox_labels, sampled_bbox[idx], img_width,
+                                img_height)
+
+                        img = Image.fromarray(img)
+                img = img.resize((settings.resize_w, settings.resize_h),
+                                 Image.ANTIALIAS)
+                img = np.array(img)
+
+                if mode == 'train':
+                    mirror = int(random.uniform(0, 2))
+                    if mirror == 1:
+                        img = img[:, ::-1, :]
+                        for i in xrange(len(sample_labels)):
+                            tmp = sample_labels[i][1]
+                            sample_labels[i][1] = 1 - sample_labels[i][3]
+                            sample_labels[i][3] = 1 - tmp
+
+                if len(img.shape) == 3:
+                    img = np.swapaxes(img, 1, 2)
+                    img = np.swapaxes(img, 1, 0)
+
+                img = img[[2, 1, 0], :, :]
+                img = img.astype('float32')
+                img -= settings.img_mean
+                img = img.flatten()
+                img = img * 0.007843
+
+                sample_labels = np.array(sample_labels)
+                if mode == 'train' or mode == 'test':
+                    if mode == 'train' and len(sample_labels) == 0: continue
+                    yield img.astype(
+                        'float32'
+                    ), sample_labels[:, 1:5], sample_labels[:, 0].astype(
+                        'int32'), sample_labels[:, -1].astype('int32')
+                elif mode == 'infer':
+                    yield img.astype('float32')
+
+    return reader
+
+
+def train(settings, file_list, shuffle=True):
+    return _reader_creator(settings, file_list, 'train', shuffle)
+
+
+def test(settings, file_list):
+    return _reader_creator(settings, file_list, 'test', False)
+
+
+def infer(settings, file_list):
+    return _reader_creator(settings, file_list, 'infer', False)
diff --git a/fluid/object_detection/train.py b/fluid/object_detection/train.py
new file mode 100644
index 0000000000..dbd0c8d39b
--- /dev/null
+++ b/fluid/object_detection/train.py
@@ -0,0 +1,147 @@
+import paddle.v2 as paddle
+import paddle.fluid as fluid
+import reader
+import load_model as load_model
+from mobilenet_ssd import mobile_net
+from utility import add_arguments, print_arguments
+import os
+import numpy as np
+import argparse
+import functools
+
+parser = argparse.ArgumentParser(description=__doc__)
+add_arg = functools.partial(add_arguments, argparser=parser)
+# yapf: disable
+add_arg('parallel', bool, True, "Whether to use parallel training.")
+add_arg('use_gpu',  bool, True, "Whether to use GPU.")
+# yapf: enable
+
+
+def train(args,
+          train_file_list,
+          val_file_list,
+          data_args,
+          learning_rate,
+          batch_size,
+          num_passes,
+          model_save_dir='model',
+          init_model_path=None):
+    image_shape = [3, data_args.resize_h, data_args.resize_w]
+
+    image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
+    gt_box = fluid.layers.data(
+        name='gt_box',
shape=[4], dtype='float32', lod_level=1) + gt_label = fluid.layers.data( + name='gt_label', shape=[1], dtype='int32', lod_level=1) + difficult = fluid.layers.data( + name='gt_difficult', shape=[1], dtype='int32', lod_level=1) + + if args.parallel: + places = fluid.layers.get_places() + pd = fluid.layers.ParallelDo(places) + with pd.do(): + image_ = pd.read_input(image) + gt_box_ = pd.read_input(gt_box) + gt_label_ = pd.read_input(gt_label) + difficult_ = pd.read_input(difficult) + locs, confs, box, box_var = mobile_net(image_, image_shape) + loss = fluid.layers.ssd_loss(locs, confs, gt_box_, gt_label_, + box, box_var) + pd.write_output(loss) + pd.write_output(locs) + pd.write_output(confs) + pd.write_output(box) + pd.write_output(box_var) + + loss, locs, confs, box, box_var = pd() + loss = fluid.layers.reduce_sum(loss) + else: + locs, confs, box, box_var = mobile_net(image, image_shape) + nmsed_out = fluid.layers.detection_output( + locs, mbox_confs, box, box_var, nms_threshold=0.45) + loss = fluid.layers.ssd_loss(locs, mbox_confs, gt_box, gt_label, + box, box_var) + loss = fluid.layers.reduce_sum(loss) + + test_program = fluid.default_main_program().clone(for_test=True) + with fluid.program_guard(test_program): + nmsed_out = fluid.layers.detection_output( + locs, confs, box, box_var, nms_threshold=0.45) + map_eval = fluid.evaluator.DetectionMAP( + nmsed_out, + gt_label, + gt_box, + difficult, + 21, + overlap_threshold=0.5, + evaluate_difficult=False, + ap_version='11point') + + boundaries = [40000, 60000] + values = [0.001, 0.0005, 0.00025] + optimizer = fluid.optimizer.RMSProp( + learning_rate=fluid.layers.piecewise_decay(boundaries, values), + regularization=fluid.regularizer.L2Decay(0.00005), ) + + optimizer.minimize(loss) + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + + load_model.load_and_set_vars(place) + #load_model.load_paddlev1_vars(place) + train_reader = paddle.batch( + reader.train(data_args, train_file_list), batch_size=batch_size) + test_reader = paddle.batch( + reader.test(data_args, val_file_list), batch_size=batch_size) + feeder = fluid.DataFeeder( + place=place, feed_list=[image, gt_box, gt_label, difficult]) + + #print 'test_program ', test_program + def test(pass_id): + _, accum_map = map_eval.get_map_var() + map_eval.reset(exe) + test_map = None + for _, data in enumerate(test_reader()): + test_map = exe.run(test_program, + feed=feeder.feed(data), + fetch_list=[accum_map]) + print("Test {0}, map {1}".format(pass_id, test_map[0])) + + #print 'main_program ', fluid.default_main_program() + for pass_id in range(num_passes): + for batch_id, data in enumerate(train_reader()): + loss_v = exe.run(fluid.default_main_program(), + feed=feeder.feed(data), + fetch_list=[loss]) + if batch_id % 20 == 0: + print("Pass {0}, batch {1}, loss {2}" + .format(pass_id, batch_id, loss_v[0])) + test(pass_id) + + if pass_id % 10 == 0: + model_path = os.path.join(model_save_dir, str(pass_id)) + print 'save models to %s' % (model_path) + fluid.io.save_inference_model(model_path, ['image'], [nmsed_out], + exe) + + +if __name__ == '__main__': + args = parser.parse_args() + print_arguments(args) + data_args = reader.Settings( + data_dir='./data', + label_file='label_list', + apply_distort=True, + apply_expand=True, + resize_h=300, + resize_w=300, + mean_value=[127.5, 127.5, 127.5]) + train(args, + train_file_list='./data/trainval.txt', + val_file_list='./data/test.txt', + data_args=data_args, + 
learning_rate=0.001, + batch_size=32, + num_passes=300) diff --git a/fluid/object_detection/utility.py b/fluid/object_detection/utility.py new file mode 100644 index 0000000000..506e6007ce --- /dev/null +++ b/fluid/object_detection/utility.py @@ -0,0 +1,62 @@ +"""Contains common utility functions.""" +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import distutils.util +import numpy as np +from paddle.fluid import core + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).iteritems()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def add_arguments(argname, type, default, help, argparser, **kwargs): + """Add argparse's argument. + + Usage: + + .. 
code-block:: python + + parser = argparse.ArgumentParser() + add_argument("name", str, "Jonh", "User name.", parser) + args = parser.parse_args() + """ + type = distutils.util.strtobool if type == bool else type + argparser.add_argument( + "--" + argname, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) diff --git a/fluid/ocr_recognition/crnn_ctc_model.py b/fluid/ocr_recognition/crnn_ctc_model.py index 4cf95aea19..945cc334c8 100644 --- a/fluid/ocr_recognition/crnn_ctc_model.py +++ b/fluid/ocr_recognition/crnn_ctc_model.py @@ -141,15 +141,55 @@ def encoder_net(images, def ctc_train_net(images, label, args, num_classes): regularizer = fluid.regularizer.L2Decay(args.l2) gradient_clip = None - fc_out = encoder_net( - images, - num_classes, - regularizer=regularizer, - gradient_clip=gradient_clip) + if args.parallel: + places = fluid.layers.get_places() + pd = fluid.layers.ParallelDo(places) + with pd.do(): + images_ = pd.read_input(images) + label_ = pd.read_input(label) + + fc_out = encoder_net( + images_, + num_classes, + regularizer=regularizer, + gradient_clip=gradient_clip) + + cost = fluid.layers.warpctc( + input=fc_out, + label=label_, + blank=num_classes, + norm_by_times=True) + sum_cost = fluid.layers.reduce_sum(cost) + + decoded_out = fluid.layers.ctc_greedy_decoder( + input=fc_out, blank=num_classes) + + pd.write_output(sum_cost) + pd.write_output(decoded_out) + + sum_cost, decoded_out = pd() + sum_cost = fluid.layers.reduce_sum(sum_cost) + + else: + fc_out = encoder_net( + images, + num_classes, + regularizer=regularizer, + gradient_clip=gradient_clip) + + cost = fluid.layers.warpctc( + input=fc_out, label=label, blank=num_classes, norm_by_times=True) + sum_cost = fluid.layers.reduce_sum(cost) + decoded_out = fluid.layers.ctc_greedy_decoder( + input=fc_out, blank=num_classes) - cost = fluid.layers.warpctc( - input=fc_out, label=label, blank=num_classes, norm_by_times=True) - sum_cost = fluid.layers.reduce_sum(cost) + casted_label = fluid.layers.cast(x=label, dtype='int64') + error_evaluator = fluid.evaluator.EditDistance( + input=decoded_out, label=casted_label) + + inference_program = fluid.default_main_program().clone() + with fluid.program_guard(inference_program): + inference_program = fluid.io.get_inference_program(error_evaluator) optimizer = fluid.optimizer.Momentum( learning_rate=args.learning_rate, momentum=args.momentum) @@ -166,7 +206,7 @@ def ctc_train_net(images, label, args, num_classes): casted_label = fluid.layers.cast(x=label, dtype='int64') error_evaluator = fluid.evaluator.EditDistance( input=decoded_out, label=casted_label) - return sum_cost, error_evaluator, model_average + return sum_cost, error_evaluator, inference_program, model_average def ctc_infer(images, num_classes): diff --git a/fluid/ocr_recognition/ctc_train.py b/fluid/ocr_recognition/ctc_train.py index 4a68ebdd2e..a02017ccd0 100644 --- a/fluid/ocr_recognition/ctc_train.py +++ b/fluid/ocr_recognition/ctc_train.py @@ -27,7 +27,7 @@ add_arg('min_average_window', int, 10000, "Min average window.") add_arg('max_average_window', int, 15625, "Max average window.") add_arg('average_window', float, 0.15, "Average window.") - +add_arg('parallel', bool, True, "Whether use parallel training.") # yapf: disable def load_parameter(place): @@ -44,7 +44,7 @@ def train(args, data_reader=dummy_reader): # define network images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32') label = fluid.layers.data(name='label', shape=[1], dtype='int32', lod_level=1) - sum_cost, 
error_evaluator, model_average = ctc_train_net(images, label, args, num_classes) + sum_cost, error_evaluator, inference_program, model_average = ctc_train_net(images, label, args, num_classes) # data reader train_reader = data_reader.train(args.batch_size) @@ -57,8 +57,6 @@ def train(args, data_reader=dummy_reader): exe.run(fluid.default_startup_program()) #load_parameter(place) - inference_program = fluid.io.get_inference_program(error_evaluator) - for pass_id in range(args.pass_num): error_evaluator.reset(exe) batch_id = 1 diff --git a/fluid/policy_gradient/README.md b/fluid/policy_gradient/README.md index 7db11fc44b..b813aa1244 100644 --- a/fluid/policy_gradient/README.md +++ b/fluid/policy_gradient/README.md @@ -1,4 +1,8 @@ -# Policy Gradient RL by PaddlePaddle +运行本目录下的程序示例需要使用PaddlePaddle的最新develop分枝。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + +# Policy Gradient RL by PaddlePaddle 本文介绍了如何使用PaddlePaddle通过policy-based的强化学习方法来训练一个player(actor model), 我们希望这个player可以完成简单的走阶梯任务。 内容分为: diff --git a/fluid/policy_gradient/brain.py b/fluid/policy_gradient/brain.py index bf247932a4..8387833065 100644 --- a/fluid/policy_gradient/brain.py +++ b/fluid/policy_gradient/brain.py @@ -1,6 +1,6 @@ import numpy as np import paddle.v2 as paddle -import paddle.v2.fluid as fluid +import paddle.fluid as fluid # reproducible np.random.seed(1) diff --git a/fluid/sequence_tagging_for_ner/README.md b/fluid/sequence_tagging_for_ner/README.md new file mode 100644 index 0000000000..1f634da4e2 --- /dev/null +++ b/fluid/sequence_tagging_for_ner/README.md @@ -0,0 +1,120 @@ +# 命名实体识别 + +以下是本例的简要目录结构及说明: + +```text +. +├── data # 存储运行本例所依赖的数据,从外部获取 +├── network_conf.py # 模型定义 +├── reader.py # 数据读取接口, 从外部获取 +├── README.md # 文档 +├── train.py # 训练脚本 +├── infer.py # 预测脚本 +├── utils.py # 定义通用的函数, 从外部获取 +└── utils_extend.py # 对utils.py的拓展 +``` + + +## 简介,模型详解 + +在PaddlePaddle v2版本[命名实体识别](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md)中对于命名实体识别任务有较详细的介绍,在本例中不再重复介绍。 +在模型上,我们沿用了v2版本的模型结构,唯一区别是我们使用LSTM代替原始的RNN。 + +## 数据获取 + +请参考PaddlePaddle v2版本[命名实体识别](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md) 一节中数据获取方式,将该例中的data文件夹拷贝至本例目录下,运行其中的download.sh脚本获取训练和测试数据。 + +## 通用脚本获取 + +请将PaddlePaddle v2版本[命名实体识别](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md)中提供的用于数据读取的文件[reader.py](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py)以及包含字典导入等通用功能的文件[utils.py](https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py)复制到本目录下。本例将会使用到这两个脚本。 + +## 训练 + +1. 运行 `sh data/download.sh` +2. 修改 `train.py` 的 `main` 函数,指定数据路径 + + ```python + main( + train_data_file="data/train", + test_data_file="data/test", + vocab_file="data/vocab.txt", + target_file="data/target.txt", + emb_file="data/wordVectors.txt", + model_save_dir="models", + num_passes=1000, + use_gpu=False, + parallel=False) + ``` + +3. 
运行命令 `python train.py` ,**需要注意:直接运行使用的是示例数据,请替换真实的标记数据。** + + ```text + Pass 127, Batch 9525, Cost 4.0867705, Precision 0.3954984, Recall 0.37846154, F1_score0.38679245 + Pass 127, Batch 9530, Cost 3.137265, Precision 0.42971888, Recall 0.38351256, F1_score0.405303 + Pass 127, Batch 9535, Cost 3.6240938, Precision 0.4272152, Recall 0.41795665, F1_score0.4225352 + Pass 127, Batch 9540, Cost 3.5352352, Precision 0.48464164, Recall 0.4536741, F1_score0.46864685 + Pass 127, Batch 9545, Cost 4.1130385, Precision 0.40131578, Recall 0.3836478, F1_score0.39228293 + Pass 127, Batch 9550, Cost 3.6826708, Precision 0.43333334, Recall 0.43730888, F1_score0.43531203 + Pass 127, Batch 9555, Cost 3.6363933, Precision 0.42424244, Recall 0.3962264, F1_score0.4097561 + Pass 127, Batch 9560, Cost 3.6101768, Precision 0.51363635, Recall 0.353125, F1_score0.41851854 + Pass 127, Batch 9565, Cost 3.5935276, Precision 0.5152439, Recall 0.5, F1_score0.5075075 + Pass 127, Batch 9570, Cost 3.4987144, Precision 0.5, Recall 0.4330218, F1_score0.46410686 + Pass 127, Batch 9575, Cost 3.4659843, Precision 0.39864865, Recall 0.38064516, F1_score0.38943896 + Pass 127, Batch 9580, Cost 3.1702557, Precision 0.5, Recall 0.4490446, F1_score0.47315437 + Pass 127, Batch 9585, Cost 3.1587276, Precision 0.49377593, Recall 0.4089347, F1_score0.4473684 + Pass 127, Batch 9590, Cost 3.5043538, Precision 0.4556962, Recall 0.4600639, F1_score0.45786962 + Pass 127, Batch 9595, Cost 2.981989, Precision 0.44981414, Recall 0.45149255, F1_score0.4506518 + [TrainSet] pass_id:127 pass_precision:[0.46023396] pass_recall:[0.43197003] pass_f1_score:[0.44565433] + [TestSet] pass_id:127 pass_precision:[0.4708409] pass_recall:[0.47971722] pass_f1_score:[0.4752376] + ``` +## 预测 +1. 修改 [infer.py](./infer.py) 的 `infer` 函数,指定:需要测试的模型的路径、测试数据、字典文件,预测标记文件的路径,默认参数如下: + + ```python + infer( + model_path="models/params_pass_0", + batch_size=6, + test_data_file="data/test", + vocab_file="data/vocab.txt", + target_file="data/target.txt", + use_gpu=False + ) + ``` + +2. 在终端运行 `python infer.py`,开始测试,会看到如下预测结果(以下为训练70个pass所得模型的部分预测结果): + + ```text + leicestershire B-ORG B-LOC + extended O O + their O O + first O O + innings O O + by O O + DGDG O O + runs O O + before O O + being O O + bowled O O + out O O + for O O + 296 O O + with O O + england B-LOC B-LOC + discard O O + andy B-PER B-PER + caddick I-PER I-PER + taking O O + three O O + for O O + DGDG O O + . O O + ``` + + 输出分为三列,以“\t” 分隔,第一列是输入的词语,第二列是标准结果,第三列为生成的标记结果。多条输入序列之间以空行分隔。 + +## 结果示例 + +
+<p align="center">
+<img src="imgs/convergence_curve.png" width="80%"/><br/>
+图1. 学习曲线, 横轴表示训练轮数,纵轴表示F1值
+</p>
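+以下给出一个绘制上述学习曲线的示意脚本(非本例自带文件;假设训练输出已通过 `python train.py 2>&1 | tee train.log` 保存,日志格式与上文的 `[TrainSet]`/`[TestSet]` 输出一致):
+
+```python
+# 示意:从训练日志中抽取各 pass 的 F1 值并绘制学习曲线。
+# 文件名 train.log、输出图片名均为假设,可按需替换。
+import re
+
+import matplotlib.pyplot as plt
+
+pattern = re.compile(
+    r"\[(TrainSet|TestSet)\] pass_id:(\d+) .*?pass_f1_score:\[([\d.]+)\]")
+curves = {"TrainSet": {}, "TestSet": {}}
+for line in open("train.log"):
+    m = pattern.search(line)
+    if m:
+        curves[m.group(1)][int(m.group(2))] = float(m.group(3))
+
+for name in ("TrainSet", "TestSet"):
+    passes = sorted(curves[name])
+    plt.plot(passes, [curves[name][p] for p in passes], label=name)
+plt.xlabel("pass")
+plt.ylabel("F1")
+plt.legend()
+plt.savefig("convergence_curve.png")
+```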
diff --git a/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png b/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png new file mode 100644 index 0000000000..6b862b751d Binary files /dev/null and b/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png differ diff --git a/fluid/sequence_tagging_for_ner/infer.py b/fluid/sequence_tagging_for_ner/infer.py new file mode 100644 index 0000000000..2d0bd9496e --- /dev/null +++ b/fluid/sequence_tagging_for_ner/infer.py @@ -0,0 +1,71 @@ +import numpy as np + +import paddle.fluid as fluid +import paddle.v2 as paddle + +from network_conf import ner_net +import reader +from utils import load_dict, load_reverse_dict +from utils_extend import to_lodtensor + + +def infer(model_path, batch_size, test_data_file, vocab_file, target_file, + use_gpu): + """ + use the model under model_path to predict the test data, the result will be printed on the screen + + return nothing + """ + word_dict = load_dict(vocab_file) + word_reverse_dict = load_reverse_dict(vocab_file) + + label_dict = load_dict(target_file) + label_reverse_dict = load_reverse_dict(target_file) + + test_data = paddle.batch( + reader.data_reader(test_data_file, word_dict, label_dict), + batch_size=batch_size) + place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + + inference_scope = fluid.core.Scope() + with fluid.scope_guard(inference_scope): + [inference_program, feed_target_names, + fetch_targets] = fluid.io.load_inference_model(model_path, exe) + for data in test_data(): + word = to_lodtensor(map(lambda x: x[0], data), place) + mark = to_lodtensor(map(lambda x: x[1], data), place) + target = to_lodtensor(map(lambda x: x[2], data), place) + crf_decode = exe.run( + inference_program, + feed={"word": word, + "mark": mark, + "target": target}, + fetch_list=fetch_targets, + return_numpy=False) + lod_info = (crf_decode[0].lod())[0] + np_data = np.array(crf_decode[0]) + assert len(data) == len(lod_info) - 1 + for sen_index in xrange(len(data)): + assert len(data[sen_index][0]) == lod_info[ + sen_index + 1] - lod_info[sen_index] + word_index = 0 + for tag_index in xrange(lod_info[sen_index], + lod_info[sen_index + 1]): + word = word_reverse_dict[data[sen_index][0][word_index]] + gold_tag = label_reverse_dict[data[sen_index][2][ + word_index]] + tag = label_reverse_dict[np_data[tag_index][0]] + print word + "\t" + gold_tag + "\t" + tag + word_index += 1 + print "" + + +if __name__ == "__main__": + infer( + model_path="models/params_pass_0", + batch_size=6, + test_data_file="data/test", + vocab_file="data/vocab.txt", + target_file="data/target.txt", + use_gpu=False) diff --git a/fluid/sequence_tagging_for_ner/network_conf.py b/fluid/sequence_tagging_for_ner/network_conf.py new file mode 100644 index 0000000000..5eaa704f67 --- /dev/null +++ b/fluid/sequence_tagging_for_ner/network_conf.py @@ -0,0 +1,127 @@ +import math + +import paddle.fluid as fluid +from paddle.fluid.initializer import NormalInitializer + +from utils import logger, load_dict, get_embedding + + +def ner_net(word_dict_len, label_dict_len, parallel, stack_num=2): + mark_dict_len = 2 + word_dim = 50 + mark_dim = 5 + hidden_dim = 300 + IS_SPARSE = True + embedding_name = 'emb' + + def _net_conf(word, mark, target): + word_embedding = fluid.layers.embedding( + input=word, + size=[word_dict_len, word_dim], + dtype='float32', + is_sparse=IS_SPARSE, + param_attr=fluid.ParamAttr( + name=embedding_name, trainable=False)) + + mark_embedding = fluid.layers.embedding( + input=mark, + 
size=[mark_dict_len, mark_dim], + dtype='float32', + is_sparse=IS_SPARSE) + + word_caps_vector = fluid.layers.concat( + input=[word_embedding, mark_embedding], axis=1) + mix_hidden_lr = 1 + + rnn_para_attr = fluid.ParamAttr( + initializer=NormalInitializer( + loc=0.0, scale=0.0), + learning_rate=mix_hidden_lr) + hidden_para_attr = fluid.ParamAttr( + initializer=NormalInitializer( + loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3)), + learning_rate=mix_hidden_lr) + + hidden = fluid.layers.fc( + input=word_caps_vector, + name="__hidden00__", + size=hidden_dim, + act="tanh", + bias_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3))), + param_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3)))) + fea = [] + for direction in ["fwd", "bwd"]: + for i in range(stack_num): + if i != 0: + hidden = fluid.layers.fc( + name="__hidden%02d_%s__" % (i, direction), + size=hidden_dim, + act="stanh", + bias_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=1.0)), + input=[hidden, rnn[0], rnn[1]], + param_attr=[ + hidden_para_attr, rnn_para_attr, rnn_para_attr + ]) + rnn = fluid.layers.dynamic_lstm( + name="__rnn%02d_%s__" % (i, direction), + input=hidden, + size=hidden_dim, + candidate_activation='relu', + gate_activation='sigmoid', + cell_activation='sigmoid', + bias_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=1.0)), + is_reverse=(i % 2) if direction == "fwd" else not i % 2, + param_attr=rnn_para_attr) + fea += [hidden, rnn[0], rnn[1]] + + rnn_fea = fluid.layers.fc( + size=hidden_dim, + bias_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3))), + act="stanh", + input=fea, + param_attr=[hidden_para_attr, rnn_para_attr, rnn_para_attr] * 2) + + emission = fluid.layers.fc( + size=label_dict_len, + input=rnn_fea, + param_attr=fluid.ParamAttr(initializer=NormalInitializer( + loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3)))) + + crf_cost = fluid.layers.linear_chain_crf( + input=emission, + label=target, + param_attr=fluid.ParamAttr( + name='crfw', + initializer=NormalInitializer( + loc=0.0, scale=(1. 
/ math.sqrt(hidden_dim) / 3)), + learning_rate=mix_hidden_lr)) + avg_cost = fluid.layers.mean(x=crf_cost) + return avg_cost, emission + + word = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1) + mark = fluid.layers.data(name='mark', shape=[1], dtype='int64', lod_level=1) + target = fluid.layers.data( + name="target", shape=[1], dtype='int64', lod_level=1) + + if parallel: + places = fluid.layers.get_places() + pd = fluid.layers.ParallelDo(places) + with pd.do(): + word_ = pd.read_input(word) + mark_ = pd.read_input(mark) + target_ = pd.read_input(target) + avg_cost, emission_base = _net_conf(word_, mark_, target_) + pd.write_output(avg_cost) + pd.write_output(emission_base) + avg_cost_list, emission = pd() + avg_cost = fluid.layers.mean(x=avg_cost_list) + emission.stop_gradient = True + else: + avg_cost, emission = _net_conf(word, mark, target) + + return avg_cost, emission, word, mark, target diff --git a/fluid/sequence_tagging_for_ner/train.py b/fluid/sequence_tagging_for_ner/train.py new file mode 100644 index 0000000000..6ed77cd5ca --- /dev/null +++ b/fluid/sequence_tagging_for_ner/train.py @@ -0,0 +1,122 @@ +import os +import math +import numpy as np + +import paddle.v2 as paddle +import paddle.fluid as fluid + +import reader +from network_conf import ner_net +from utils import logger, load_dict +from utils_extend import to_lodtensor, get_embedding + + +def test(exe, chunk_evaluator, inference_program, test_data, place): + chunk_evaluator.reset(exe) + for data in test_data(): + word = to_lodtensor(map(lambda x: x[0], data), place) + mark = to_lodtensor(map(lambda x: x[1], data), place) + target = to_lodtensor(map(lambda x: x[2], data), place) + acc = exe.run(inference_program, + feed={"word": word, + "mark": mark, + "target": target}) + return chunk_evaluator.eval(exe) + + +def main(train_data_file, test_data_file, vocab_file, target_file, emb_file, + model_save_dir, num_passes, use_gpu, parallel): + if not os.path.exists(model_save_dir): + os.mkdir(model_save_dir) + + BATCH_SIZE = 200 + word_dict = load_dict(vocab_file) + label_dict = load_dict(target_file) + + word_vector_values = get_embedding(emb_file) + + word_dict_len = len(word_dict) + label_dict_len = len(label_dict) + + avg_cost, feature_out, word, mark, target = ner_net( + word_dict_len, label_dict_len, parallel) + + sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3) + sgd_optimizer.minimize(avg_cost) + + crf_decode = fluid.layers.crf_decoding( + input=feature_out, param_attr=fluid.ParamAttr(name='crfw')) + + chunk_evaluator = fluid.evaluator.ChunkEvaluator( + input=crf_decode, + label=target, + chunk_scheme="IOB", + num_chunk_types=int(math.ceil((label_dict_len - 1) / 2.0))) + + inference_program = fluid.default_main_program().clone() + with fluid.program_guard(inference_program): + test_target = chunk_evaluator.metrics + chunk_evaluator.states + inference_program = fluid.io.get_inference_program(test_target) + + train_reader = paddle.batch( + paddle.reader.shuffle( + reader.data_reader(train_data_file, word_dict, label_dict), + buf_size=20000), + batch_size=BATCH_SIZE) + test_reader = paddle.batch( + paddle.reader.shuffle( + reader.data_reader(test_data_file, word_dict, label_dict), + buf_size=20000), + batch_size=BATCH_SIZE) + + place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace() + feeder = fluid.DataFeeder(feed_list=[word, mark, target], place=place) + exe = fluid.Executor(place) + + exe.run(fluid.default_startup_program()) + + embedding_name = 'emb' + embedding_param = 
fluid.global_scope().find_var(embedding_name).get_tensor() + embedding_param.set(word_vector_values, place) + + batch_id = 0 + for pass_id in xrange(num_passes): + chunk_evaluator.reset(exe) + for data in train_reader(): + cost, batch_precision, batch_recall, batch_f1_score = exe.run( + fluid.default_main_program(), + feed=feeder.feed(data), + fetch_list=[avg_cost] + chunk_evaluator.metrics) + if batch_id % 5 == 0: + print("Pass " + str(pass_id) + ", Batch " + str( + batch_id) + ", Cost " + str(cost[0]) + ", Precision " + str( + batch_precision[0]) + ", Recall " + str(batch_recall[0]) + + ", F1_score" + str(batch_f1_score[0])) + batch_id = batch_id + 1 + + pass_precision, pass_recall, pass_f1_score = chunk_evaluator.eval(exe) + print("[TrainSet] pass_id:" + str(pass_id) + " pass_precision:" + str( + pass_precision) + " pass_recall:" + str(pass_recall) + + " pass_f1_score:" + str(pass_f1_score)) + pass_precision, pass_recall, pass_f1_score = test( + exe, chunk_evaluator, inference_program, test_reader, place) + print("[TestSet] pass_id:" + str(pass_id) + " pass_precision:" + str( + pass_precision) + " pass_recall:" + str(pass_recall) + + " pass_f1_score:" + str(pass_f1_score)) + + save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id) + fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'], + [crf_decode], exe) + + +if __name__ == "__main__": + main( + train_data_file="data/train", + test_data_file="data/test", + vocab_file="data/vocab.txt", + target_file="data/target.txt", + emb_file="data/wordVectors.txt", + model_save_dir="models", + num_passes=1000, + use_gpu=False, + parallel=False) diff --git a/fluid/sequence_tagging_for_ner/utils_extend.py b/fluid/sequence_tagging_for_ner/utils_extend.py new file mode 100644 index 0000000000..03e7e62fd5 --- /dev/null +++ b/fluid/sequence_tagging_for_ner/utils_extend.py @@ -0,0 +1,28 @@ +import numpy as np + +import paddle.fluid as fluid + + +def get_embedding(emb_file='data/wordVectors.txt'): + """ + Get the trained word vector. + """ + return np.loadtxt(emb_file, dtype='float32') + + +def to_lodtensor(data, place): + """ + convert data to lodtensor + """ + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res diff --git a/fluid/text_classification/README.md b/fluid/text_classification/README.md index 40df3211d7..500ee6ae6d 100644 --- a/fluid/text_classification/README.md +++ b/fluid/text_classification/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). 
+ +--- + # Text Classification ## Data Preparation diff --git a/fluid/text_classification/train.py b/fluid/text_classification/train.py index 98f63f0867..d32e1c4c87 100644 --- a/fluid/text_classification/train.py +++ b/fluid/text_classification/train.py @@ -5,7 +5,7 @@ import time import paddle.v2 as paddle -import paddle.v2.fluid as fluid +import paddle.fluid as fluid from config import TrainConfig as conf @@ -89,12 +89,14 @@ def main(dict_path): sgd_optimizer = fluid.optimizer.SGD(learning_rate=conf.learning_rate) sgd_optimizer.minimize(avg_cost) - accuracy = fluid.evaluator.Accuracy(input=prediction, label=label) + batch_size_var = fluid.layers.create_tensor(dtype='int64') + batch_acc_var = fluid.layers.accuracy( + input=prediction, label=label, total=batch_size_var) inference_program = fluid.default_main_program().clone() with fluid.program_guard(inference_program): - test_target = accuracy.metrics + accuracy.states - inference_program = fluid.io.get_inference_program(test_target) + inference_program = fluid.io.get_inference_program( + target_vars=[batch_acc_var, batch_size_var]) # The training data set. train_reader = paddle.batch( @@ -119,31 +121,37 @@ def main(dict_path): exe.run(fluid.default_startup_program()) + train_pass_acc_evaluator = fluid.average.WeightedAverage() + test_pass_acc_evaluator = fluid.average.WeightedAverage() + def test(exe): - accuracy.reset(exe) + test_pass_acc_evaluator.reset() for batch_id, data in enumerate(test_reader()): input_seq = to_lodtensor(map(lambda x: x[0], data), place) y_data = np.array(map(lambda x: x[1], data)).astype("int64") y_data = y_data.reshape([-1, 1]) - acc = exe.run(inference_program, - feed={"words": input_seq, - "label": y_data}) - test_acc = accuracy.eval(exe) + b_acc, b_size = exe.run(inference_program, + feed={"words": input_seq, + "label": y_data}, + fetch_list=[batch_acc_var, batch_size_var]) + test_pass_acc_evaluator.add(value=b_acc, weight=b_size) + test_acc = test_pass_acc_evaluator.eval() return test_acc total_time = 0. 
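+    # Note: the pass-accuracy evaluators below accumulate batch accuracies
+    # weighted by the batch sample count, so .eval() returns the
+    # weighted-average accuracy over all samples seen in the current pass.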
for pass_id in xrange(conf.num_passes): - accuracy.reset(exe) + train_pass_acc_evaluator.reset() start_time = time.time() for batch_id, data in enumerate(train_reader()): - cost_val, acc_val = exe.run( + cost_val, acc_val, size_val = exe.run( fluid.default_main_program(), feed=feeder.feed(data), - fetch_list=[avg_cost, accuracy.metrics[0]]) - pass_acc = accuracy.eval(exe) + fetch_list=[avg_cost, batch_acc_var, batch_size_var]) + train_pass_acc_evaluator.add(value=acc_val, weight=size_val) if batch_id and batch_id % conf.log_period == 0: - print("Pass id: %d, batch id: %d, cost: %f, pass_acc %f" % - (pass_id, batch_id, cost_val, pass_acc)) + print("Pass id: %d, batch id: %d, cost: %f, pass_acc: %f" % + (pass_id, batch_id, cost_val, + train_pass_acc_evaluator.eval())) end_time = time.time() total_time += (end_time - start_time) pass_test_acc = test(exe) diff --git a/generate_chinese_poetry/README.md b/generate_chinese_poetry/README.md index 1f6bef0da8..c1ea001090 100644 --- a/generate_chinese_poetry/README.md +++ b/generate_chinese_poetry/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 中国古诗生成 ## 简介 diff --git a/generate_sequence_by_rnn_lm/README.md b/generate_sequence_by_rnn_lm/README.md index b804e52854..afa543334f 100644 --- a/generate_sequence_by_rnn_lm/README.md +++ b/generate_sequence_by_rnn_lm/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 使用循环神经网语言模型生成文本 语言模型(Language Model)是一个概率分布模型,简单来说,就是用来计算一个句子的概率的模型。利用它可以确定哪个词序列的可能性更大,或者给定若干个词,可以预测下一个最可能出现的词。语言模型是自然语言处理领域里一个重要的基础模型。 diff --git a/globally_normalized_reader/README.cn.md b/globally_normalized_reader/README.cn.md new file mode 100644 index 0000000000..b1d3910754 --- /dev/null +++ b/globally_normalized_reader/README.cn.md @@ -0,0 +1,59 @@ +此目录中代码示例PaddlePaddle所需版本至少为v0.11.0。如果您使用的PaddlePaddle版本早于v0.11.0, [请更新](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). + +--- + +# 全球标准化阅读器 + +该模型实现以下功能: + +Jonathan Raiman and John Miller. Globally Normalized Reader. Empirical Methods in Natural Language Processing (EMNLP), 2017 + +如果您在研究中使用数据集/代码,请引用上述论文: + +```text +@inproceedings{raiman2015gnr, + author={Raiman, Jonathan and Miller, John}, + booktitle={Empirical Methods in Natural Language Processing (EMNLP)}, + title={Globally Normalized Reader}, + year={2017}, +} +``` + +您也可以访问 https://github.com/baidu-research/GloballyNormalizedReader 以获取更多信息。 + + +# 安装 + +1. 请使用 [docker image](http://doc.paddlepaddle.org/develop/doc/getstarted/build_and_install/docker_install_en.html) 安装最新的PaddlePaddle,运行方法: + ```bash + docker pull paddledev/paddle + ``` +2. 下载所有必要的数据,运行方法: + ```bash + cd data && ./download.sh && cd .. + ``` +3. 
预处理并特征化数据: + ```bash + python featurize.py --datadir data --outdir data/featurized --glove-path data/glove.840B.300d.txt + ``` + +# 模型训练 + +- 根据需要修改config.py来配置模型,然后运行: + + ```bash + python train.py 2>&1 | tee train.log + ``` + +# 使用训练过的模型推断 + +- 运行以下训练模型来推断: + ```bash + python infer.py \ + --model_path models/pass_00000.tar.gz \ + --data_dir data/featurized/ \ + --batch_size 2 \ + --use_gpu 0 \ + --trainer_count 1 \ + 2>&1 | tee infer.log + ``` diff --git a/globally_normalized_reader/README.md b/globally_normalized_reader/README.md index ca223ac75b..9763a1c04f 100644 --- a/globally_normalized_reader/README.md +++ b/globally_normalized_reader/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). + +--- + # Globally Normalized Reader This model implements the work in the following paper: diff --git a/hsigmoid/README.md b/hsigmoid/README.md index 5e891bce4e..619fc190ac 100644 --- a/hsigmoid/README.md +++ b/hsigmoid/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # Hsigmoid加速词向量训练 ## 背景介绍 在自然语言处理领域中,传统做法通常使用one-hot向量来表示词,比如词典为['我', '你', '喜欢'],可以用[1,0,0]、[0,1,0]和[0,0,1]这三个向量分别表示'我'、'你'和'喜欢'。这种表示方式比较简洁,但是当词表很大时,容易产生维度爆炸问题;而且任意两个词的向量是正交的,向量包含的信息有限。为了避免或减轻one-hot表示的缺点,目前通常使用词向量来取代one-hot表示,词向量也就是word embedding,即使用一个低维稠密的实向量取代高维稀疏的one-hot向量。训练词向量的方法有很多种,神经网络模型是其中之一,包括CBOW、Skip-gram等,这些模型本质上都是一个分类模型,当词表较大即类别较多时,传统的softmax将非常消耗时间。PaddlePaddle提供了Hsigmoid Layer、NCE Layer,来加速模型的训练过程。本文主要介绍如何使用Hsigmoid Layer来加速训练,词向量相关内容请查阅PaddlePaddle Book中的[词向量章节](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)。 diff --git a/image_classification/README.md b/image_classification/README.md index 45d8ce5742..f041185acc 100644 --- a/image_classification/README.md +++ b/image_classification/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + 图像分类 ======================= diff --git a/ltr/README.md b/ltr/README.md index 3cc84494f7..e7ce9f9215 100644 --- a/ltr/README.md +++ b/ltr/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 排序学习(Learning To Rank) 排序学习技术\[[1](#参考文献1)\]是构建排序模型的机器学习方法,在信息检索、自然语言处理,数据挖掘等机器学场景中具有重要作用。排序学习的主要目的是对给定一组文档,对任意查询请求给出反映相关性的文档排序。在本例子中,利用标注过的语料库训练两种经典排序模型RankNet[[4](#参考文献4)\]和LamdaRank[[6](#参考文献6)\],分别可以生成对应的排序模型,能够对任意查询请求,给出相关性文档排序。 diff --git a/mt_with_external_memory/README.md b/mt_with_external_memory/README.md index 413526a5b5..6643b4eb6c 100644 --- a/mt_with_external_memory/README.md +++ b/mt_with_external_memory/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 带外部记忆机制的神经机器翻译 带**外部记忆**(External Memory)机制的神经机器翻译模型(Neural Machine Translation, NMT),是神经机器翻译模型的一个重要扩展。它引入可微分的记忆网络作为额外的记忆单元,拓展神经翻译模型内部工作记忆(Working 
Memory)的容量或带宽,辅助完成翻译等任务中信息的临时存取,改善模型表现。 @@ -112,7 +116,7 @@ 算法实现于以下几个文件中: - `external_memory.py`: 主要实现简化版的 **神经图灵机** 于 `ExternalMemory` 类,对外提供初始化和读写函数。 -- `model.py`: 相关模型配置函数,包括双向 GPU 编码器(`bidirectional_gru_encoder`),带外部记忆强化的解码器(`memory_enhanced_decoder`),带外部记忆强化的序列到序列模型(`memory_enhanced_decoder`)。 +- `model.py`: 相关模型配置函数,包括双向 GPU 编码器(`bidirectional_gru_encoder`),带外部记忆强化的解码器(`memory_enhanced_decoder`),带外部记忆强化的序列到序列模型(`memory_enhanced_seq2seq`)。 - `data_utils.py`: 相关数据处理辅助函数。 - `train.py`: 模型训练。 - `infer.py`: 部分示例样本的翻译(模型推断)。 @@ -166,6 +170,7 @@ class ExternalMemory(object): a learnable gate function. :type enable_interpolation: bool """ + pass def _content_addressing(self, key_vector): """Get write/read head's addressing weights via content-based addressing. @@ -190,6 +195,7 @@ class ExternalMemory(object): :param write_key: Key vector for write heads to generate writing content and addressing signals. :type write_key: LayerOutput + """ pass def read(self, read_key): @@ -406,7 +412,7 @@ paddle.dataset.wmt14.test(dict_size) 命令行输入: ```bash -python mt_with_external_memory.py +python train.py ``` 或自定义部分参数, 例如: diff --git a/nce_cost/README.md b/nce_cost/README.md index 1792c41b8d..25864ada5c 100644 --- a/nce_cost/README.md +++ b/nce_cost/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 使用噪声对比估计加速语言模型训练 ## 为什么需要噪声对比估计 @@ -101,11 +105,11 @@ return paddle.layer.nce( NCE 层的一些重要参数解释如下: -| 参数名 | 参数作用 | 介绍 | -|:------ |:-------| :--------| -| param\_attr / bias\_attr | 用来设置参数名字 |方便预测阶段加载参数,具体在预测一节中介绍。| -| num\_neg\_samples | 负样本采样个数|可以控制正负样本比例,这个值取值区间为 [1, 字典大小-1],负样本个数越多则整个模型的训练速度越慢,模型精度也会越高 | -| neg\_distribution | 生成负样例标签的分布,默认是一个均匀分布| 可以自行控制负样本采样时各个类别的采样权重。例如:希望正样例为“晴天”时,负样例“洪水”在训练时更被着重区分,则可以将“洪水”这个类别的采样权重增加| +| 参数名 | 参数作用 | 介绍 | +| :----------------------- | :--------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------- | +| param\_attr / bias\_attr | 用来设置参数名字 | 方便预测阶段加载参数,具体在预测一节中介绍。 | +| num\_neg\_samples | 负样本采样个数 | 可以控制正负样本比例,这个值取值区间为 [1, 字典大小-1],负样本个数越多则整个模型的训练速度越慢,模型精度也会越高 | +| neg\_distribution | 生成负样例标签的分布,默认是一个均匀分布 | 可以自行控制负样本采样时各个类别的采样权重。例如:希望正样例为“晴天”时,负样例“洪水”在训练时更被着重区分,则可以将“洪水”这个类别的采样权重增加 | ## 预测 1. 在命令行运行 : diff --git a/nested_sequence/text_classification/README.md b/nested_sequence/text_classification/README.md index db6f2bc65a..0509ac342b 100644 --- a/nested_sequence/text_classification/README.md +++ b/nested_sequence/text_classification/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 基于双层序列的文本分类 ## 简介 diff --git a/neural_qa/README.md b/neural_qa/README.md index 7744493fab..a19d702067 100644 --- a/neural_qa/README.md +++ b/neural_qa/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). 
+ +--- + # Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering This model implements the work in the following paper: diff --git a/nmt_without_attention/README.md b/nmt_without_attention/README.md index aad847211d..deb7ff58ee 100644 --- a/nmt_without_attention/README.md +++ b/nmt_without_attention/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). + +--- + # Neural Machine Translation Model ## Background Introduction diff --git a/scene_text_recognition/README.md b/scene_text_recognition/README.md index 9974d1d74b..f10b4c0d5a 100644 --- a/scene_text_recognition/README.md +++ b/scene_text_recognition/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 场景文字识别 (STR, Scene Text Recognition) ## STR任务简介 diff --git a/scheduled_sampling/README.md b/scheduled_sampling/README.md index 4691c1f8be..2a33f3b248 100644 --- a/scheduled_sampling/README.md +++ b/scheduled_sampling/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # Scheduled Sampling ## 概述 diff --git a/sequence_tagging_for_ner/README.md b/sequence_tagging_for_ner/README.md index cea72acc69..9870e3cf2e 100644 --- a/sequence_tagging_for_ner/README.md +++ b/sequence_tagging_for_ner/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 命名实体识别 以下是本例的简要目录结构及说明: @@ -21,16 +25,16 @@ 命名实体识别(Named Entity Recognition,NER)又称作“专名识别”,是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,是自然语言处理研究的一个基础问题。NER任务通常包括实体边界识别、确定实体类别两部分,可以将其作为序列标注问题解决。 -序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例: +序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://www.paddlepaddle.org/docs/develop/book/07.label_semantic_roles/index.cn.html)定义的标签集,如下是一个NER的标注结果示例:
<div align="center">
<img src="images/BIO tag example.png" width="80%"><br>
图1. BIO标注方法示例
</div>
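
例如,句子 "u.n. official ekeus heads for baghdad ." 按BIO方法标注的结果如下(与下文数据预处理一节中的示例一致):

```text
u.n./B-ORG  official/O  ekeus/B-PER  heads/O  for/O  baghdad/B-LOC  ./O
```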
-根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。 +根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://www.paddlepaddle.org/docs/develop/book/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。 -由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN等 序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。 +由于序列标注问题的广泛性,产生了[CRF](http://www.paddlepaddle.org/docs/develop/book/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN等 序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。 ## 模型详解 @@ -88,14 +92,14 @@ Baghdad NNP I-NP I-LOC 预处理完成后,一条训练样本包含3个部分作为神经网络的输入信息用于训练:(1)句子序列;(2)首字母大写标记序列;(3)标注序列,下表是一条训练样本的示例: | 句子序列 | 大写标记序列 | 标注序列 | -|---|---|---| -| u.n. | 1 | B-ORG | -| official | 0 | O | -| ekeus | 1 | B-PER | -| heads | 0 | O | -| for | 0 | O | -| baghdad | 1 | B-LOC | -| . | 0 | O | +| -------- | ------------ | -------- | +| u.n. | 1 | B-ORG | +| official | 0 | O | +| ekeus | 1 | B-PER | +| heads | 0 | O | +| for | 0 | O | +| baghdad | 1 | B-LOC | +| . | 0 | O | ## 运行 ### 编写数据读取接口 diff --git a/sequence_tagging_for_ner/images/BIO tag example.png b/sequence_tagging_for_ner/images/BIO tag example.png new file mode 100644 index 0000000000..88ee9e84b7 Binary files /dev/null and b/sequence_tagging_for_ner/images/BIO tag example.png differ diff --git a/sequence_tagging_for_ner/images/ner_model_en.png b/sequence_tagging_for_ner/images/ner_model_en.png new file mode 100644 index 0000000000..da541cda7e Binary files /dev/null and b/sequence_tagging_for_ner/images/ner_model_en.png differ diff --git a/ssd/README.cn.md b/ssd/README.cn.md index b514418205..2e510908a4 100644 --- a/ssd/README.cn.md +++ b/ssd/README.cn.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # SSD目标检测 ## 概述 SSD全称:Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一\[[1](#引用)\],有着检测速度快且检测精度高的有的。PaddlePaddle已集成SSD算法,本示例旨在介绍如何使用PaddlePaddle中的SSD模型进行目标检测。下文首先简要介绍SSD原理,然后介绍示例包含文件及如何使用,接着介绍如何在PASCAL VOC数据集上训练、评估及检测,最后简要介绍如何在自有数据集上使用SSD。 diff --git a/ssd/README.md b/ssd/README.md index 99856a69d2..22ac492f49 100644 --- a/ssd/README.md +++ b/ssd/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). 
+ +--- + # Single Shot MultiBox Detector (SSD) Object Detection ## Introduction diff --git a/text_classification/README.md b/text_classification/README.md index 191ab20f2e..0617e19d30 100644 --- a/text_classification/README.md +++ b/text_classification/README.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # 文本分类 以下是本例目录包含的文件以及对应说明: @@ -129,70 +133,70 @@ negative 0.0300 0.9700 i love scifi and am willing to put up with a lot 1. 数据组织 - 假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据: + 假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据: - ``` - positive PaddlePaddle is good - negative What a terrible weather - ``` + ``` + positive PaddlePaddle is good + negative What a terrible weather + ``` 2. 编写数据读取接口 - 自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。 - ```python - def train_reader(data_dir, word_dict, label_dict): - def reader(): - UNK_ID = word_dict[""] - word_col = 0 - lbl_col = 1 - - for file_name in os.listdir(data_dir): - with open(os.path.join(data_dir, file_name), "r") as f: - for line in f: - line_split = line.strip().split("\t") - word_ids = [ - word_dict.get(w, UNK_ID) - for w in line_split[word_col].split() - ] - yield word_ids, label_dict[line_split[lbl_col]] - - return reader - ``` - - - 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。 - - 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。 - - 接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下: - - ```python - train_reader = paddle.batch( - paddle.reader.shuffle( - reader.train_reader(train_data_dir, word_dict, lbl_dict), - buf_size=1000), - batch_size=batch_size) - ``` + 自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。 + ```python + def train_reader(data_dir, word_dict, label_dict): + def reader(): + UNK_ID = word_dict[""] + word_col = 0 + lbl_col = 1 + + for file_name in os.listdir(data_dir): + with open(os.path.join(data_dir, file_name), "r") as f: + for line in f: + line_split = line.strip().split("\t") + word_ids = [ + word_dict.get(w, UNK_ID) + for w in line_split[word_col].split() + ] + yield word_ids, label_dict[line_split[lbl_col]] + + return reader + ``` + + - 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。 + - 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。 + + 接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下: + + ```python + train_reader = paddle.batch( + paddle.reader.shuffle( + reader.train_reader(train_data_dir, word_dict, lbl_dict), + buf_size=1000), + batch_size=batch_size) + ``` 3. 
修改命令行参数 - - 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。 - - 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下: - - `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。 - - `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。 - - `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。 - - `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。 - - `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。 - - `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。 - - `num_passes`:指定训练多少个轮次。 + - 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。 + - 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下: + - `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。 + - `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。 + - `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。 + - `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。 + - `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。 + - `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。 + - `num_passes`:指定训练多少个轮次。 ### 如何预测 1. 修改 `infer.py` 中以下变量,指定使用的模型、指定测试数据。 - ```python - model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径 - nn_type = "dnn" # 指定测试使用的模型 - test_dir = "./data/test" # 指定测试文件所在的目录 - word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径 - label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径 - ``` + ```python + model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径 + nn_type = "dnn" # 指定测试使用的模型 + test_dir = "./data/test" # 指定测试文件所在的目录 + word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径 + label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径 + ``` 2. 在终端中执行 `python infer.py`。 diff --git a/youtube_recall/README.cn.md b/youtube_recall/README.cn.md index 2f20416bb9..6628a6269b 100644 --- a/youtube_recall/README.cn.md +++ b/youtube_recall/README.cn.md @@ -1,3 +1,7 @@ +运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。 + +--- + # Youtube DNN推荐模型 以下是本例目录包含的文件以及对应说明: diff --git a/youtube_recall/README.md b/youtube_recall/README.md index b67bd33660..b9912abeb8 100644 --- a/youtube_recall/README.md +++ b/youtube_recall/README.md @@ -1,3 +1,7 @@ +The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html). + +--- + # Deep Neural Networks for YouTube Recommendations ## Introduction\[[1](#References)\]