
New API Design Doc #1297

Merged — 9 commits, merged into PaddlePaddle:develop on Feb 13, 2017

Conversation

@wangkuiyi (Collaborator, Author):

No description provided.


```
Optimizer = {Model*, Evaluator*, GradientMachine*}
- train
- update
```
Contributor:

I think update doesn't need to be exposed publicly (it is only used by train).

@wangkuiyi (Collaborator, Author):

If we were going to use composition rather than derivation, I would agree that we should have an Optimizer, which has the train method, and an Updater, which has the update method. But since Python supports derivation, we can create AdamOptimizer based on SGDOptimizer by simply overriding Optimizer::update.
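A minimal sketch of the derivation approach (all class and method names here are illustrative assumptions, not the final API):

```
# Sketch only: SGDOptimizer drives the training loop; AdamOptimizer
# customizes nothing but the update rule, exactly as argued above.
class SGDOptimizer(object):
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, parameters, gradients):
        # Plain SGD: w <- w - lr * g
        for name, grad in gradients.items():
            parameters[name] -= self.learning_rate * grad

    def train(self, parameters, gradient_fn, reader):
        # train() owns the loop; subclasses override update() only.
        for batch in reader:
            self.update(parameters, gradient_fn(parameters, batch))


class AdamOptimizer(SGDOptimizer):
    def update(self, parameters, gradients):
        # Adam would override only the update rule; train() is inherited.
        ...
```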

```
Optimizer = {Model*, Evaluator*, GradientMachine*}
- train
- update
- checkpoint
```
@helinwang (Contributor), Feb 10, 2017:

Does this checkpoint method mean saving everything (model topology, parameters, optimizer states)?

I think checkpoint should be called internally from train, and not exposed. Instead, expose:

```
sgd.save()  # save only the internal state of the optimizer
paddle.optimizer.SGD(serialized_state)  # create an SGD optimizer from serialized_state
```

This way everything is orthogonal (whereas before, optimizer.checkpoint included model.save), and the user has the option to save and load whatever they want (typically when they implement train manually):

```
# save
model.save(path)
sgd_state = sgd.save()

# load
model.load(path)
sgd = paddle.optimizer.SGD(sgd_state)
```

@wangkuiyi (Collaborator, Author):

Actually, I don't have a clear idea about checkpointing yet. We should consult other team members about what checkpointing needs to do.

learning and try to figure their relationship, as shown below:

```
Model = {topology, parameters}
```
@helinwang (Contributor), Feb 10, 2017:

One comment for the future, when we add bindings for more languages (or even now, if people use Python to create the topology and train with C++):

Currently the model parameters are initialized during model creation: model = paddle.model.create(). model.save() always saves both parameters and topology, and model.load() loads both as well. We may need a method model.init() to initialize the model parameters again.

The use case: a user creates the topology from Python (because it's easier) and uses it in C++ (or Go in the future). The user would like the option to initialize the parameters again in order to start training from somewhere else in the parameter space.

From this use case we can see that topology and parameters are not tightly coupled. Maybe we should instead expose:

```
save_topology()
save_parameter()
load_topology()
load_parameter()
```

@wangkuiyi (Collaborator, Author), Feb 10, 2017:

I think "creating topology using Python and run training using C++/Go" is not a common case, but due to that Tensorflow doesn't provide a language binding that works both good for specifying topology and for training. In other words, I don't see a chance that we need to load only the topology, or only parameters, but not both.

Collaborator:

Maybe serialize and deserialize are better names.

Collaborator:

We should decouple the network topology and the parameters, as @helinwang said. Another reason is that a model's parameters could be reused by another model, so we should give users an interface to save/load the topology alone.

So I think we can get each parameter and the network topology from a model. Saving a model is then just serializing them one by one, and the user can customize the whole serialization process.

The user can also manipulate every parameter of the model via get_parameter, like param = model.get_parameter(name); param.randomize().

But the network topology seems to be static, or read-only, once a model is created. The user cannot change network_topology after a model has been created, because the configuration of all parameters changes when network_topology changes.

```
class Model(object):
    def __init__(self, network_topology):
        self.__network_topology__ = network_topology

    def get_parameter(self, param_name):
        ...

    def get_parameters(self):  # all parameters, in a fixed order
        ...

    def get_topology(self):
        ...

    def serialize(self, stream):
        stream.write("Model File Header")
        self.get_topology().serialize(stream)
        for param in self.get_parameters():
            param.serialize(stream)

    def deserialize(self, stream):
        header = stream.read(len("Model File Header"))  # consume the header
        self.get_topology().deserialize(stream)
        for param in self.get_parameters():
            param.deserialize(stream)
```
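Hypothetical usage of this interface (the file name and mode are illustrative):

```
# save
with open('model.bin', 'wb') as stream:
    model.serialize(stream)

# load
with open('model.bin', 'rb') as stream:
    model.deserialize(stream)

# manipulate a single parameter
param = model.get_parameter('proj')
param.randomize()
```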

Member:

Should we add a concept of Cost? There are several kinds of cost, and different costs may compose into a new kind of Cost.

@wangkuiyi (Collaborator, Author), Feb 10, 2017:

If we consider "a model's parameters could be reused by another model", then we need to understand the reuse process. Could you write some pseudocode to try it out, so we can understand the problem?

I gave it a try:

```
data = layer.data(...)
proj = layer.fc(data, ..., parameter_name="proj")
out = layer.softmax(proj, ...)
m = model.create(out)
m.parameter("proj").deserialize(stream)
```

There are clearly a few problems:

1. Where does stream come from?
2. Where do the model's other parameters come from?

If we think these two questions and their follow-ups all the way through, and describe them clearly, how much longer will this document grow? How much time will we need to implement it? And how does it help the first-phase goal of "making the current eight book chapters shorter"?

What we are discussing now is a framework, like the original design of Unix (a file is a stream of bytes; devices are represented by inodes in the file system). Unix back then was nowhere near as powerful as it is today, but its core design could accommodate the many features added later, so it was a good design.

Coming up with new features is not hard; knowing what you can afford not to do is the real skill. Given the current design, can the parameter-reuse feature be added later?

```
GradientMachine = {Evaluator*, gradients}
- backward

Optimizer = {Model*, Evaluator*, GradientMachine*}
```
Collaborator:

```
Optimizer = {GradientMachine*}

GradientMachine -> Evaluator -> Model
```
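A sketch of the ownership chain this suggests, where each object holds only a pointer to the next (constructor names are hypothetical):

```
model = paddle.model.create(topology)     # Model = {topology, parameters}
evaluator = paddle.Evaluator(model)       # Evaluator -> Model
gm = paddle.GradientMachine(evaluator)    # GradientMachine -> Evaluator
optimizer = paddle.optimizer.SGD(gm)      # Optimizer = {GradientMachine*}
```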


```
Evaluator = {Model*, activations}
- forward
- test(cost, ...)
```
@reyoung (Collaborator), Feb 10, 2017:

Should the Evaluator's API contain only a forward method?

```
result = evaluator.forward(
    input_data,
    cost=CrossEntropy(input_layer='predict', label=label_data),
    metric=[ErrorRate(input_layer='predict', label=label_data),
            RecallRate(input_layer='predict', label=label_data)],
    begin=None, end=None, drop_previous_activation=False)

print result.cost
print result.metrics[0]  # error rate
print result.metrics[1]  # recall
print result.activation("predict")  # get the predict layer's activation
```

Reasons (a sketch of such a forward follows this list):

- During training, the forward method should add a cost function, while during inference the cost and metrics are not necessary. So forward should at least take a cost argument.
- We need to monitor the training error rate, so adding a metric argument to forward helps us monitor the training error rate, etc.
- Add a drop_previous_activation argument because, when doing only inference or testing, the middle layers' activations can be dropped to reduce memory usage. Only when drop_previous_activation=False can this forward pass be followed by a backward pass.
- Add begin and end arguments because:
  - We may want to forward only a sub-part of a model. Changing begin or end for a forward pass changes the required input data. In the network below, when only C is needed, the input of D is not necessary, so specifying end is useful.

    ```
    A --> B --> C
                 \
                  -->G -->H -->I
                 /
    D --> E --> F
    ```

  - In beam search, a very common tree-search method used in RNN generative model inference, we have to forward a subset of an RNN many times. Each node of the search tree can hold a subset of the RNN's forwarding state (activations). So begin is very useful.
- It also seems that the activations should be part of the forward method's result, not a data member of the Evaluator.
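A minimal sketch of what such a stateless forward could look like (ForwardResult, topological_order, layer.compute, and release_unneeded are hypothetical names, not PaddlePaddle API):

```
# Sketch under the assumptions above: activations are returned in the
# result instead of being stored in the Evaluator.
class ForwardResult(object):
    def __init__(self, cost, metrics, activations):
        self.cost = cost
        self.metrics = metrics
        self._activations = activations

    def activation(self, layer_name):
        return self._activations[layer_name]


def forward(model, input_data, cost=None, metric=None,
            begin=None, end=None, drop_previous_activation=False):
    activations = dict(input_data)  # seed with the input layers
    for layer in model.topological_order(begin=begin, end=end):
        activations[layer.name] = layer.compute(activations)
        if drop_previous_activation:
            # inference only: free tensors no downstream layer still needs
            model.release_unneeded(activations, after=layer)
    cost_value = cost.compute(activations) if cost else None
    metric_values = [m.compute(activations) for m in (metric or [])]
    return ForwardResult(cost_value, metric_values, activations)
```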

@reyoung (Collaborator), Feb 10, 2017:

Even more aggressively: is Evaluator a necessary concept for Paddle at all?

It seems that if we treat the activations as the result of forwarding a neural network, the Evaluator stores nothing. Is a global forward method enough? Or should the forward method be part of the neural network Model?

And if forward is part of Model, then Model should be bound to a device (one GPU card, or all CPUs), since forward must be performed on a device. For CPUs this doesn't matter, because every thread can forward the same model. But when a Model lives on one GPU card, we cannot forward it on another GPU card.

So Model's constructor could be:

```
class Model(object):
    # when device == -1, use the CPU
    # otherwise, use GPU(device)
    def __init__(self, topology, device=-1):
        pass
```

Contributor:

Adding a device attribute to the user-facing Model may confuse users. Not everyone understands "when a Model lives on one GPU card, we cannot forward it on another GPU card", so when a user constructs a Model(device=0) and it fails to run on device=1, it will be quite puzzling.

Contributor:

@reyoung Good comments! For the "add begin and end" part, I don't think begin is necessary, because the dependencies are fully specified once we say which end we want to compute.

@helinwang (Contributor), Feb 10, 2017:

On "is Evaluator a necessary concept for Paddle": I think we need some object for caching intermediate tensors.

From what I understand (I asked this question in yesterday's meeting), Paddle doesn't release intermediate tensors after a forward pass, even for inference only. I think that is reasonable, since malloc and free take time. If we don't release intermediate tensors but reuse them by overwriting, we save forward time (at the least, whether to release could be a tunable argument).

So if we don't release them, we need an object to hold that state (all the intermediate tensors). Evaluator could be such an object.
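A sketch of Evaluator as the owner of reusable activation buffers (all names are hypothetical; the overwrite-instead-of-free policy is the idea from this comment):

```
import numpy as np

# Sketch only: the Evaluator caches one buffer per layer and overwrites
# it on every forward pass, so tensors are allocated once, not per pass.
class Evaluator(object):
    def __init__(self, model):
        self.model = model
        self._buffers = {}  # layer name -> cached ndarray

    def _buffer(self, name, shape):
        buf = self._buffers.get(name)
        if buf is None or buf.shape != shape:
            buf = np.empty(shape)  # allocate only on first use or resize
            self._buffers[name] = buf
        return buf

    def forward(self, input_data):
        activations = dict(input_data)
        for layer in self.model.topological_order():
            out = self._buffer(layer.name, layer.output_shape(activations))
            layer.compute(activations, out=out)  # write into the cached buffer
            activations[layer.name] = out
        return activations
```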

Contributor:

I don't fully understand why "Model should be bound to a device (a GPU card, or all CPUs)". I think parameters should be placed by Paddle automatically: Paddle should try its best to place all parameters on the GPU. The only case where it fails to do so is when a layer doesn't have a GPU implementation; then that layer's parameters are allocated in CPU memory, its input tensor is copied from GPU to CPU, and its output tensor is copied from CPU back to GPU.

It would be the same in the multi-GPU case: the model is replicated n times, where n is the number of GPUs. So when initializing the Evaluator and the GradientMachine, the constructor parameters need to include the GPU device ids as a list; len(list) is how many model replicas we have.
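Under that reading, the constructors could take the device list directly (a hypothetical sketch, not the actual API):

```
devices = [0, 1, 2, 3]  # four model replicas, one per GPU
evaluator = paddle.Evaluator(model, devices=devices)
gradient_machine = paddle.GradientMachine(evaluator, devices=devices)
```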

@wangkuiyi (Collaborator, Author), Feb 10, 2017:

I also think the model should not specify a device. My view is that in modern distributed computing, which device executes an operation should be decided automatically by the scheduling system (the Linux kernel, the Kubernetes scheduler, the PaddlePaddle framework), not specified by the user. I believe that along this line of thought we can build a truly "easy to use" DL system, distinct from many of the systems produced by academia today.

@helinwang (Contributor), Feb 10, 2017:

I think the model could look like this:

```
model
|- topology
|- parameters
   |- parameter1
      |- parameter1-handle-on-GPU1
      |- parameter1-handle-on-GPU2
   |- parameter2 (this layer has only a CPU implementation)
      |- parameter2-handle-on-CPU
   |- ...
```

So there is no need to specify a device for the model; the Evaluator and the GradientMachine just tell the model where the memory for each parameter has been allocated.

@wangkuiyi (Collaborator, Author):

Following up on @helinwang's idea: I think it is fine for all of a model's parameters to live in a single GPU's memory (or, when there is no GPU, in a single machine's main memory), at least for the first phase of the project. A complicated design easily makes us lose control of the project; even if it brings performance gains, it is often not worth it.

Collaborator:

> I think it is fine for all of a model's parameters to live in a single GPU's memory (or, when there is no GPU, in a single machine's main memory).

Agreed.

And we don't need to dwell on that "aggressive" idea. If we have no major disagreement on the overall concepts, can we merge the design doc first and roll up our sleeves to write the interfaces?

```
- test(cost, ...)

GradientMachine = {Evaluator*, gradients}
- backward
```
@reyoung (Collaborator), Feb 10, 2017:

Maybe the gradient_machine's API could be:

```
# basically the same arguments as in Evaluator.forward
result = gradient_machine.forward_backward(..., on_parameter_backwarded=None)
```

The details are below (a usage sketch of the callback follows this list):

- on_parameter_backwarded is invoked as soon as a parameter's gradient has been calculated. This callback is very useful for network communication: we can send one parameter's gradient while the other gradients are still being computed.
- Combine forward and backward into one method, because they are not necessarily two separate stages, especially for multi-threaded computation:
  - A multi-threaded backward need not wait for all threads to finish forwarding; each thread can do its forward-backward individually. So combining forward and backward into one method gives us a chance to optimize multi-threaded training.
  - Not every forward pass can be followed by a backward pass. For backward we must record all activations, but for inference we only need the final output, and intermediate activations can be dropped to save memory.
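A sketch of how the callback could overlap communication with the remaining backward computation (parameter_server and its send method are hypothetical):

```
# Sketch only: send each gradient as soon as backward produces it,
# instead of waiting for the whole backward pass to finish.
def send_gradient(name, gradient):
    parameter_server.send(name, gradient)  # assumed asynchronous send

result = gradient_machine.forward_backward(
    input_data,
    cost=CrossEntropy(input_layer='predict', label=label_data),
    on_parameter_backwarded=send_gradient)
```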

Contributor:

Right. If what we provide is a gradient-computation interface that encapsulates multi-CPU/GPU/multi-machine execution, then a single forward_backward is more suitable here; it can hide the communication details.

Collaborator:

Also, in my most aggressive imagination (feel free to ignore this; I haven't thought carefully about the performance implications), this forward_backward might be implemented purely in Python.

For example, for a gradient_machine that uses 12 CPUs and 4 GPUs at the same time, the pseudocode could be:

```
class GradientMachine(object):
    def __init__(self, model):
        self.origin_model = model
        self.cpu_models = []
        self.gpu_models = []
        for i in xrange(12):
            self.cpu_models.append(paddle.model.shared_model(model))
        for i in xrange(4):
            self.gpu_models.append(paddle.model.copy_model(model, device=i))

    def forward_backward(self, data):
        models = self.cpu_models + self.gpu_models
        datas = data.split(len(models))
        result = paddle.thread_pool.map(
            lambda (model, batch): model.forward_backward(batch),
            zip(models, datas))
        gpu_gradient = paddle.gpu.merge_gradient(self.gpu_models)
        cpu_gradient = self.origin_model.gradient
        cpu_gradient += gpu_gradient
        result.gradient = cpu_gradient
        return result
```

@helinwang (Contributor), Feb 10, 2017:

Do you mean:

```
// thread
forward()
sync_all()
backward()
sync_all()
update_parameters()
```

and that this is why the forward_backward function is needed?

I don't think forward_backward is strictly necessary; backward can be called right after forward. For example:

```
// thread
forward()
backward()
sync_all()
update_parameters()
```

```
- backward

Optimizer = {GradientMachine*}
- train(cost, ...)
```
@reyoung (Collaborator), Feb 10, 2017:

```
def train_reader():
    yield {'pixel': pixels, 'label': labels}  # return a data batch

# The observe callback is used for plotting or logging the training process.
# The event parameter can be of various types; the intermediate training
# results are carried in the event instance.
def callback(event):
    if isinstance(event, FinishTrainOneBatch):
        print event.pass_id, event.batch_id, "Cost =", event.cost, "Error Rate =", event.metric[0]
        print "output layer's output is", event.activation['output']
        if event.batch_id % 1000 == 0:  # we could even save a checkpoint during the callback
            with open('check_point_%d' % event.batch_id, 'w') as stream:
                optimizer.check_point(stream)
    else:
        pass

optimizer.train(
    train_reader=train_reader,
    test_reader=None,  # the test reader shares the train reader's format; None if there is no test data
    cost=CrossEntropy(input=model.topology.output_layer,  # the network's output layer
                      label=DataReader("label")),  # the label comes from the data reader's 'label' field
    metric=[ErrorRateMetric(input=model.topology.output_layer,
                            label=DataReader("label"))],  # same logic as above
    observe_callback=callback)
```


```
Evaluator = {Model*, activations}
- forward
- test(cost, ...)
```
@wangkuiyi (Collaborator, Author):

@jacquesqiao raised a good question in #1297 (comment). What does everyone think?

@helinwang (Contributor), Feb 10, 2017:

What Longfei said, "should we add a concept of Cost; there are several kinds of cost, and different costs may compose into a new kind of Cost", makes a lot of sense. Here is my understanding:

Type-wise, I think a cost is just a layer. Any layer whose output is a scalar can serve as a cost, because when we do backprop we set that layer's output gradient to 1; this requires, and only requires, that this single layer's output be a scalar.

Usage-wise and conceptually, the cost layer can be placed outside the model:

```
forward(cost=cross_entropy(y, y_))  // or test(cross_entropy(y, y_))
backward(cross_entropy(y, y_))
```

or inside the model:

```
forward(cross_entropy_loss)  // no need for test(cross_entropy(y, y_)) at all, since everything is a layer
backward(cross_entropy_loss)
```

From the standpoint of model simplicity and extensibility (e.g., loading a topology and swapping in a different cost), I think the former is better.
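A tiny illustration of the scalar-output point (names are illustrative): backprop is seeded by setting the cost layer's output gradient to 1, which is only well-defined when that output is a scalar.

```
# Illustrative only: any layer whose output is a scalar can act as the
# cost, because backprop seeds d(cost)/d(cost) = 1.0 at that layer.
cost = cross_entropy(y, y_)                        # scalar-output layer
result = evaluator.forward(input_data, cost=cost)  # cost attached outside the model
gradient_machine.backward(cost)                    # seeds the output gradient with 1.0
```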

@wangkuiyi (Collaborator, Author):

I also understand a cost to be a layer.

Collaborator:

A cost is a Layer, or even more simply, it can be a function defined on the Python side.

At the same time, cost is currently an argument of forward. That is, this Layer is attached to the Model at forward time.

```
Model = {topology, parameters}

Evaluator = {Model*, activations}
- forward
```
@wangkuiyi (Collaborator, Author):

My take on the question @reyoung raised in https://github.com/PaddlePaddle/Paddle/pull/1297/files#r100475149:

During training, forward definitely has to keep the activations as a class member, because backward needs them.

Even when serving, we cannot assume that only the output layer's activations need to be kept; what if the user wants to visualize the output of some intermediate layer?

Collaborator:

1. result.activation keeps every layer's activations, so both backward and user-side visualization can use them.
2. Returning the activations also makes forward "stateless", i.e., a pure function. The benefit shows up in certain kinds of tasks (e.g., machine translation, which is covered in the first eight chapters): inference calls forward multiple times, and forward needs to start from the middle of the network (i.e., the network's activations need to be get/set). If the activations were stored in a data member, this process would become slightly more difficult.

@reyoung (Collaborator) commented on Feb 12, 2017:

One thing to make everyone aware of: neural network inference does not necessarily finish with a single call to forward. In many models, we adjust the forward results, i.e., intermediate nodes of the network, along the way. Take the machine translation example in PaddlePaddle/book:

[figure: the decoder RNN unrolled over time steps, with a softmax over the target vocabulary and an ArgMax node at each step feeding the next step's input]

Machine translation trains a generative model that uses a generative RNN as the decoder. This decoder repeatedly calls a softmax whose width is the target-language vocabulary size, producing a probability for each word in the target language. Training is greedy: at every step we take the highest-probability word in the target vocabulary as the input of the RNN's next time step, as in the figure above.

At inference time, however, we no longer just take the highest-probability value at each time step (there is no ArgMax node); instead we use a better tree-search strategy (a pruned, BFS-like search: beam search), because the greedy method is certainly not optimal.

So at the ArgMax node we branch into many paths, and within each path we only forward some of the nodes to the right of ArgMax.

This requirement means our forward function must be able to start forwarding from the middle of the network, and to forward only part of the network.
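A sketch of the beam-search loop this implies, reusing the begin/end arguments proposed earlier (every name here is an assumption, not the actual API):

```
from math import log

# Sketch only: each beam candidate carries its own saved activations, and
# we re-forward just the sub-network to the right of ArgMax at each step.
def beam_search(evaluator, initial_state, beam_width, max_steps):
    beam = [(0.0, initial_state)]  # (log probability, RNN activations)
    for _ in range(max_steps):
        candidates = []
        for score, state in beam:
            result = evaluator.forward(state, begin='decoder_step', end='softmax')
            for word, prob in result.activation('softmax').top_k(beam_width):
                candidates.append((score + log(prob), result.state_after(word)))
        # prune: keep only the best beam_width branches
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam
```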

@reyoung (Collaborator) left a comment:

I think that, as long as there is no large-scale revision of the concepts, the design doc can be merged first; we can then settle on the API and finish a demo.

@wangkuiyi (Collaborator, Author) commented on Feb 13, 2017:

Just to mark that this is an incomplete design. To make step-by-step progress, I am following @reyoung's suggestion and merging this PR. But there is more to be discussed, as in #1315; I am opening a new PR for that.

@wangkuiyi merged commit 0e6a11a into PaddlePaddle:develop on Feb 13, 2017.
@wangkuiyi deleted the design_doc_new_api branch on Feb 13, 2017.