
Merge with latest develop branch. Optimizer lib2 #2386

Merged

merged 35 commits into PaddlePaddle:develop on Jun 20, 2017

Conversation

dzhwinter
Contributor

It is an independent module which will be used by the pserver via cgo calls. To make it easier to update the optimizer library, I made this pull request, which resolves the conflict between the develop branch and #2190.

@helinwang helinwang mentioned this pull request Jun 5, 2017
add_library(optimizer STATIC ${OPITMIZER_SRCS})
add_dependencies(optimizer gen_proto_cpp)

add_simple_unittest(Tensor_test)
Contributor

Let's use all lowercase for filenames (tensor_test.cpp instead of Tensor_test.cpp). Some filesystems are not case sensitive.

Contributor Author

fix done

Contributor Author

fix done

return data_[idx];
}
// TODO: replace with tensorshape
size_t size() const { return this->width_; }
Contributor

Since height_ is already a member variable, I think you implemented a 2-D tensor, so this should be return this->width_ * this->height_;

Contributor Author

it should be. fix done

Contributor Author

it should be. fixed done.

};

// TODO(zhihong): design problem of dynamic datatype, need to fix it
typedef TensorT<real> Tensor;
Contributor

I do not know whether real means float or double. Doing a Google search does not answer my question. Maybe for clarity, we need to change all real in this PR to float? (since currently we are working with 32 bit floating point parameters.)

Contributor Author

real is a macro type used throughout the project: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/utils/Common.h#L30
It is fixed when the project is compiled. That is not a good choice here, since our element type comes from the caller's data and is only known at runtime. This needs to be fixed.
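For context, a compile-time alias like this is roughly what such a project-wide real type usually looks like (a sketch assuming a PADDLE_TYPE_DOUBLE-style build flag; the actual definition is in paddle/utils/Common.h):

// Sketch, not the actual Common.h contents: "real" is chosen when the
// project is built, so it cannot follow the caller's runtime data type.
#ifndef PADDLE_TYPE_DOUBLE
typedef float real;   // single-precision build (assumed default)
#else
typedef double real;  // double-precision build
#endif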

Contributor Author

@dzhwinter dzhwinter Jun 6, 2017

Let me explain this part more clearly.
In the last PR I implemented Tensor as a template class, but that is the wrong approach, because a template class gets instantiated for specific types at compile time (explaining the reason in a meeting would be better).
In fact, a tensor should represent a contiguous piece of memory and support dynamic types at runtime,
e.g. https://github.com/caffe2/caffe2/blob/master/caffe2/core/tensor.h#L643
much like an enhanced any type in the Boost library. We can merge this PR and enrich the tensor library in the future. At this moment, we only support compile-time data types here.
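To illustrate the "dynamic type at runtime" idea, here is a minimal sketch (not the PR's code) of a type-erased tensor that keeps an element-type tag next to a raw buffer instead of being instantiated per type at compile time:

#include <cstddef>
#include <cstdlib>

// Sketch only: a tensor whose element type is a runtime value, similar in
// spirit to the caffe2 tensor linked above.
enum class ElementType { kFloat32, kFloat64 };

class DynamicTensor {
public:
  DynamicTensor(ElementType type, size_t numel)
      : type_(type), numel_(numel),
        data_(std::malloc(numel * ElementSize(type))) {}
  ~DynamicTensor() { std::free(data_); }

  // The buffer is owned by this object, so copying is disabled to avoid
  // freeing the same memory twice.
  DynamicTensor(const DynamicTensor&) = delete;
  DynamicTensor& operator=(const DynamicTensor&) = delete;

  // The caller asks for a typed view; the tag can be checked here at runtime.
  template <typename T>
  T* data() { return static_cast<T*>(data_); }

  ElementType type() const { return type_; }
  size_t size() const { return numel_; }

private:
  static size_t ElementSize(ElementType t) {
    return t == ElementType::kFloat32 ? sizeof(float) : sizeof(double);
  }
  ElementType type_;
  size_t numel_;
  void* data_;
};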

Contributor

@helinwang helinwang Jun 6, 2017

Ok, that's fine. Can you use float instead of real? Because parameter is supposed to be a general data store, it should not depend on how real is defined in Paddle.

After this PR we need to work on another PR to add tensors of required data types.

MATCH_ENUM_TYPE(int32_t, PADDLE_ELEMENT_TYPE_INT32);
MATCH_ENUM_TYPE(uint32_t, PADDLE_ELEMENT_TYPE_UINT32);
MATCH_ENUM_TYPE(int64_t, PADDLE_ELEMENT_TYPE_INT64);
MATCH_ENUM_TYPE(uint64_t, PADDLE_ELEMENT_TYPE_UINT64);
// only below is implemented, we need to implement other types in a follow up PR.
MATCH_ENUM_TYPE(float, PADDLE_ELEMENT_TYPE_FLOAT32);
MATCH_ENUM_TYPE(double, PADDLE_ELEMENT_TYPE_FLOAT64);

Contributor Author

ok, fix done.

using namespace paddle::optimizer;

TEST(Tensor, indexer) {
real* ptr = new real[3];
Contributor

Need to release the ptr.

Contributor Author

fixed done.

namespace paddle {
namespace optimizer {

void AdadeltaOptimizer::set_weight(Tensor* p) {
Contributor

Why is the content of p not used in this function?

Contributor Author

fix done

accum_gradient = new Tensor(gptr, size);
}

void AdagradOptimizer::update(const Tensor* gradient) {
Contributor

We need to follow Google C++ code style for C++ function names: https://google.github.io/styleguide/cppguide.html#Function_Names

Contributor Author

fix done.

required string lr_policy = 11;
optional ConstLr const_lr = 12;
optional LinearLr linear_lr = 15;
optional uint64 num_sample_passed = 13 [default = 0];
Contributor

Why is this part of the optimizer config?

Contributor Author

should be removed, fix done.

Tensor *parameter_;

// learning rate policy
BaseLr *lr_policy;
Contributor

We should follow Google C++ style for class member variable naming: https://google.github.io/styleguide/cppguide.html#Variable_Comments

// lr_policy , type : string
// ConstLr = 0;
// LinearLr = 1;
required string lr_policy = 11;
Contributor

Maybe enum rather than string is more appropriate?

Contributor Author

same as above

// Adadelta = 2;
// Adagrad = 3;
// Adam = 4;
required string optimizer_name = 1;
Contributor

Maybe enum is better than string?

Contributor Author

rewrite it. done.

@@ -0,0 +1,49 @@
#ifndef PADDLE_OPTIMIZER_TENSOR_H_
Member

@jacquesqiao jacquesqiao Jun 6, 2017

PaddlePaddle uses #pragma once

Contributor Author

we will change that and follow the google c++ style.
https://google.github.io/styleguide/cppguide.html#The__define_Guard

Member

In a formal discussion, we decided to use #pragma once (https://github.com/PaddlePaddle/cpp-primer-digest). So we can first change the digest and then change our code style.

@@ -0,0 +1,36 @@
#ifndef PADDLE_OPTIMIZER_SERIALIZARION_H
Member

I feel it might be better not to put too many things in one PR?

Contributor Author

Agreed. But since this PR is almost finished, I added a simple serialization feature for saving checkpoints, which will be developed further in a separate PR.

namespace paddle {
namespace optimizer {

void AdadeltaOptimizer::set_weight(Tensor* p) {
Contributor

Need Google style function names. Please replace all C++ function names with CamelCase.

Contributor Author

https://google.github.io/styleguide/cppguide.html#Function_Names

Accessors and mutators (get and set functions) may be named like variables. These often correspond to actual member variables, but this is not required. For example, int count() and void set_count(int count).
Since this function modifies the parameter member of the optimizer, in my mind it is a mutator (set function), so the variable-like name is allowed?

Contributor

Thanks, sorry!

void AdadeltaOptimizer::set_weight(Tensor* p) {
parameter_ = p;
size_t size = p->size();
accum_gradient_ = new Tensor(size);
Contributor

should we free previous accum_gradient_, accum_delta_, update_delta_ if not NULL? Same for other set_weight functions.

Contributor Author

In my mind, an optimizer instance maps to one parameter and we never reuse the instance, even when restarting from a checkpoint. In that situation we should not need a pointer guard.

Contributor

@helinwang helinwang Jun 7, 2017

In my opinion, the contract should be the API, not an oral or documentation contract like "we never reuse the instance". An oral contract does not prevent a client from reusing the instance, and people could easily forget it. If we have a documentation contract, it could easily get out of date, or many people will not read the document.
A good way to prevent users from reusing the instance is through the API: remove set_weight and put weight initialization into the constructor.
If we provide the API, we need to make sure it is correct in all use cases, not just how we "intend" to use it; our intention could easily change over time. I would suggest we just remove the set_weight API.
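A rough sketch of that suggestion, assuming the Tensor, LrPolicy, and ParameterOptimizer types from this PR (illustrative only, not the final code):

// Sketch: the parameter is bound in the constructor, so a constructed
// optimizer is always tied to exactly one weight and set_weight disappears.
class AdadeltaOptimizer : public ParameterOptimizer {
public:
  AdadeltaOptimizer(Tensor* parameter, LrPolicy* lr,
                    double rho, double epsilon, double decay)
      : ParameterOptimizer(parameter, lr),
        rho_(rho), epsilon_(epsilon), decay_(decay),
        accum_gradient_(new Tensor(parameter->size())),
        accum_delta_(new Tensor(parameter->size())),
        update_delta_(new Tensor(parameter->size())) {}
  // Update(), SerializeState(), ... unchanged.

private:
  double rho_, epsilon_, decay_;
  Tensor *accum_gradient_, *accum_delta_, *update_delta_;
};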

Contributor Author

That's true in real life. Thanks a lot! fix it.

}

protected:
double learning_rate;
Contributor

Why is learning_rate private in LinearLr but protected here? Consider changing it to private.

Contributor Author

fix done.

const paddle_element_type data_type,
const void* grad_buffer,
int num_bytes) {
// TOOD(zhihong): datatype not work. need to add the runtime datatype
Contributor

Yes! let's fix it in a follow up PR.

Contributor Author

agreed.


int paddle_optimizer_get_state(paddle_optimizer* o, const char* state) {
state = o->impl->SerializeState();
return PADDLE_SUCCESS;
Contributor

need to return the length of state here.

Contributor Author

Fixed. By the way, I changed the get_weights interface too.

config.linear_lr().lr_decay_a(),
config.linear_lr().lr_decay_b());
// default
return nullptr;
Contributor

SGDOptimizer does not check whether lr_policy_ is nullptr. Actually, if we return new ConstLr(0.001); here, that problem goes away.
Also, we need to print a warning log saying that no learning rate policy was found and the default learning rate policy is used. Otherwise, if users think they have set a learning rate policy but actually have not, there is no way to debug it.

Contributor Author

👍,fix done.

lr);
}
// default
return new SGDOptimizer(config.sgd().momentum(),
Contributor

When the code reaches here, config.optimizer() != OptimizerConfig::SGD, so config.sgd().momentum(), config.sgd().decay(), config.sgd().nesterov() are just unused values. Wouldn't

return new SGDOptimizer(0, 0, false);

be more reasonable?

Contributor Author

proto2 supports defaults, so the default values are all written in the protobuf. Fix done; later they will all be moved into the optimizer constructors.

config.adadelta().decay(),
lr);
}
// default
Contributor

Please print a warning log when the user has not selected any optimizer.

Contributor Author

fix done.

virtual void set_weight(Tensor *parameter);

protected:
OptimizerConfig config_;
Contributor

Is this still used?

Contributor Author

fix done.

return parameter_->get_buffer();
}

void ParameterOptimizer::set_weight(Tensor *p) { parameter_ = p; }
Contributor

Need to free the previous parameter_ if it is not NULL.
Btw, removing this function and putting it in the constructor simplifies things; child classes no longer need to implement it. More code = potentially more bugs. I am fine either way.

Contributor Author

Fix done. The C API interfaces have been merged into one.

@@ -0,0 +1,20 @@
#include <iostream>
Member

@jacquesqiao jacquesqiao Jun 7, 2017

maybe we should add Copyright message to these new files?

Contributor Author

The copyright declaration can be added when we merge these modules together. I noticed that the Go module files all need that header section added as well.

@@ -0,0 +1,35 @@
#ifndef PADDLE_SGD_OPTIMIZER_H_
Member

#pragma once

Contributor Author

fix done.

// match old training state with format parser
required string version = 100;
repeated TensorProto data = 1;
repeated double hyperparam = 3;
Member

maybe we should follow a uniform naming style. hyperparam ==> hyper_param?

Contributor Author

Hyperparameter is one word, not two, so it is named hyperparam for short.

Member

get it~~

optional LinearLr linear_lr = 13;

// common config of optimizer
optional double clipnorm = 101;
Member

Would clip_norm and clip_value be better?

Member

Maybe adding some comments about the usage of these two members would be better.

Contributor Author

ok, change the name style.

@@ -0,0 +1,113 @@
syntax = "proto2";

option optimize_for = LITE_RUNTIME;
Member

Actually I do not quite understand what option optimize_for = LITE_RUNTIME is used for. Could you explain it, including where the name LITE_RUNTIME comes from?

Contributor Author

protobuf provides the option setting, which can be used as an optimization. LITE_RUNTIME is the lightweight version: you include a different header, which effectively swaps in another set of implementations, and this lite version's serialization interface differs by a few methods. See here for details: https://developers.google.com/protocol-buffers/docs/proto

Member

Got it, so it is a protobuf feature. Thanks for the explanation~~

T* data_;
};

// TODO(zhihong): design problem of dynamic datatype, need to fix it
Contributor

It seems that when porting "majel" to PaddlePaddle, we already included boost/variant.hpp for the "single value multiple type" container.

Contributor Author

Agreed, 👍. We can either wait for the majel porting work to finish, or implement another one with typeid reflection. It is a follow-up question.

Contributor

I see.

@@ -0,0 +1,108 @@
#include "parameter_optimizer.h"
Contributor

Shall we put unit-test files in a "test" folder?

Contributor Author

I remember WangYi's suggestion that tests should be named XXX_test.cpp and put in the same folder, but I can't find the link....

Member

@jacquesqiao jacquesqiao left a comment

LGTM~


int paddle_optimizer_get_state(paddle_optimizer* o, const char** state) {
*state = o->impl->SerializeState();
return strlen(*state);
Contributor

strlen does not work here. The state is a binary byte array, not a C string; it could have 0 bytes in the body.
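A minimal sketch of the fix being asked for, assuming the SerializeState(int* state_len) signature that appears later in this thread:

// Sketch: the serialized state is binary, so its length is returned
// explicitly instead of being computed with strlen.
int paddle_optimizer_get_state(paddle_optimizer* o, const char** state) {
  int state_len = 0;
  *state = o->impl->SerializeState(&state_len);
  return state_len;
}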

Contributor Author

Er, it is indeed a byte vector. I wrote it this way because cstring is used below, so I did not accumulate the length.
https://github.com/PaddlePaddle/Paddle/pull/2386/files#diff-14071fab2b21853a7bc47f69021ba1f1R46
Fix done with the accumulated length.

virtual ~ParameterOptimizer() { delete parameter_; };

static ParameterOptimizer *Create(const std::string &config_proto);
virtual const char *SerializeState();
Contributor

Should we make SerializeState and DeserializeState pure virtual? If a derived class doesn't have any state, it can provide an empty implementation. Otherwise a derived class may simply forget to provide an implementation.
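What the pure-virtual version would look like (a sketch using the signatures seen elsewhere in this PR):

#include <string>

// Sketch: derived optimizers are forced to implement both methods,
// even if the implementation is empty for stateless optimizers.
class ParameterOptimizer {
public:
  virtual ~ParameterOptimizer() {}
  virtual const char *SerializeState(int *state_len) = 0;
  virtual void DeserializeState(const std::string &state) = 0;
};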

Contributor Author

good point! fix done.

<< "expected : " << kOptimizerVersion << "get : " << state.version();

ProtoToTensor(state.data(0), parameter_);
if (state.data_size() == 2) {
Contributor

@helinwang helinwang Jun 7, 2017

I think the point of introducing proto is that we do not have to define the serialization and deserialization logic ourselves. Here we use proto but still define part of that logic by hand; hand-writing code that could be generated introduces more chances for bugs.
For example:
There are only two TensorProtos here, parameter_ and momentums_, and this line (if (state.data_size() == 2) {) says that if the second TensorProto exists, parse it as momentums. But suppose there were three TensorProtos: parameter_, momentums_, foo. How would you express that parameter_ and foo are present but momentums_ is not?

Suggestion: delete OptimizerState and add:

message SGDOptimizerState {
  optional TensorProto parameter_ = 0;
  optional TensorProto momentums_ = 1;
  optional double momentum = 2;
  optional double decay = 3;
  optional bool nesterov = 4;
}

Contributor Author

The reason for using a repeated type was that I planned to use a template trick for addField/getField to get archive/unfolding behaviour, so I would not have to define every field again as optional. After trying it, I found the implementation is not easy and ended up with a rather odd version... I originally planned to do this in a later PR.
Writing it the way you suggest is indeed simpler. Fix done.


message OptimizerState {
// match old training state with format parser
required string version = 100;
Contributor

The protobuf serialization format already supports forward and backward compatibility, so defining another version field is a bit strange. If we define a version, how do we maintain multiple versions later? If the maintenance rules are the same as proto's default rules, why not let protobuf handle forward and backward compatibility?

Contributor Author

I was considering the case where an optimizer does or does not have features like gradient clipping: even though the parsing format is the same, the state cannot be reused, so I added version. But it can indeed be removed.
Fix done.

}
required DataType data_type = 1;
repeated bytes content = 2;
optional uint64 size = 3;
Contributor

content and size seem to be redundant?

Contributor Author

yep, fix done.


message ConstLr {
// learninRate Policy
required double learning_rate = 1 [default = 1.0];
Contributor

@helinwang helinwang Jun 7, 2017

There is a lot of debate about whether to use required in proto; proto3 even removed required. For our use case either should be fine. Here is some background: https://capnproto.org/faq.html#how-do-i-make-a-field-required-like-in-protocol-buffers

Contributor Author

Great material, I learned something! Seen this way, it is indeed better to remove required.
Fix done

TensorT(size_t size) : height_(1), width_(size) { data_ = new T[size]; }
TensorT(T* data, size_t size) : height_(1), width_(size), data_(data) {}
TensorT(T* data, size_t h, size_t w) : height_(h), width_(w), data_(data_) {}
TensorT(const TensorT& t)
Contributor

Can you explain to me what ":" does here? Sorry, I am not too familiar with it and don't know what keyword to search for.

Contributor Author

This is a constructor initializer list in C++, which is the idiomatic way to initialize members in C++.

Please check here for details: http://en.cppreference.com/w/cpp/language/direct_initialization

Contributor Author

The ":" after the parentheses is C++'s initialization mechanism; it is not the same concept as the constructor body. The relationship between the initializer list and the constructor body is analogous to Python's __new__ and __init__: the initializer list runs before the constructor body (the part inside the braces).
It is called the constructor initializer list.
1. The initializer list is executed before any function body runs.
2. Members are initialized in the order they are declared, not in the order they appear in the list.
There are also restrictions for non-POD members:
https://stackoverflow.com/questions/5816218/difference-between-initializer-and-default-initializer-list-in-c
https://stackoverflow.com/questions/9903248/initializing-fields-in-constructor-initializer-list-vs-constructor-body

In general it is recommended to initialize all non-static members this way.
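A small side-by-side illustration (not from the PR) of the initializer list versus assignment in the constructor body:

#include <cstddef>

template <class T>
class TensorT {
public:
  // Initializer list: members are initialized directly, before the body runs.
  TensorT(T* data, size_t size) : height_(1), width_(size), data_(data) {}

  // Assignment in the body: members are default-initialized first and then
  // assigned; legal, but less idiomatic (memory management omitted here).
  explicit TensorT(size_t size) {
    height_ = 1;
    width_ = size;
    data_ = new T[size];
  }

private:
  size_t height_;
  size_t width_;
  T* data_;
};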

TensorT(T* data, size_t size) : height_(1), width_(size), data_(data) {}
TensorT(T* data, size_t h, size_t w) : height_(h), width_(w), data_(data_) {}
TensorT(const TensorT& t)
: TensorT(1, t.size(), 0, t.get_buffer(), false, false) {}
Contributor

I guess this is a copy constructor: it creates a new tensor copied from the old tensor, and they share the same buffer.
It seems that when the two tensors get destroyed, they will try to free the same buffer twice.
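A hypothetical usage showing the concern (assuming ~TensorT() frees data_; this snippet is not from the PR):

// Both tensors end up pointing at the same heap buffer.
void IllustrateDoubleFree() {
  TensorT<float> a(16);  // allocates its own buffer
  TensorT<float> b(a);   // copy constructor shares a's buffer
}  // if each destructor frees data_, the shared buffer is freed twice here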

Contributor Author

👍 fix done.

state.set_momentum(rho_);
state.set_decay(decay_);
// can be used when memory alignment to system
*state_len += CalStateSize(parameter_,
Contributor

Why += but not =? Also, the comment above could be more descriptive :)

Contributor Author

It can be =. I thought we might add other options, and += is a more general way to represent an accumulation relationship.

Contributor

@helinwang helinwang Jun 12, 2017

I see, that's a good point. However, if I were the user and saw the function signature const char* AdadeltaOptimizer::SerializeState(int* state_len), I would think state_len means the length of the state returned by this function, not including any other length.
Besides, people may use it like this:

int state_len;
foo->SerializeState(&state_len);

The result is likely to be wrong, since state_len is not initialized.

Contributor Author

👍! That's right, it is a potential bug here. Thanks!

}
}

static void TensorToProto(const Tensor& tensor, TensorProto* proto) {
Contributor

Can you add unit test for TensorToProto and ProtoToTensor?

Contributor Author

will add unit test, to be fixed.

repeated bytes content = 2;
}

message OptimizerState {
Contributor

Can we create one specific OptimizerState for each Optimizer? Like AdamOptimizerState, ...

Contributor Author

Same answer as above. It is easy to implement and understand, but I think there is a more proper way to do it.

return state.SerializeAsString().c_str();
}

void AdadeltaOptimizer::DeSerializeState(const std::string& str) {
Contributor

DeSerializeState -> DeserializeState, deserialize is a single word.

Contributor Author

..... my poor English. fix done.

OptimizerState state;
state.ParseFromString(str);
lr_policy_->set(state.learning_rate());
num_sample_passed_ = state.num_sample_passed();
Contributor

@helinwang helinwang Jun 10, 2017

epsilon_, rho_, and decay_ are missing here.
Maybe to avoid potential errors like this, we could do the following (just a suggestion):

class AdadeltaOptimizer {
private:
  AdadeltaOptimizerState state_;
}

AdadeltaOptimizer::Update(...) {
  // directly use state_.learningRate
}

// Clear ProtoTensors when not used to save memory.
AdadeltaOptimizer::ClearStateTensors() {
  state_.parameter().Clear();
  state_.accum_gradient().Clear();
  state_.accum_delta().Clear();
  state_.update_delta().Clear();
}

AdadeltaOptimizer::SerializeState() {
  TensorToProto(*parameter_, state_.mutable_parameter());
  TensorToProto(*accum_gradient_, state_.mutable_accum_gradient());
  TensorToProto(*accum_delta_, state_.mutable_accum_delta());
  TensorToProto(*update_delta_, state_.mutable_update_delta());
  auto r = state_.SerializeAsString().c_str();
  ClearStateTensors();
  return r;
}

AdadeltaOptimizer::DeserializeState() {
  state_.ParseFromString(str);
  ProtoToTensor(state_.parameter(), parameter_);
  ProtoToTensor(state_.accum_gradient(), accum_gradient_);
  ProtoToTensor(state_.accum_delta(), accum_delta_);
  ProtoToTensor(state_.update_delta(), update_delta_);
  ClearStateTensors();
}

Contributor Author

Firstly, epsilon_, rho_, decay_ are not missing out of carelessness. Consider that we have to call the create-with-config function with a config that already contains these three parameters, and they are constant during the training process. Do we really need to save them, given that we already have to save the config proto? If we want the training state to be self-contained, then these constant hyperparameters should be there.

Secondly, that format is obviously easier to read, but it needs a pair of config/state messages for every optimizer, which leads to copy/pasting the config and Serialize/Deserialize code many times; we should find a more elegant and idiomatic way to do that. The idea is the same as in https://github.com/caffe2/caffe2/blob/master/caffe2/core/qtensor_serialization.h

Contributor

@helinwang helinwang Jun 12, 2017

  • I see, they are constants initialized by the constructor, and the constructor arguments are already saved in OptimizerConfig. My bad. Sorry!

  • "which leads to copy/pasting the config and Serialize/Deserialize code many times"

    I understand the config protos could contain duplicated fields (e.g., probably every config proto has a learning rate field). I would argue that having one config proto per optimizer makes it easy to understand a specific optimizer's arguments. Please take a look at this example: we are currently using a single config proto, LayerConfig, for all layers. My opinion is that it is too hard for anyone to read and understand what parameters each layer needs.

    You mentioned the "Serialize/Deserialize code" needs to be copy/pasted many times. I think with the approach of having one config for all optimizers, we still need to write the Serialize/Deserialize code for each class?

Contributor Author

I see.
The config proto, e.g. SGDConfig, only contains the constant hyperparameters, and the state proto only contains the rest; combining them creates an optimizer with its state. Keeping each optimizer's state in its own message also makes it easier to read. I thought this would be better than adding a state member to each optimizer class.
Fix done.

config.mutable_const_lr()->set_learning_rate(0.1);

ParameterOptimizer* opt =
ParameterOptimizer::Create(config.SerializeAsString());
Contributor

The test case does not compile; it needs to be updated.

Contributor Author

fix done.


const char* AdadeltaOptimizer::SerializeState(int* state_len) {
AdadeltaOptimizerState state;
state.set_learning_rate(lr_policy_->LearningRate(num_sample_passed_));
Contributor

It seems unnecessary to store learning_rate here; learning_rate is the policy's state. We can put in a TODO for storing the policy state and add it later.

Contributor Author

that's fine.


TensorToProto(*parameter_, state.mutable_parameter());
TensorToProto(*accum_gradient_, state.mutable_accum_gradient());
*state_len = CalStateSize(parameter_, accum_gradient_);
Contributor

It seems we could use this here:

auto str = state.SerializeAsString();
*state_len = str.length();
return str.c_str();

Contributor Author

There is a small issue here: SerializeAsString returns a basic_string.

auto str = state.SerializeAsString();
auto &str = state.SerializeAsString(); // wrong, cannot go this way

The first form performs a copy construction of a large buffer, which has a cost.

Still, your version is more readable, so I changed it.
fix done.

Contributor

@helinwang helinwang Jun 15, 2017

Let's always choose code clarity over code performance. And only optimize performance when necessary.

void AdagradOptimizer::DeserializeState(const std::string& str) {
AdagradOptimizerState state;
state.ParseFromString(str);
lr_policy_->set(state.learning_rate());
Contributor

Later we need to let lr_policy serialize its own state and call lr_policy_->DeserializeState(lr_state) here; we can put a TODO here for that.
lr_policy should take the serialized state and deserialize it itself, rather than the optimizer setting values on lr_policy (the optimizer does not know what state lr_policy has). I suggest deleting this line.
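A sketch of that direction (hypothetical interface, not in this PR): the policy owns its own (de)serialization and the optimizer just forwards the bytes.

#include <cstdint>
#include <string>

class LrPolicy {
public:
  virtual ~LrPolicy() {}
  virtual double LearningRate(uint64_t num_sample_passed) = 0;
  // Suggested TODO: the policy serializes and restores its own state.
  virtual std::string SerializeState() = 0;
  virtual void DeserializeState(const std::string &state) = 0;
};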

Contributor Author

agree. fix done.

double decay)
: ParameterOptimizer(parameter, lr), epsilon_(epsilon), decay_(decay) {
size_t size = parameter->size();
if (accum_gradient_) delete accum_gradient_;
Contributor

This is the constructor; there is no need to check accum_gradient_ here.

Contributor Author

fix done.

int num_bytes) {
// TOOD(zhihong): datatype not work. need to add the runtime datatype
auto grad_type = reinterpret_cast<const float*>(grad_buffer);
Tensor* gradient = new Tensor(const_cast<float*>(grad_type), num_bytes);
Contributor

The incoming buffer is a const void*, which means this function must not modify the buffer and must not free it. Here a cast removes the const qualifier, and then the Tensor takes ownership of the buffer and will free it in its destructor. We should not free a buffer that we do not own.

Contributor Author

This was done here to avoid copying memory, but it does indeed have that problem. Changed to use shared_ptr.
fix done

Contributor

Got it; not letting the Tensor own a pointer passed in directly does solve the problem.

*/
ParameterOptimizer(Tensor *parameter, LrPolicy *lr)
: parameter_(parameter), lr_policy_(lr), num_sample_passed_(0) {}
virtual ~ParameterOptimizer() { delete parameter_; };
Contributor

It seems we also need to delete lr_policy_;

Contributor Author

fix done.

if (momentum_ != 0.0) {
size_t size = parameter->size();
// TODO: fix it with align aware allocator bind to Tensor
if (momentums_) delete momentums_;
Contributor

We should not delete momentums_ in the constructor; at that point momentums_ has not been initialized yet and could hold any value.

Contributor Author

fix done.

public:
SGDOptimizer(Tensor* parameter, LrPolicy* lr, double m, double d, bool n)
: ParameterOptimizer(parameter, lr),
momentums_(nullptr),
Contributor

Could this be written as momentums_(new Tensor(size))?

Contributor Author

Here we need to check whether momentum_ is 0.0 before creating the buffer. The other places have been rewritten.


TensorT(CpuMemHandlePtr handle, size_t size)
: height_(1),
width_(size),
data_(reinterpret_cast<T*>(handle->getBuf())) {}
Contributor

Here std::make_shared<CpuMemoryHandle>(size * sizeof(float)) is not referenced by anything, so the buffer is freed immediately after the statement finishes executing.
I thought about whether it could be written like this, for reference:

template <class T>
class TensorT {
public:
  TensorT(size_t size)
      : TensorT(std::shared_ptr<T>(new T[size], std::default_delete<T[]>()), size) {}

  TensorT(std::shared_ptr<T> data, size_t size)
      : height_(1), width_(size), data_ptr_(data), data_(nullptr) {}

  TensorT(T* data, size_t size) : TensorT(data, 1, size) {}

  TensorT(T* data, size_t h, size_t w) : height_(h), width_(w), data_(data) {}

  virtual ~TensorT() {}

  T* get_buffer() {
    auto data = data_;
    if (data == nullptr) {
      data = data_ptr_.get();
    }
    return data;
  }

  T& operator[](const size_t idx) {
    auto data = get_buffer();
    CHECK(idx >= 0 && idx < this->width_) << "out of index range";
    return data[idx];
  }
  T& operator[](const size_t idx) const {
    auto data = data_ != nullptr ? data_ : data_ptr_.get();
    CHECK(idx >= 0 && idx < this->width_) << "out of index range";
    return data[idx];
  }
  // TODO: replace with tensorshape
  size_t size() const { return this->width_ * this->height_; }

protected:
  size_t height_;
  size_t width_;
  std::shared_ptr<T> data_ptr_;  // managed data
  T* data_;                      // unmanaged data
};

Contributor Author

That's right, I made a mistake here: the Tensor has to keep the shared_ptr to extend the buffer's lifetime.
Fixed it; data_ is initialized from data_ptr_.

Contributor

@helinwang helinwang left a comment

LGTM, thanks for being so patient and modifying the PR so many times!

@dzhwinter dzhwinter merged commit 9b755c0 into PaddlePaddle:develop Jun 20, 2017
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021