
[In Progress] Fix bug: enable sparse weight setting in trainer_config_helper APIs #985

Closed

Conversation

backyes (Contributor) commented Dec 21, 2016

fix #948

Why this PR was opened:

  • The main goal is to use this PR and the linked issue to sort out the sparse-related questions and write them down, so that other developers and users can track them.
  • At the same time, it attempts to fix some sparse weight BUGs (as distinct from the sparsity strategy of sparse update):

BUG related:

  • The new APIs do not yet support configuring sparse weight training.

To complete sparse support in the new APIs, the following questions need further analysis:

  • How should a reasonable default value of nnz be determined, and on what basis?
+Layer(
+    name = "layer1_5",
+    type = "fc",
+    size = 3,
+    active_type = "tanh",
+    inputs = Input("input",
+              learning_rate=0.01,
+              momentum=0.9,
+              decay_rate=0.05,
+              initial_mean=0.0,
+              initial_std=0.01,
+              format = "csc",
+              nnz = 4)
+)

Judging from the old icode git history (commit af92dcde6afc4454354089e47870c7ef38dfeda3), is that where the nnz=4 setting above came from?
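
For context, nnz here presumably declares the number of non-zero entries reserved for the sparse weight matrix up front. A minimal scipy sketch of the concept (the 5x3 shape and the values are invented for illustration and are unrelated to that commit):

import numpy as np
from scipy import sparse

# A hypothetical 5x3 FC weight holding exactly 4 non-zero entries,
# mirroring the nnz=4 declared in the config above.
dense = np.zeros((5, 3))
dense[0, 1] = 0.3
dense[2, 0] = -0.1
dense[3, 2] = 0.7
dense[4, 1] = 0.05

w = sparse.csc_matrix(dense)
print(w.nnz)     # 4 -- the count the config must declare in advance
print(w.format)  # 'csc'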

  • Can the csr sparse format be used by default? In theory the sparse storage of a parameter weight is an internal format that has nothing to do with the data source, so exporting the format parameter of the old API to users is not recommended. (There should be no performance difference between csr and csc in computation? @reyoung could comment on this point.)
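
To illustrate why format arguably need not be user-facing, a small scipy sketch (scipy stands in for Paddle's internal storage purely for illustration): csr and csc hold the same matrix and give the same results; only the compressed index layout differs.

import numpy as np
from scipy import sparse

m = np.array([[0., 2., 0.],
              [1., 0., 0.],
              [0., 0., 3.]])

m_csr = sparse.csr_matrix(m)  # compressed rows: efficient row slicing
m_csc = sparse.csc_matrix(m)  # compressed columns: efficient column slicing

# Identical contents and identical matrix-vector products.
assert np.allclose(m_csr.toarray(), m_csc.toarray())
x = np.ones(3)
assert np.allclose(m_csr.dot(x), m_csc.dot(x))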

  • In the old APIs, only FCLayer and SelectiveFCLayer support the sparse weight configuration; no other layer does. So, should this parameter exist as a general parameter attribute, or should it be implemented as a layer-specific property? (One of the design goals of the new APIs is to simplify the interface users have to understand, so we should try to follow that principle here.)

Besides fixing the API problem, there are some further questions:

  • In theory, marking a parameter weight as sparse generally cannot further reduce forward/backward computation time during training. Once the data is configured as sparse, the forward pass should already be computed sparsely (whether this also holds for backward still needs confirmation?); with sparse update and sparse remote update enabled on top of that, additionally marking the parameter weight as sparse should bring no performance gain?

  • If the point of marking a parameter weight as sparse is to produce a sparse model for inference, then storing the parameters sparsely only at save-model time should be enough; there is no need for this sparsification during the training computation itself?

So what is the value of this feature? (A sketch of the save-time-only alternative follows.)
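
If a sparse model file is all that is wanted, the save-time-only alternative could look roughly like the following numpy/scipy sketch (the helper name, pruning threshold, and npz format are assumptions for illustration, not Paddle's actual save path):

import numpy as np
from scipy import sparse

def save_sparse_weight(weight, path, threshold=1e-4):
    # Train with a dense weight; sparsify only the saved artifact.
    pruned = np.where(np.abs(weight) > threshold, weight, 0.0)
    sparse.save_npz(path, sparse.csr_matrix(pruned))

w = np.random.randn(5, 3) * 0.01  # stand-in for a trained dense weight
save_sparse_weight(w, "fc_weight")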

backyes (Contributor, Author) commented Dec 21, 2016

FYI @reyoung @qingqing01 @lcy-seso

backyes (Contributor, Author) commented Dec 22, 2016

Potential BUG update (SHA1: 28c5010):

  • If sparse_update and a sparse weight are configured at the same time, and sparse training data is fed in, the following is reported:
void GpuMatrix::mul(const Matrix& a,
                    const Matrix& b,
                    real scaleAB,
                    real scaleT) {
  // Probe the runtime type of each operand: dense (GpuMatrix) or sparse.
  const auto a_ptr = dynamic_cast<const GpuMatrix*>(&a);
  const auto b_ptr = dynamic_cast<const GpuMatrix*>(&b);
  const auto a_ptr_s = dynamic_cast<const GpuSparseMatrix*>(&a);
  const auto b_ptr_s = dynamic_cast<const GpuSparseMatrix*>(&b);

  if (a_ptr && b_ptr) {
    // dense x dense
    mul(*a_ptr, *b_ptr, scaleAB, scaleT);
  } else if (a_ptr_s && b_ptr) {
    // sparse x dense
    mul(*a_ptr_s, *b_ptr, scaleAB, scaleT);
  } else if (a_ptr && b_ptr_s) {
    // dense x sparse
    mul(*a_ptr, *b_ptr_s, scaleAB, scaleT);
  } else {
    // sparse x sparse (sparse input data times sparse weight) lands here.
    LOG(FATAL) << "Not supported";
  }
}

Execution falls into the last branch: when both operands are sparse, no dispatch case matches, so LOG(FATAL) aborts.
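
In scipy terms the failing combination is a sparse-by-sparse product, which needs a dedicated branch of its own (this sketch is only an analogy and does not touch Paddle's GPU kernels):

from scipy import sparse

a = sparse.random(4, 5, density=0.2, format="csr")  # sparse input batch
b = sparse.random(5, 3, density=0.3, format="csc")  # sparse weight
# scipy dispatches sparse-by-sparse to its own kernel; GpuMatrix::mul above
# has no such branch, so this operand combination hits LOG(FATAL).
c = a.dot(b)
print(type(c), c.shape)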

  • If sparse training data and a sparse weight are enabled, sparse_updater is disabled, and trainer_count=1, the system crashes here:
27	  return momentum_;
(gdb) bt
#0  0x0000000000dfacaa in paddle::ParameterConfig::momentum (this=0x0)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/build/proto/ParameterConfig.pb.h:727
#1  0x0000000000e0ed86 in paddle::SparseMomentumParameterOptimizer::init (this=0x22403d0, numRows=6,
    config=0x0)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/parameter/FirstOrderOptimizer.cpp:44
#2  0x0000000000c3eca3 in paddle::SgdLocalUpdater::init (this=0x2240330, parameters=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/ParameterUpdater.h:69
#3  0x0000000000c35956 in paddle::Trainer::init (this=0x7fffffffd650, config=..., testing=false,
    gradientMachine=..., dataProvider=..., testDataProvider=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/Trainer.cpp:245
#4  0x0000000000a4ea26 in main (argc=11, argv=0x7fffffffda18)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/TrainerMain.cpp:96
(gdb) f 2
#2  0x0000000000c3eca3 in paddle::SgdLocalUpdater::init (this=0x2240330, parameters=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/ParameterUpdater.h:69
69	    optimizer_->init(parameters_.size(), nullptr);
(gdb)

The reason is that trainer_count=1 (more precisely, disabling the sparse updater) selects the local updater, which supports only global optimization strategies, not per-parameter ones, so SparseMomentumParameterOptimizer::init receives a null per-parameter config. (The problem is similar to the L1 regularization issue.)

This also shows, indirectly, that sparse momentum cannot coexist with SgdLocalUpdater.

  • If sparse training data and a sparse weight are enabled, sparse_updater is disabled, and trainer_count > 1, the system crashes:
INFO 2016-12-22 23:14:12,307 networks.py:1472] The output order is [__cost_0__]
I1222 23:14:12.309047  6331 Trainer.cpp:176] trainer mode: Normal
*** Aborted at 1482419652 (unix time) try "date -d @1482419652" if you are using GNU date ***
PC: @           0xab88a4 paddle::VectorT<>::getSize()
*** SIGSEGV (@0x30) received by PID 6331 (TID 0x7f9f1dfd9780) from PID 48; stack trace: ***
    @     0x7f9f1dbb3160 (unknown)
    @           0xab88a4 paddle::VectorT<>::getSize()
    @           0xdf6e68 paddle::Parameter::setMat()
    @           0xbc5f54 paddle::Parameter::enableType()
    @           0xbd38e9 paddle::parameterInitNN()
    @           0xbcf57a _ZNSt5_BindIFPFviPN6paddle9ParameterEPSt6vectorISt10shared_ptrIS1_ESaIS5_EEESt12_PlaceholderILi1EESB_ILi2EES8_EE6__callIvJOiOS2_EJLm0ELm1ELm2EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
    @           0xbcd582 _ZNSt5_BindIFPFviPN6paddle9ParameterEPSt6vectorISt10shared_ptrIS1_ESaIS5_EEESt12_PlaceholderILi1EESB_ILi2EES8_EEclIJiS2_EvEET0_DpOT_
    @           0xbcac24 std::_Function_handler<>::_M_invoke()
    @           0xbd7d32 std::function<>::operator()()
    @           0xbd412b paddle::NeuralNetwork::init()
    @           0xbbef84 paddle::TrainerThread::TrainerThread()
    @           0xbbcf25 paddle::MultiGradientMachine::MultiGradientMachine()
    @           0xbe18bf paddle::GradientMachine::create()
    @           0xc3b80c paddle::TrainerInternal::init()
    @           0xc3526b paddle::Trainer::init()
    @           0xa4ea26 main
    @     0x7f9f1c9d8bd5 __libc_start_main
    @           0xa4df29 (unknown)
local.sh: line 15:  6331 Segmentation fault      (core dumped) PYTHONPATH=./:../../../python ../../../../build/paddle/trainer/paddle_trainer --use_gpu=0 --config=./sparse_trainer_config.py --saving_period=1 --test_period=0 --num_passes=4 --dot_period=2 --log_period=20 --trainer_count=2 --saving_period_by_batches=5000 --local=1

The crash occurs during multi-GPU gradient machine initialization.

  • If sparse training data, a sparse weight, and sparse_updater are all enabled, and trainer_count > 1, the system crashes:
I1222 23:16:26.934692 18161 Trainer.cpp:125] ignore sparse_remote_update=true due to  --local=true
I1222 23:16:26.934728 18161 Trainer.cpp:173] trainer mode: SgdSparseCpuTraining
F1222 23:16:26.973275 18161 Parameter.cpp:219] Check failed: height * width == bufs_[pType]->getSize() (290916736 vs. 4)
*** Check failure stack trace: ***
    @           0xf2aba4  google::LogMessage::Fail()
    @           0xf2aafc  google::LogMessage::SendToLog()
    @           0xf2a591  google::LogMessage::Flush()
    @           0xf2d352  google::LogMessageFatal::~LogMessageFatal()
    @           0xdf781d  paddle::Parameter::setMat()
    @           0xbc5f54  paddle::Parameter::enableType()
    @           0xbbc6fe  _ZZN6paddle20MultiGradientMachineC1ERKNS_11ModelConfigEbENKUliPNS_9ParameterEE_clEiS5_
    @           0xbc1ef1  _ZNSt17_Function_handlerIFviPN6paddle9ParameterEEZNS0_20MultiGradientMachineC1ERKNS0_11ModelConfigEbEUliS2_E_E9_M_invokeERKSt9_Any_dataiS2_
    @           0xbd7d32  std::function<>::operator()()
    @           0xbd412b  paddle::NeuralNetwork::init()
    @           0xbbca86  paddle::MultiGradientMachine::MultiGradientMachine()
    @           0xbe18bf  paddle::GradientMachine::create()
    @           0xc3b80c  paddle::TrainerInternal::init()
    @           0xc3526b  paddle::Trainer::init()
    @           0xa4ea26  main
    @     0x7ff5bd545bd5  __libc_start_main
    @           0xa4df29  (unknown)
local.sh: line 15: 18161 Aborted                 (core dumped) PYTHONPATH=./:../../../python ../../../../build/paddle/trainer/paddle_trainer --use_gpu=0 --config=./sparse_trainer_config.py --saving_period=1 --test_period=0 --num_passes=4 --dot_period=2 --log_period=20 --trainer_count=2 --saving_period_by_batches=5000 --local=1

Presumably because a matrix of type (matType == MAT_SPARSE_ROW_IDS) has to be initialized (why the single-GPU path has no such matrix is still unclear)?

backyes (Contributor, Author) commented Dec 26, 2016

Update:

  • Enhancement:
    The sparse update configuration needs a check: it only enables weight updates for the first hidden layer, and what effect it has on the other hidden layers is unknown (see the sketch below).
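
A minimal trainer_config_helpers-style sketch of the configuration in question, requesting sparse_update on two hidden layers (layer names and sizes are invented; whether the second layer's setting takes effect is exactly what needs checking):

from paddle.trainer_config_helpers import *

data = data_layer(name="input", size=10000)

# sparse_update on the first hidden layer is the known, tested case.
hidden1 = fc_layer(input=data, size=128, act=TanhActivation(),
                   param_attr=ParamAttr(sparse_update=True))

# The effect of sparse_update on deeper layers is the unknown to verify.
hidden2 = fc_layer(input=hidden1, size=64, act=TanhActivation(),
                   param_attr=ParamAttr(sparse_update=True))

outputs(fc_layer(input=hidden2, size=2, act=SoftmaxActivation()))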

luotao1 (Contributor) commented Feb 1, 2019

Thanks for contributing to PaddlePaddle! Since V1/V2 will not be maintained anymore, and the related code has been deleted from the develop branch as well, we are closing this PR. Welcome to contribute to Fluid, the latest version of PaddlePaddle.

luotao1 closed this Feb 1, 2019
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
Successfully merging this pull request may close these issues.

After training finishes, what are the use cases for storing the model's parameters in sparse form?