
paddle cluster SequenceLastInstanceLayer error #1316

Closed
hiahiahu opened this issue Feb 12, 2017 · 3 comments

hiahiahu commented Feb 12, 2017

The configuration is as follows:

 73     Layer(
 74         name='backward_lstm_' + doc,
 75         type='lstmemory',
 76         active_type='tanh',
 77         active_gate_type='sigmoid',
 78         active_state_type='sigmoid',
 79         bias=Bias(parameter_name='backward_lstm.bias'),
 80         inputs=Input(
 81             'emb_project' + doc, parameter_name='backward_lstm.w0'), )
 82     Layer(
 83             #name = "backward_first_" + doc,
 84             name="encode_" + seg_type + "_" + doc,
 85             type = "seqfirstins",
 86             active_type = "linear", bias = False,
 87             inputs = [Input('backward_lstm_' + doc)],)
 88

The error is as follows:

Sun Feb 12 01:57:48 2017[1,67]<stderr>:F0212 01:57:48.500427 25238 Matrix.h:228] Check failed: startRow + numRows <= getHeight() (1071 vs. 1070) 
Sun Feb 12 01:57:48 2017[1,67]<stderr>:*** Check failure stack trace: ***
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @          0x13c3398  google::LogMessage::Fail()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @          0x13c32f0  google::LogMessage::SendToLog()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @          0x13c2d85  google::LogMessage::Flush()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @          0x13c5b46  google::LogMessageFatal::~LogMessageFatal()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @           0x7dd37e  paddle::Matrix::subMatrix()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @           0x651929  paddle::SequenceLastInstanceLayer::forward()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @           0x6afa40  paddle::NeuralNetwork::forward()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @           0x6a44a9  paddle::TrainerThread::forward()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @           0x6a6bf5  paddle::TrainerThread::computeThread()
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @     0x7fcc770788a0  execute_native_thread_routine
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @     0x7fcc778f61c3  start_thread
Sun Feb 12 01:57:48 2017[1,67]<stderr>:    @     0x7fcc767e912d  __clone
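For reference, the failed CHECK at Matrix.h:228 is a row-range bounds check on subMatrix(). Below is a minimal standalone sketch of the arithmetic the log reports (illustrative C++, not the actual Paddle source): with numRows = 1, "1071 vs. 1070" means startRow = 1070 while the matrix has only rows 0..1069.

#include <cstddef>
#include <cstdio>

// The CHECK in the log enforces: startRow + numRows <= getHeight().
static bool rowSliceInBounds(std::size_t startRow, std::size_t numRows,
                             std::size_t height) {
  return startRow + numRows <= height;
}

int main() {
  const std::size_t height = 1070;  // getHeight() of the input matrix
  const std::size_t insId = 1070;   // row index computed from starts[]
  std::printf("%zu + 1 <= %zu ? %s\n", insId, height,
              rowSliceInBounds(insId, 1, height) ? "ok" : "CHECK failed");
  return 0;
}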

Additionally:
1. Running locally with multiple CPUs does not produce the error; it only appears on the cluster (multiple machines, multiple CPUs).
2. Changing line 85 to type='max' makes the error go away.
3. Running addr2line on the faulting frame:

addr2line 0x651929 -e opt/paddle/bin/paddle_trainer
/home/xx.../paddle_internal_release_tools/idl/paddle/Paddle/paddle/gserver/layers/SequenceLastInstanceLayer.cpp:83

The corresponding line 83 is:

68 void SequenceLastInstanceLayer::forward(PassType passType) {
 69   SequencePoolLayer::forward(passType);
 70
 71   const int* starts = startPositions_->getData(false);
 72   MatrixPtr inputValue = getInputValue(0);
 73   MatrixPtr outputValue = getOutputValue();
 74
 75   {
 76     AsyncGpuBlock asyncGpuBlock;
 77     REGISTER_TIMER_INFO("SequenceLastInstanceLayerForward", getName().c_str());
 78
 79     for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
 80       int insId =
 81           config_.select_first() ? starts[seqId] : starts[seqId + 1] - 1;
 82
 83       outputValue->subMatrix(seqId, 1, tmpDest_)
 84           ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
 85     }
 86   }
 87
 88   if (biases_.get() != NULL) {
 89     outputValue->addBias(*(biases_->getW()), 1);
 90   }
 91
 92   /*  activation, should set to 'linear' in most cases */
 93   forwardActivation();
 94 }
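From this forward() it appears that both seqfirstins and seqlastin share the same layer, differing only in config_.select_first(): the former reads starts[seqId], the latter starts[seqId + 1] - 1. Every insId stays in bounds only if the start-positions array is consistent with the input, i.e. starts[0] == 0, entries non-decreasing, and starts[numSeqs] == inputValue->getHeight(). A minimal standalone sketch (firstBadSequence is a hypothetical helper, not Paddle code) showing how an inconsistent starts[] produces exactly this out-of-range insId:

#include <cstdio>
#include <vector>

// Returns the first seqId whose extracted row would be out of bounds,
// or -1 if all accesses are valid. Mirrors the insId computation in
// SequenceLastInstanceLayer::forward() quoted above.
int firstBadSequence(const std::vector<int>& starts, int inputHeight,
                     bool selectFirst) {
  const int numSeqs = static_cast<int>(starts.size()) - 1;
  for (int seqId = 0; seqId < numSeqs; ++seqId) {
    int insId = selectFirst ? starts[seqId] : starts[seqId + 1] - 1;
    if (insId < 0 || insId >= inputHeight) return seqId;
  }
  return -1;
}

int main() {
  // Hypothetical batch reproducing the off-by-one in the log: the start
  // positions describe 1071 rows, but the input matrix has only 1070.
  std::vector<int> starts = {0, 400, 800, 1070, 1071};
  std::printf("seqfirstins: first bad seqId = %d\n",
              firstBadSequence(starts, 1070, /*selectFirst=*/true));   // 3
  std::printf("seqlastin:   first bad seqId = %d\n",
              firstBadSequence(starts, 1070, /*selectFirst=*/false));  // 3
  return 0;
}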

And the batch_size in my configuration:

Settings(
        algorithm='sgd',
        learning_rate=0.0002,
        batch_size=1000,
        learning_rate_decay_a=0,
        learning_rate_decay_b=0,
        average_window=0, 
        max_average_window=0, 
        )

Why does a result like 1071 appear (numRows=1)?

startRow + numRows <= getHeight() (1071 vs. 1070) 
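One hedged reading of that message: since numRows is 1, startRow must be 1070, i.e. the insId taken from starts[] equals getHeight(). The 1071 is therefore startRow + numRows rather than anything derived from batch_size; assuming batch_size counts sequences per batch while getHeight() counts the total time-step rows across all sequences, the two numbers are not expected to match, and the failure points at start positions that disagree with the height of the input value matrix.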

4. Earlier versions of Paddle's SequenceLastInstanceLayer.cpp did not produce this error. Now that the logic has been unified into the SequencePoolLayer.cpp abstraction, why does the cluster version fail? The previous version of the SequenceLastInstanceLayer file is attached:
SequenceLastInstanceLayer.cpp.txt

Thanks for your help!


luotao1 commented Feb 13, 2017

> Changing line 85 to type='max' makes the error go away.

What about changing it to type='seqlastin'?

> Why does a result like 1071 appear (numRows=1)?

Most often this happens because the data has overflowed (gone out of range).

hiahiahu (Author) commented

@luotao1 With batch_size=1k, how could the data overflow?
seqlastin doesn't work either (reverse=false); it reports the same error.

lcy-seso (Contributor) commented

I am closing this issue due to inactivity. Please feel free to reopen it if more information is available.
