
Cost going to NaN with Paddle v0.10.0 for MT example #2563

Closed
alvations opened this issue Jun 22, 2017 · 11 comments

@alvations
Contributor

alvations commented Jun 22, 2017

After installing from source off the develop branch, the paddle command seems to be working fine:

$ git log 
commit 7bce40d7be9174bea90e75df684ce8526485b36a
Merge: 603fd43 252ef0c
Author: gangliao <liaogang@baidu.com>
Date:   Wed Jun 21 10:22:04 2017 +0800

    Merge pull request #2538 from wangkuiyi/generic.cmake-comments
    
    Rewrite tutorial comments in generic.cmake

$ sudo paddle version
PaddlePaddle 0.10.0, compiled with
    with_avx: ON
    with_gpu: ON
    with_double: OFF
    with_python: ON
    with_rdma: OFF
    with_timer: OFF

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle.v2 as paddle
>>> paddle.init(use_gpu=True, trainer_count=4)
I0622 16:51:44.955044 28154 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=4 
>>> exit()

I then cloned the book repo and ran train.py from the machine translation example, but the CPU training stopped with a floating point exception:

$ git clone https://github.com/PaddlePaddle/book.git
$ cd book/08.machine_translation/

book/08.machine_translation/$ python train.py 
I0622 16:54:11.143401 28309 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1 
I0622 16:54:11.374763 28309 GradientMachine.cpp:85] Initing parameters..
I0622 16:54:13.712622 28309 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 230.933862, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 230.808911, {'classification_error_evaluator': 0.9642857313156128}
.........
Pass 0, Batch 20, Cost 343.881104, {'classification_error_evaluator': 0.916167676448822}
.........
Pass 0, Batch 30, Cost 244.960254, {'classification_error_evaluator': 0.8907563090324402}
.....*** Aborted at 1498121868 (unix time) try "date -d @1498121868" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGFPE (@0x7f2047213b49) received by PID 28309 (TID 0x7f201a5d3700) from PID 1193360201; stack trace: ***
    @     0x7f2048fc8390 (unknown)
    @     0x7f2047213b49 paddle::AssignCpuEvaluate<>()
    @     0x7f204721a9a7 paddle::AssignEvaluate<>()
    @     0x7f2047211183 paddle::adamApply()
    @     0x7f2047208909 paddle::AdamParameterOptimizer::update()
    @     0x7f20471f2b6e paddle::SgdThreadUpdater::threadUpdateDense()
    @     0x7f20471f3d9f _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataOiOm
    @     0x7f2046ffec1c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f2045b39c80 (unknown)
    @     0x7f2048fbe6ba start_thread
    @     0x7f2048cf43dd clone
    @                0x0 (unknown)
Floating point exception (core dumped)

After switching to GPU training, the cost goes to NaN:

book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=4\)|g" train.py
book/08.machine_translation$ python test.py
python: can't open file 'test.py': [Errno 2] No such file or directory
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:04:29.819021 28398 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=4 
I0622 17:04:35.025086 28398 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0622 17:04:35.179461 28398 GradientMachine.cpp:85] Initing parameters..
I0622 17:04:37.593305 28398 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 232.981567, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 284.369263, {'classification_error_evaluator': 0.9420289993286133}
.........
Pass 0, Batch 20, Cost 265.632788, {'classification_error_evaluator': 0.9224806427955627}
.........
Pass 0, Batch 30, Cost 168.668164, {'classification_error_evaluator': 0.9146341681480408}
.........
Pass 0, Batch 40, Cost 119.270068, {'classification_error_evaluator': 0.8965517282485962}
.........
Pass 0, Batch 50, Cost 224.066553, {'classification_error_evaluator': 0.9174311757087708}
.........
Pass 0, Batch 60, Cost 295.795679, {'classification_error_evaluator': 0.9305555820465088}
.........
Pass 0, Batch 70, Cost 256.279614, {'classification_error_evaluator': 0.9599999785423279}
.........
Pass 0, Batch 80, Cost 206.731763, {'classification_error_evaluator': 0.9504950642585754}
.........
Pass 0, Batch 90, Cost 484.451318, {'classification_error_evaluator': 0.9037656784057617}
.........
Pass 0, Batch 100, Cost 181.277283, {'classification_error_evaluator': 0.966292142868042}
.........
Pass 0, Batch 110, Cost 281.560010, {'classification_error_evaluator': 0.9424460530281067}
.........
Pass 0, Batch 120, Cost 198.955090, {'classification_error_evaluator': 0.9693877696990967}
.........
Pass 0, Batch 130, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 140, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 150, Cost nan, {'classification_error_evaluator': 1.0}

Similarly with 1 GPU trainer:

book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=1\)|g" train.py

ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:09:47.405041 28503 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=1 
I0622 17:09:52.146150 28503 GradientMachine.cpp:85] Initing parameters..
I0622 17:09:54.330538 28503 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 253.607739, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 245.239307, {'classification_error_evaluator': 0.9495798349380493}
.........
Pass 0, Batch 20, Cost 362.484961, {'classification_error_evaluator': 0.9034090638160706}
.........
Pass 0, Batch 30, Cost 228.537988, {'classification_error_evaluator': 0.9099099040031433}
.........
Pass 0, Batch 40, Cost 277.921631, {'classification_error_evaluator': 0.9333333373069763}
.........
Pass 0, Batch 50, Cost 273.311084, {'classification_error_evaluator': 0.8872180581092834}
.........
Pass 0, Batch 60, Cost 310.044189, {'classification_error_evaluator': 0.9006622433662415}
.........
Pass 0, Batch 70, Cost 262.669629, {'classification_error_evaluator': 0.921875}
.........
Pass 0, Batch 80, Cost 135.404944, {'classification_error_evaluator': 0.9242424368858337}
.........
Pass 0, Batch 90, Cost 272.579102, {'classification_error_evaluator': 0.932330846786499}
.........
Pass 0, Batch 100, Cost 348.291699, {'classification_error_evaluator': 0.929411768913269}
.........
Pass 0, Batch 110, Cost 257.603052, {'classification_error_evaluator': 0.920634925365448}
.........
Pass 0, Batch 120, Cost 212.971094, {'classification_error_evaluator': 0.9903846383094788}
.........
Pass 0, Batch 130, Cost 198.442700, {'classification_error_evaluator': 0.9587628841400146}
.........
Pass 0, Batch 140, Cost 192.191089, {'classification_error_evaluator': 0.936170220375061}
.........
Pass 0, Batch 150, Cost 365.744531, {'classification_error_evaluator': 0.9329608678817749}
.........
Pass 0, Batch 160, Cost 226.738013, {'classification_error_evaluator': 0.9009009003639221}
.........
Pass 0, Batch 170, Cost 294.002539, {'classification_error_evaluator': 0.9444444179534912}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}
.........

I've tried changing the values for:

  • learning rate
  • batch size
  • L2 regularizer
  • gradient clipping
  • encoder/decoder dimensions
  • vocab size

But the cost still goes to NaN, and I can't seem to get through one epoch without it blowing up.
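
For reference, here is roughly where those knobs live in the book's train.py. The names follow the script; the values are only illustrative, not the exact ones I tried:

    # Names follow the book's train.py; values are illustrative only.
    dict_size = 30000        # vocab size
    encoder_size = 512       # encoder hidden dimension
    decoder_size = 512       # decoder hidden dimension

    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-5,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4),
        gradient_clipping_threshold=25.0)

    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5)        # batch size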

Possibly this is a related issue: #1738

@lcy-seso
Contributor

lcy-seso commented Jun 22, 2017

Thanks for reporting the problem. I will tune the parameters carefully and fix the NMT demo ASAP. Actually, I have run into the same problem as well.

@alvations
Contributor Author

alvations commented Jun 22, 2017

Thanks @lcy-seso and @kuke !

Adding the error clipping at https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L51 avoided the explosion:

        decoder_inputs = paddle.layer.mixed(
            size=decoder_size * 3,
            input=[
                paddle.layer.full_matrix_projection(input=context),
                paddle.layer.full_matrix_projection(input=current_word)
            ],
            layer_attr=ExtraAttr(error_clipping_threshold=100.0))

Do you know what the previous setting for the error/gradient clipping was in v0.8 and v0.9? Any idea why the gradient/error didn't explode in the previous versions?

@lcy-seso
Contributor

lcy-seso commented Jun 22, 2017

Previously, we usually set a global clipping threshold directly and it worked fine, but currently such a global setting (made in the optimizer) does not work; you have to set it parameter by parameter if possible.

I think this is a terrible bug. It also means that globally set regularizers and other parameters are invalid. Sorry, this must be fixed; I am working on it.

@alvations
Contributor Author

Ah, now I understand. Thank you @lcy-seso!

@alvations
Contributor Author

@lcy-seso @luotao1

A related issue, though not about the NaN: I just realized that the current train.py from the book:

  • doesn't save the model at every epoch, and there isn't an explicit parameter for saving the model to a custom path/name (the configuration parameters for saving the model are missing)

  • reports the current SGD cost, which is helpful for plotting, but without an average cost it's a little hard to tell when to stop early (a rough sketch of a running average is below).
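
A rough sketch of tracking a running average in the event handler, assuming the standard event_handler pattern from train.py:

    # Rough sketch: accumulate per-batch costs and print a running average
    # alongside the current cost (assumes the usual EndIteration events).
    step_costs = []

    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            step_costs.append(event.cost)
            if event.batch_id % 10 == 0:
                avg_cost = sum(step_costs) / len(step_costs)
                print "Pass %d, Batch %d, Cost %f, Avg Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, avg_cost,
                    event.metrics)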

@lcy-seso
Contributor

You are right.

  1. There is no code showing how to save the models in PaddleBook. You can check this example (encoder-decoder without attention): https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/train.py#L42. I will fix the NMT book chapter (a minimal sketch is also given after this list).
  2. The v2 API leaves calculating the average cost to the user, but we did print it before. I think we need an enhancement. Thank you.
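
A minimal sketch of saving the parameters at the end of every pass with the v2 API, assuming the parameters object created by paddle.parameters.create(cost) in train.py:

    # Minimal sketch: dump the parameters to a tar file after each pass.
    def event_handler(event):
        if isinstance(event, paddle.event.EndPass):
            with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                parameters.to_tar(f)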

@alvations
Contributor Author

alvations commented Jun 23, 2017

@lcy-seso Thanks in advance for fixing it!!

@lcy-seso
Contributor

It is a terrible bug that must be fixed. Sorry about that.

@alvations
Contributor Author

No worries, it's open source, so it should always "kaizen" (改善, continuously improve) =)
Thank you again!

@lcy-seso
Contributor

lcy-seso commented Jun 27, 2017

Hi @alvations, about the bug where some globally set parameters do not take effect: there is a way to work around it, but we will fix it properly.

  • The reason for the bug:
    • PaddlePaddle simplified the way it parses the network configuration file by introducing some global variables in the PR Fix V2 API #2288 (before this PR, the configuration-parsing process did not contain any global variables).
    • The optimizer is used to set default values for some parameters, including L2 regularization and gradient_clipping_threshold. These settings are recorded in global variables, so the global variables must be correctly initialized before the network topology is parsed.
    • This means the optimizer must be defined before the network topology for the global parameter settings to take effect.
  • To avoid the bug, define the optimizer before the network topology:
        # Define the optimizer first: its global settings (gradient clipping,
        # L2 regularization) are recorded in global variables that must be
        # initialized before the network topology is parsed.
        optimizer = paddle.optimizer.RMSProp(
            learning_rate=1e-3,
            gradient_clipping_threshold=10.0,
            regularization=paddle.optimizer.L2Regularization(rate=8e-4))
        # Only then define the network topology and create the parameters.
        cost = seq2seq_net(source_dict_dim, target_dict_dim)
        parameters = paddle.parameters.create(cost)
  • This is very dangerous for users because no error is reported. I will fix this. Thank you for reporting it.
  • I think this may be why the gradient/error didn't explode in the previous version, since this is a new bug.
