Prediction drops to 0 after certain number of epochs #153

Closed
tambetm opened this Issue Dec 7, 2015 · 24 comments

tambetm commented Dec 7, 2015

I'm using Neon for my deep Q-learning code, https://github.com/tambetm/simple_dqn. Recently I noticed an issue: the prediction of my network drops to 0 after a certain number of epochs. This can be seen from the Q-value graph:
(plot: breakout_neon_latest_meanq)

Normally it would look like this:
(plot: breakout_lives_meanq)
This plot was produced using Neon commit hash 7a56fa9 from October 30th. My code is exactly the same.

The most plausible explanation would be that the weights are truncated to 0 at some point. Because my code hasn't changed, I suspect something in the Neon code related to saving and loading weights repeatedly. In my code I need to clone a model, and the simplest (and most compatible) way of doing that is to just save and load the model. I do this ~45 times before the network prediction drops (it doesn't always drop at the same moment).

Any ideas what change could have resulted in such behavior, and how to debug it?
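
Roughly, the save-and-load cloning described above looks like this (a simplified sketch assembled from the neon calls that appear later in this thread, not the exact simple_dqn code; the helper name, argument names and path are illustrative):

from neon.models import Model
from neon.util.persist import save_obj

def clone_model(src_model, clone_layers, path='model_snapshot.pkl'):
    # Save the source model's weights (and optimizer states) to disk...
    save_obj(src_model.serialize(keep_states=True), path)
    # ...then load them into a separately constructed model.
    clone = Model(layers=clone_layers)
    clone.load_weights(path)
    return clone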

nervetumer (Contributor) commented Dec 8, 2015

Do you have access to the serialized model files for epochs 6 and 7? If so, we can directly check whether the weights are going to zero. The weights are saved via pickle, so I don't see how that would truncate them on serialization. Maybe there's an issue with another part of the model initialization from a serialized file.

Is 7a56fa9 a working commit (meaning the one linked to the lower plot)? Which commit was the other plot run on?

Are you serializing and deserializing every epoch? Could we maybe add a logging line to print the average Q-value for each iteration (mini-batch) between epochs, to see whether this is sudden or develops slowly over the course of an epoch?

Thanks
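
For reference, such a logging line could be as simple as the following (a generic sketch, not tied to neon or simple_dqn internals; the function and argument names are illustrative):

import numpy as np

def log_mean_q(q_values, step, log_every=100):
    # q_values: array of predicted Q-values for one mini-batch
    if step % log_every == 0:
        print("step %d: mean Q = %.6f" % (step, float(np.mean(q_values))))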

tambetm commented Dec 10, 2015

Thanks @nervetumer for the suggestions! I checked the serialized models and the weights are not zeros. But all the serialized models after epoch 7 are exactly the same (at least the means of the weights of all layers are).

My current hypothesis is that it might be something related to the RMSProp optimizer and its state handling. While the weight means seem OK, the mean state values for all layers are 1.26117e-44 starting from epoch 7. This is possibly related to commit 85a478e, because that's the only commit that touched the optimizers between the working and the non-working version. I also tried Adam, and while it doesn't have the exact same problem of predicting zero, it still fails to learn the way it used to.

Yes, 7a56fa9 is the working commit. The upper plot was made with the latest GitHub HEAD, probably e479ce3. I'm serializing and deserializing several times during an epoch.

Debugging this takes time; I will keep you informed.

yinyinl (Contributor) commented Dec 10, 2015

Have you tried running neon right before 85a478e and right after, to isolate whether it is the optimizer?

Are you using gradient_clip_value or gradient_clip_norm in your model? Before commit 85a478e, this was controlled by clip_gradients set to True and gradient_limit set to some value.
If the states all of a sudden become almost 0 from epoch 7, are you able to see the "grad" value or the "state" from epoch 6? The states are updated by:
state[:] = decay * state + self.be.square(grad) * (1.0 - decay)
Then we can trace further down what causes this issue.
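
As a toy illustration of how a state mean around 1e-44 could arise from that update (a numpy sketch only, not a claim about the actual root cause): if the incoming gradients are effectively zero, the state simply decays geometrically on every update and eventually lands in float32 denormal territory.

import numpy as np

# Toy sketch of the RMSProp state update above with zero gradients:
# the state decays geometrically and, after enough updates, ends up in
# float32 denormal territory (around 1e-45 to 1e-44), or at exactly 0
# if denormals are flushed to zero.
decay = np.float32(0.95)
state = np.float32(1.0)
grad = np.float32(0.0)

for _ in range(2000):
    state = decay * state + np.square(grad) * (np.float32(1.0) - decay)

print(state)  # a few times 1e-45 on a typical CPU build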

tambetm commented Dec 11, 2015

I played with the serialization test in test_model.py and I couldn't replicate the exact error, but I found another weird result:

print mlp.eval(train_set, Misclassification())

# Serialize model
save_obj(mlp.serialize(keep_states=True), tmp_save)

# Load model
mlp = Model(layers=layers)
mlp.load_weights(tmp_save)

print mlp.eval(train_set, Misclassification()) 

This outputs:

[ 0.74373335]
[ 0.88763332]

Shouldn't the misclassification rates be exactly the same if evaluated on the same dataset and with the same model? The difference is even more evident when you increase the n_test variable and let it train longer: after loading the model the result is always the same, 0.8876. Amusingly, the outputs and weights match...

Full code is here (just run python test_model.py):
https://gist.github.com/tambetm/65542362cb24256350d8

nervetumer (Contributor) commented Dec 13, 2015

I'm looking into the serialization issue.

nervetumer (Contributor) commented Dec 15, 2015

Thanks for noting this issue with the test. It is not a serialization issue; it actually has to do with a buffer mismatch that happens when the second instance of the Model object is generated. Since the layers passed to Model (i.e. Model(layers=layers)) have already been instantiated, they do not get new buffers, but a new shared buffer is generated for the MergeMultistream layer container. So when the model is run the second time, the layers inside MergeMultistream write their outputs to the buffer used the first time around, while the MergeMultistream layer container expects those outputs in a new location. This causes the outputs from the MergeMultistream layer on the second run to remain 0.

That said, we will try to push a fix for this. In the meantime, I'm wondering if you can try it out to see if this is the root of your problem as well; if your model touches the affected code, then maybe this is your issue too.

The code to try is here:
https://gist.github.com/nervetumer/7a220708967b88cbdcf4

This should only affect container layers that inherit from the MergeBroadcast class.
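
To illustrate the failure mode in isolation (a toy numpy sketch, not actual neon code): the reused layer keeps writing into the buffer it was originally configured with, while the rebuilt container reads from a newly allocated buffer that nothing ever writes to, so it only sees zeros.

import numpy as np

# Toy illustration of the buffer mismatch (not neon code).
old_buf = np.zeros(4, dtype=np.float32)  # buffer the layer was configured with
new_buf = np.zeros(4, dtype=np.float32)  # buffer the new container allocates

def layer_fprop(x):
    old_buf[:] = 2.0 * x   # the reused layer still writes to its old buffer
    return old_buf

layer_fprop(np.ones(4, dtype=np.float32))
print(new_buf)             # the container reads here: all zeros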

tambetm commented Dec 15, 2015

Did you mean to post only model.py? There is only one minor change there, and it didn't really affect the result: the second evaluation still produces 0.8876.

[ 0.73153335]
[ 0.88763332]

I'm not using MergeMultistream in my code, so I guess the fix doesn't apply directly. But once we get this fixed, hopefully I can modify the test code to reproduce my error as well (so far no luck).

I also noticed that version 1.1.3 (e479ce3) tends to have worse performance than version 1.1.0 (7a56fa9): the average score in Breakout is 305 vs 319. I'm using the exact same model snapshot (breakout_77.pkl), not doing any training at all, only prediction.

nervetumer (Contributor) commented Dec 15, 2015

Sorry, that was the wrong file. The one I meant to give you is layers/container.py:
https://gist.github.com/nervetumer/7c538802bad3f7a60576

tambetm commented Dec 16, 2015

Give me a few days for this. All my Maxwell GPUs are busy at the moment :).

tambetm commented Dec 18, 2015

As expected, this container fix didn't fix my problem:
(plot: breakout_contrainerfix)

Back to testing...

tambetm commented Jan 4, 2016

I can assure you that the problem is not related to saving and loading the model. When I turned off saving and loading, 7a56fa9 still learned, but 3676518 didn't. I'm currently testing whether #171 (comment) fixes the problem.

scttl added the bug label on Jan 8, 2016

nervetumer (Contributor) commented Jan 14, 2016

Hi @tambetm, have you had a chance to test #171 with this? I'm hoping to get some time to work on this soon.

tambetm commented Jan 15, 2016

Nope. Tested with 64 threads as outlined here: #171 (comment). Currently I'm out of ideas.

nervetumer (Contributor) commented Feb 16, 2016

Hi @tambetm, can you speak to @yinyinl's comment above with respect to isolating the commit that caused this? Are you sure that 85a478e caused the break?

nervetumer (Contributor) commented Feb 22, 2016

I think I tracked down the commit that is causing your problem: adab385. That said, I have no idea why this would affect you at all. I will need to look at it closely.

tambetm commented Feb 22, 2016

Thanks @nervetumer for tracking that down! I've been really busy last week and this week doesn't look any better... Will test it ASAP.

nervetumer (Contributor) commented Feb 22, 2016

If you can confirm that this is the commit causing the issue, that would help. I'm looking at the commit and working with people here to see why it might be causing you an issue.

nervetumer (Contributor) commented Feb 22, 2016

@tambetm, I think I have verification that this is the commit that led to your model breaking. I believe the problem may be coming from lines 143-145 of neon/layers/layer.py. I'm testing it now and will report back as soon as enough epochs have run to have some confidence. I'm still not sure why this would cause a problem.

nervetumer (Contributor) commented Feb 23, 2016

Below is the Q plot for the commit SHA adab385:

(plot: breakout)

If I comment out lines 143 and 144 in neon/layers/container.py

if self.next_layer is None or self.next_layer.parallelism != self.parallelism:
    self.owns_delta = True

I get a good run:

(plot: breakout)

What is happening is that this code switches self.owns_delta from False to True for the Linear layers in your model. I'm still not exactly sure why this causes a strange, random error like this. We will try to push a fix for this ASAP. If you want, comment those lines out for now.

guiambros commented Feb 24, 2016

I guess you meant neon/layers/layer.py, no?

I'm trying to reproduce the error here with adab385, but no luck yet (likely my fault). Starting again from scratch; I will leave it running overnight and report back tomorrow.

tambetm commented Feb 24, 2016

I can also confirm that there is no error with c800319 and that the error occurs with 257b5d5 (which is one commit after adab385; adab385 itself had a weird bug and I had to test with the next commit).

The lines that cause the error are beyond my depth with Neon, so I'm relying on you here. I wonder if it would be possible to create a simpler test case to reproduce the issue? I tried once, based on test_model_serialize() in tests/test_model.py, but failed. Reproducing the issue currently takes a couple of hours, and that's the main bummer.

nervetumer (Contributor) commented Mar 12, 2016

We have a PR staged to fix this.

apark263 (Contributor) commented Mar 14, 2016

We're delaying the fix until we're sure that it won't affect anything else, but in the meantime, adding the following lines to your existing code should accomplish the same thing:

Ensure the same parallelism mode all the way through the network, as follows.

Prior to model initialization for self.model here:

for l in self.model.layers.layers:
    l.parallelism = 'Disabled'

And prior to model initialization for self.target_model here:

for l in self.target_model.layers.layers:
    l.parallelism = 'Disabled'

I should also note that you can now use get_description on the model to accomplish your deep copy without serializing to disk (this will still transfer the weights from device to host and back) by doing:

pdict = self.model.get_description(get_weights=True)
self.target_model.deserialize(pdict, load_states=False)

jennifermyers added the ready label on Jun 23, 2016

jennifermyers added this to the v1.5.1 milestone on Jul 1, 2016

jennifermyers (Contributor) commented Jul 1, 2016

This should be fixed now in 1.5.1.

jennifermyers removed the ready label on Jul 1, 2016
