Fix multidense compatibility with tf 2.16 and keras 3 #1975
Conversation
@scarlehoff I'm not sure this fix works, and 3.12 isn't in the CI yet. Do you have a PR with the new test that I could push this to (or rebase this on)? I couldn't find it.
I have added 3.12 to the tests. It might fail for other reasons of course (if it runs!), but let's see whether this one is fixed and then revert that commit.
When trying it on my computer I got this error, though:
Fixing that generates further errors, I'm afraid,
so Keras 3 might be breaking quite a lot of stuff...
Yes, I was just about to say that I would expect that... I think transitioning to Keras 3 with multiple supported backends would be great, especially since all those backend wrappers could be removed (or at least moved down a level). But timing-wise perhaps this is not the best moment. One thing you can check is what happens with the dense-per-flavour layer, but I think the error you quote above would also arise there.
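(For context, and not code from this PR: the multi-backend mechanism referred to above is driven by an environment variable that has to be set before `keras` is imported. A minimal sketch using the public Keras 3 API:)

```python
# Minimal sketch of Keras 3 backend selection (not code from this PR):
# KERAS_BACKEND must be set before keras is imported.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"  # or "jax" / "torch"

import keras

# Code written against keras.ops is backend-agnostic, which is what would
# allow the backend-specific wrappers to move down a level.
x = keras.ops.ones((2, 3))
print(keras.backend.backend(), x.shape)
```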
It is not that bad; the problem right now seems to be:
I'm not sure where that's happening, since the traceback is just the training, so at some point we are calling something we are not allowed to call (but it should have failed earlier, during compilation of the model... it's, at least partially, a TensorFlow bug). Edit: with eager execution enabled everything works, other than the log broken down by experiment, because they have changed the reporting, but that's minor.
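(For reference, the eager-execution workaround mentioned above is a single global switch in TensorFlow, useful as a debugging aid rather than as the actual fix:)

```python
import tensorflow as tf

# Run tf.function-decorated code eagerly instead of compiling it to a graph.
# Gives readable tracebacks, at the cost of training speed.
tf.config.run_functions_eagerly(True)
```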
numpy is used in
It seems to be inside tensorflow/keras. More specifically, it is treating the preprocessing factor variables as numpy arrays. That really seems like a bug in TensorFlow: if there's anything wrong with them, it should have been caught well before the training starts.
Indeed. The problem is only there when the preprocessing is trainable.
On my computer it works for 3.12 but now breaks for 3.11. Which I'd say is ok; now we just have to either add a conditional or fix it in some other way, but it's good :) Let's see what happens with the rest of the tests.
The errors are in the
(Sorry for the spam of commits; I will squash all changes together. I've been basically using the CI as the test machine.)
Force-pushed from 97739b7 to db760e4
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Please @APJansen, when you have time could you make sure that the changes I made are sensible? @Cmurilochem I had to change the scope of the hyperopt pickle test to check that, when starting from a pickle, the first trial does indeed give you the same parameters as if the trials had been run sequentially... however, the results after the first trial change. It's a sub-% change, but that's enough to change the following trials. How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here, tbh).
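(For readers unfamiliar with the restart mechanism being tested: as described above, the restart goes through a pickled trials object. A toy sketch of that pattern with the hyperopt library; the objective and search space below are placeholders, not n3fit's:)

```python
import pickle

from hyperopt import Trials, fmin, hp, tpe

space = hp.uniform("x", -1.0, 1.0)

# First run: a few trials, then persist the Trials object.
trials = Trials()
fmin(lambda x: x**2, space, algo=tpe.suggest, max_evals=5, trials=trials)
with open("trials.pkl", "wb") as f:
    pickle.dump(trials, f)

# Restart: reload and continue. The test checks that the first resumed trial
# proposes the same parameters as an uninterrupted sequential run would;
# later trials can drift if the losses differ even at the sub-% level.
with open("trials.pkl", "rb") as f:
    trials = pickle.load(f)
fmin(lambda x: x**2, space, algo=tpe.suggest, max_evals=10, trials=trials)
```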
Thanks for taking care of this, I left some comments. (can't approve as I started the PR)
Most importantly, have you checked that these try/excepts don't occur inside the training loop? I imagine this could cause a significant slowdown if you end up in the except branch.
It should be compiled away on the first pass, but it's true, it might be better to have an if condition to generate the function, just in case...
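(A sketch of the "if condition" alternative being discussed: decide once, at import time, which implementation to use based on the installed TensorFlow version, so nothing version-dependent sits on the hot path. The `_new_impl`/`_old_impl` names are hypothetical placeholders, not the functions in this PR:)

```python
import tensorflow as tf

# True for TF 2.16+ (the versions that ship Keras 3).
TF_GE_2_16 = tuple(int(p) for p in tf.__version__.split(".")[:2]) >= (2, 16)


def _new_impl(x):
    ...  # behaviour needed for tf >= 2.16 / keras 3


def _old_impl(x):
    ...  # previous behaviour for tf < 2.16


# Chosen once here; callers never hit a try/except inside the training loop.
apply_op = _new_impl if TF_GE_2_16 else _old_impl
```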
I would say it's not super important. The restart was mainly there to get around cluster allocation time limits; reproducibility is more of a nice-to-have.
Nice! Then @APJansen, is it fine if I merge this? (I'll rebase on top of the fk refactor in case there are more 2.16 fixes to be done; better safe than sorry.)
Force-pushed from 47194ed to 6571169
Yes ok by me, apart from the one comment.
Force-pushed from 6571169 to 7babfe2
I agree with @goord. But remember that at the time we added some constraints to guarantee this, namely:
and `n3fit/src/n3fit/model_trainer.py`, lines 876 to 877 in 7babfe2.
But please feel free to change the scope of the test and go over this.
This problem seems to appear only for python 3.12 / tensorflow 2.16, so it might be safer to run hyperopt with <3.12 for the time being, just in case...
I don't think we can have the conda package yet, but at least we can start using the code in py3.12 (and maybe find other issues!)
Squashed commits:
kicking down the can
recover previous behaviour
try-except for 3.11
deal with type missmatch
make sure units are int
remove pdb
fix change in how weights are named
Update n3fit/src/n3fit/tests/test_hyperopt.py
0 is understood as None by initializer
change scope of hyperopt test
bugfix 312
Force-pushed from 7babfe2 to 9e6dd2c