Fix multidense compatibility with tf 2.16 and keras 3 #1975

Merged — 4 commits merged into master from fix-multidense-tf216 on Mar 5, 2024

Conversation

@APJansen (Collaborator) commented Mar 4, 2024

No description provided.

@APJansen (Collaborator, Author) commented Mar 4, 2024

@scarlehoff I'm not sure this fix works, and 3.12 isn't in the CI yet. Do you have a PR with the new test that I could push this to (or rebase this on)? I couldn't find it.

@scarlehoff (Member)

I have added 3.12 to the tests. It might fail because of other reasons of course (if it runs!), but let's see whether this one is fixed and then revert that commit.

@scarlehoff (Member) commented Mar 4, 2024

When trying it on my computer I got this error, though:

/NNPDF/src/nnpdf/n3fit/src/n3fit/model_gen.py", line 789, in generate_nn
    pdfs = layer(pdfs)
           ^^^^^^^^^^^
/NNPDF/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/multi_dense.py", line 128,
   output_shape = output_shape[:1] + [self.replicas] + output_shape[1:]
                   ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
TypeError: Exception encountered when calling MultiDense.call().
can only concatenate tuple (not "list") to tuple

Arguments received by MultiDense.call():
  • args=('<KerasTensor shape=(1, None, 2), dtype=float32, sparse=None, name=xgrids_processed>',)
  • kwargs=<class 'inspect._empty'>
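A minimal, hypothetical sketch of the shape handling this error points at (the helper name and the example shapes are made up; the real change lives in multi_dense.py): keras 3 passes shapes as tuples, so cast to a list before splicing in the replica axis.

```python
def insert_replica_axis(shape, replicas):
    # Hypothetical helper illustrating the fix: accept either a tuple (keras 3)
    # or a list (keras 2) and splice the replica dimension in after the batch axis.
    shape = list(shape)
    return tuple(shape[:1] + [replicas] + shape[1:])

# Example: a (batch, gridpoints, features) shape becomes (batch, replicas, gridpoints, features)
print(insert_replica_axis((1, None, 8), replicas=10))  # (1, 10, None, 8)
```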

@scarlehoff (Member) commented Mar 4, 2024

Fixing that generates further errors, I'm afraid:

ValueError: In a nested call() argument, you cannot mix tensors and non-tensors. Received invalid mixed argument: inputs={'pdf_x': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_25>, 'pdf_xgrid_integration': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_36>, 'xgrid_integration': <KerasTensor shape=(1, None, 1), dtype=float32, sparse=None, name=integration_grid>, 'photon_integral': array([[[0.]]])}

so keras 3 might be breaking quite a lot of stuff...
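For the mixed-argument error, a hedged sketch of the kind of workaround it suggests (variable names are illustrative, not the actual n3fit code): convert the plain numpy photon_integral array to a backend tensor before it enters the nested call() inputs.

```python
import numpy as np
from keras import ops

# Illustrative only: make the photon integral a backend tensor so every entry
# of the nested inputs dict is a tensor rather than a mix of tensors and arrays.
photon_integral = ops.convert_to_tensor(np.zeros((1, 1, 1)), dtype="float32")
```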

@APJansen (Collaborator, Author) commented Mar 4, 2024

Yes, I was just about to say that I would expect that...
We are doing many things which are not standard, particularly in MetaModel and MetaLayer, so I would expect a lot of things to break.

I think transitioning to Keras 3 with multiple supported backends would be great, especially since all those backend wrappers could then be removed (or at least moved down a level). But timing-wise perhaps this is not the best moment...

One thing you can check is what happens with the dense-per-flavour layer, but I think the error you quote above would also arise there.

@scarlehoff (Member) commented Mar 4, 2024

It is not that bad; the problem right now seems to be

NotImplementedError: numpy() is only available when eager execution is enabled

I'm not sure where that's happening, since the traceback is just the training loop, so at some point we are calling something we are not allowed to call (but it should've failed earlier, during compilation of the model... it's, at least partially, a tensorflow bug).

Edit: setting eager execution, everything works other than the log broken down by experiment, because they have changed the reporting, but that's minor.
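One way to force eager execution for this kind of debugging (a sketch; not necessarily how it was set here, and passing run_eagerly=True to compile() is an alternative):

```python
import tensorflow as tf

# Run tf.function-decorated code eagerly so calls like .numpy() are allowed;
# this disables graph compilation, so expect a noticeable slowdown.
tf.config.run_functions_eagerly(True)
```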

@APJansen (Collaborator, Author) commented Mar 4, 2024

numpy is used in op.scatter_to_one in msr_normalization.py, maybe that's it.
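For reference, a graph-safe sketch of a scatter-to-one style operation (hypothetical; not necessarily how op.scatter_to_one is actually implemented) that avoids any .numpy() call:

```python
import tensorflow as tf

def scatter_to_one(values, indices, output_dim):
    # Start from a vector of ones and overwrite the given positions with
    # `values`, using tensor ops only so it can run inside a compiled graph.
    # `indices` is expected with shape (n_updates, 1) for a 1-D output.
    ones = tf.ones(output_dim, dtype=values.dtype)
    return tf.tensor_scatter_nd_update(ones, indices, values)
```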

@scarlehoff (Member) commented Mar 4, 2024

It seems to be inside tensorflow/keras. More specifically, it is treating the preprocessing factor variables as numpy arrays. It really seems like a bug in tensorflow, in that if there were anything wrong with them it should have been caught by the time the training starts.

@scarlehoff (Member)

Indeed. The problem is there only when the preprocessing is trainable.

@scarlehoff (Member)

On my computer it works for 3.12, but now it breaks for 3.11.

Which I'd say is ok; now we just have to either add a conditional or fix it in some other way, but it's good :)

Let's see what happens with the rest of the tests.

@scarlehoff (Member)

The errors are in the np.allclose :)

@scarlehoff (Member)

(sorry for the spam of commits, I will squash all changes together; I've basically been using the CI as the test machine)

@scarlehoff added the run-fit-bot label ("Starts fit bot from a PR.") on Mar 4, 2024

github-actions bot commented Mar 5, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff marked this pull request as ready for review on March 5, 2024 07:30
@scarlehoff (Member)

Please @APJansen, when you have time, could you check that the changes I made are sensible?

@Cmurilochem I had to change the scope of the hyperopt pickle test to check that, when starting from a pickle, the first new trial does give you the same parameters as if all trials had been run sequentially... however, the results after that first trial change. It's a sub-% change, but that's enough to change the following trials.
I've noticed that tf (or maybe keras) has changed the behaviour of the special case seed=0 (which now basically means "get some random seed" and broke some of the multidense tests); I wonder whether this is related.

How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).
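A minimal sketch of the kind of guard the seed=0 change calls for (hypothetical helper; it assumes, as noted above, that seed=0 now behaves like "no seed"):

```python
from keras.initializers import GlorotUniform

def replica_initializer(seed):
    # Assumption: with tf 2.16 / keras 3, seed=0 is treated like seed=None
    # (a random seed), so remap it to a fixed non-zero value to stay reproducible.
    if seed == 0:
        seed = 1
    return GlorotUniform(seed=seed)
```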

@scarlehoff removed the run-fit-bot label ("Starts fit bot from a PR.") on Mar 5, 2024
@APJansen (Collaborator, Author) left a comment

Thanks for taking care of this, I left some comments. (I can't approve since I opened the PR.)

Most importantly, have you checked that these try/excepts don't occur inside the training loop? I imagine this could cause a significant slowdown if you end up in the except branch.

Review threads (outdated, resolved) on:
n3fit/src/n3fit/backends/keras_backend/MetaModel.py
n3fit/src/n3fit/backends/keras_backend/callbacks.py
n3fit/src/n3fit/backends/keras_backend/operations.py
@scarlehoff (Member)

> have you checked that these try/excepts don't occur inside the training loop?

It should be compiled away after the first pass, but it's true, it might be better to have an if condition to generate the function, just in case...
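A sketch of the if-condition alternative (hypothetical module-level guard; the wrapped function is just an example), deciding once at import time instead of paying for a try/except in the call path:

```python
import keras

# Decide once, at import time, which implementation to generate.
KERAS_3 = int(keras.__version__.split(".")[0]) >= 3

if KERAS_3:
    def tensor_to_numpy(tensor):
        # keras 3: backend-agnostic conversion through keras.ops
        return keras.ops.convert_to_numpy(tensor)
else:
    def tensor_to_numpy(tensor):
        # tf.keras (keras 2): eager tensors expose .numpy() directly
        return tensor.numpy()
```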

@goord (Collaborator) commented Mar 5, 2024

> How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).

I would say it's not super-important. The restart was mainly there to get around cluster allocation time limits; reproducibility is more of a nice-to-have.

@scarlehoff (Member)

Nice! Then @APJansen, is it fine if I merge this?

(I'll rebase on top of the fk-refactor branch in case there are more 2.16 fixes to be done; better safe than sorry.)

@scarlehoff changed the base branch from master to fk-refactor on March 5, 2024 13:53
@scarlehoff changed the title from "Fix multidense compatibility with tf 2.16" to "Fix multidense compatibility with tf 2.16 and keras 3" on Mar 5, 2024
@APJansen (Collaborator, Author) left a comment

Yes, ok by me, apart from the one comment.

Review thread (outdated, resolved) on: n3fit/src/n3fit/stopping.py
@Cmurilochem (Collaborator) commented Mar 5, 2024

> Please @APJansen, when you have time, could you check that the changes I made are sensible?
>
> @Cmurilochem I had to change the scope of the hyperopt pickle test to check that, when starting from a pickle, the first new trial does give you the same parameters as if all trials had been run sequentially... however, the results after that first trial change. It's a sub-% change, but that's enough to change the following trials. I've noticed that tf (or maybe keras) has changed the behaviour of the special case seed=0 (which now basically means "get some random seed" and broke some of the multidense tests); I wonder whether this is related.
>
> How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).

I agree with @goord. But remember that at the time we added some constraints to guarantee this, namely:

rngs = [np.random.default_rng(seed=seed) for seed in seeds]
seeds = [generator.integers(1, pow(2, 30)) * k for generator in rngs]

But please feel free to change the scope of the test and go over this.

@scarlehoff (Member)

This problem seems to only appear for python 3.12 / tensorflow 2.16, so it might be safer to run hyperopt with <3.12 for the time being, just in case...

Base automatically changed from fk-refactor to master March 5, 2024 16:20
@scarlehoff mentioned this pull request on Mar 5, 2024
@scarlehoff (Member) left a comment

I don't think we can have the conda package yet, but at least we can start using the code in py3.12 (and maybe find other issues!)

APJansen and others added 4 commits March 5, 2024 20:08
kicking down the can

recover previous behaviour

try-except for 3.11

deal with type missmatch

make sure units are int

remove pdb

fix change in how weights are named

Update n3fit/src/n3fit/tests/test_hyperopt.py

0 is understood as None by initializer

change scope of hyperopt test

bugfix

312
@scarlehoff merged commit 0a5fc61 into master on Mar 5, 2024
9 checks passed
@scarlehoff deleted the fix-multidense-tf216 branch on March 5, 2024 23:12