Fix multidense compatibility with tf 2.16 and keras 3 #1975

Merged — 4 commits merged into master from fix-multidense-tf216 on Mar 5, 2024

Conversation

@APJansen (Collaborator) commented Mar 4, 2024

No description provided.

@APJansen (Collaborator, Author) commented Mar 4, 2024

@scarlehoff I'm not sure this fix works, and 3.12 isn't in the CI yet. Do you have a PR with the new test that I could push this to (or rebase this on)? I couldn't find it.

@scarlehoff (Member)

I have added 3.12 to the tests. It might fail because of other reasons of course (if it runs!), but let's see whether this one is fixed and then revert that commit.

@scarlehoff (Member) commented Mar 4, 2024

When trying it on my computer I got this error, though:

/NNPDF/src/nnpdf/n3fit/src/n3fit/model_gen.py", line 789, in generate_nn
    pdfs = layer(pdfs)
           ^^^^^^^^^^^
/NNPDF/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/multi_dense.py", line 128,
   output_shape = output_shape[:1] + [self.replicas] + output_shape[1:]
                   ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
TypeError: Exception encountered when calling MultiDense.call().
can only concatenate tuple (not "list") to tuple

Arguments received by MultiDense.call():
  • args=('<KerasTensor shape=(1, None, 2), dtype=float32, sparse=None, name=xgrids_processed>',)
  • kwargs=<class 'inspect._empty'>
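A minimal, hypothetical sketch of the shape handling this error points at (the helper name and the example shapes are made up; the real change lives in multi_dense.py): keras 3 passes shapes as tuples, so cast to a list before splicing in the replica axis.

```python
def insert_replica_axis(shape, replicas):
    # Hypothetical helper illustrating the fix: accept either a tuple (keras 3)
    # or a list (keras 2) and splice the replica dimension in after the batch axis.
    shape = list(shape)
    return tuple(shape[:1] + [replicas] + shape[1:])

# Example: a (batch, gridpoints, features) shape becomes (batch, replicas, gridpoints, features)
print(insert_replica_axis((1, None, 8), replicas=10))  # (1, 10, None, 8)
```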

@scarlehoff (Member) commented Mar 4, 2024

Fixing that generates further errors, I'm afraid:

ValueError: In a nested call() argument, you cannot mix tensors and non-tensors. Received invalid mixed argument: inputs={'pdf_x': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_25>, 'pdf_xgrid_integration': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_36>, 'xgrid_integration': <KerasTensor shape=(1, None, 1), dtype=float32, sparse=None, name=integration_grid>, 'photon_integral': array([[[0.]]])}

so keras 3 might be breaking quite a lot of stuff...
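For the mixed-argument error, a hedged sketch of the kind of workaround it suggests (variable names are illustrative, not the actual n3fit code): convert the plain numpy photon_integral array to a backend tensor before it enters the nested call() inputs.

```python
import numpy as np
from keras import ops

# Illustrative only: make the photon integral a backend tensor so every entry
# of the nested inputs dict is a tensor rather than a mix of tensors and arrays.
photon_integral = ops.convert_to_tensor(np.zeros((1, 1, 1)), dtype="float32")
```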

@APJansen (Collaborator, Author) commented Mar 4, 2024

Yes, I was just about to say that I would expect that...
We are doing many things which are not standard, particularly in MetaModel and MetaLayer, so I would expect a lot of things to break.

I think transitioning to Keras 3 with multiple supported backends would be great, especially since all those backend wrappers could then be removed (or at least moved down a level). But timing-wise perhaps this is not the best moment...

One thing you can check is what happens with the dense-per-flavour layer, but I think the error you quote above would also arise there.

@scarlehoff (Member) commented Mar 4, 2024

It is not that bad; the problem right now seems to be

NotImplementedError: numpy() is only available when eager execution is enabled

I'm not sure where that's happening, since the traceback is just the training loop, so at some point we are calling something we are not allowed to call (but it should've failed earlier, during compilation of the model... it's, at least partially, a tensorflow bug).

Edit: setting eager execution, everything works other than the log broken down by experiment, because they have changed the reporting, but that's minor.
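One way to force eager execution for this kind of debugging (a sketch; not necessarily how it was set here, and passing run_eagerly=True to compile() is an alternative):

```python
import tensorflow as tf

# Run tf.function-decorated code eagerly so calls like .numpy() are allowed;
# this disables graph compilation, so expect a noticeable slowdown.
tf.config.run_functions_eagerly(True)
```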

@APJansen (Collaborator, Author) commented Mar 4, 2024

numpy is used in op.scatter_to_one in msr_normalization.py, maybe that's it.
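For reference, a graph-safe sketch of a scatter-to-one style operation (hypothetical; not necessarily how op.scatter_to_one is actually implemented) that avoids any .numpy() call:

```python
import tensorflow as tf

def scatter_to_one(values, indices, output_dim):
    # Start from a vector of ones and overwrite the given positions with
    # `values`, using tensor ops only so it can run inside a compiled graph.
    # `indices` is expected with shape (n_updates, 1) for a 1-D output.
    ones = tf.ones(output_dim, dtype=values.dtype)
    return tf.tensor_scatter_nd_update(ones, indices, values)
```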

@scarlehoff (Member) commented Mar 4, 2024

It seems to be inside tensorflow/keras. More specifically, it is treating the preprocessing factor variables as numpy arrays. It really seems like a bug in tensorflow, in that if there were anything wrong with them it should have been caught by the time the training starts.

@scarlehoff (Member)

Indeed. The problem is there only when the preprocessing is trainable.

@scarlehoff (Member)

On my computer it works for 3.12, but now it breaks for 3.11.

Which I'd say is ok; now we just have to either add a conditional or fix it in some other way, but it's good :)

Let's see what happens with the rest of the tests.

@scarlehoff (Member)

The errors are in the np.allclose :)

@scarlehoff (Member)

(sorry for the spam of commits, I will squash all changes together; I've basically been using the CI as the test machine)

@scarlehoff added the run-fit-bot label ("Starts fit bot from a PR.") on Mar 4, 2024

github-actions bot commented Mar 5, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff marked this pull request as ready for review on March 5, 2024 07:30
@scarlehoff (Member)

Please @APJansen, when you have time, could you check that the changes I made are sensible?

@Cmurilochem I had to change the scope of the hyperopt pickle test to check that, when starting from a pickle, the first new trial does give you the same parameters as if all trials had been run sequentially... however, the results after that first trial change. It's a sub-% change, but that's enough to change the following trials.
I've noticed that tf (or maybe keras) has changed the behaviour of the special case seed=0 (which now basically means "get some random seed" and broke some of the multidense tests); I wonder whether this is related.

How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).
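A minimal sketch of the kind of guard the seed=0 change calls for (hypothetical helper; it assumes, as noted above, that seed=0 now behaves like "no seed"):

```python
from keras.initializers import GlorotUniform

def replica_initializer(seed):
    # Assumption: with tf 2.16 / keras 3, seed=0 is treated like seed=None
    # (a random seed), so remap it to a fixed non-zero value to stay reproducible.
    if seed == 0:
        seed = 1
    return GlorotUniform(seed=seed)
```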

@scarlehoff removed the run-fit-bot label ("Starts fit bot from a PR.") on Mar 5, 2024
@APJansen (Collaborator, Author) left a comment

Thanks for taking care of this, I left some comments. (I can't approve since I opened the PR.)

Most importantly, have you checked that these try/excepts don't occur inside the training loop? I imagine this could cause a significant slowdown if you end up in the except branch.

Review threads (outdated, resolved) on:
n3fit/src/n3fit/backends/keras_backend/MetaModel.py
n3fit/src/n3fit/backends/keras_backend/callbacks.py
n3fit/src/n3fit/backends/keras_backend/operations.py
@scarlehoff (Member)

> have you checked that these try/excepts don't occur inside the training loop?

It should be compiled away after the first pass, but it's true, it might be better to have an if condition to generate the function, just in case...
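A sketch of the if-condition alternative (hypothetical module-level guard; the wrapped function is just an example), deciding once at import time instead of paying for a try/except in the call path:

```python
import keras

# Decide once, at import time, which implementation to generate.
KERAS_3 = int(keras.__version__.split(".")[0]) >= 3

if KERAS_3:
    def tensor_to_numpy(tensor):
        # keras 3: backend-agnostic conversion through keras.ops
        return keras.ops.convert_to_numpy(tensor)
else:
    def tensor_to_numpy(tensor):
        # tf.keras (keras 2): eager tensors expose .numpy() directly
        return tensor.numpy()
```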

@goord (Collaborator) commented Mar 5, 2024

> How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).

I would say it's not super-important. The restart was mainly there to get around cluster allocation time limits; reproducibility is more of a nice-to-have.

@scarlehoff (Member)

Nice! Then @APJansen, is it fine if I merge this?

(I'll rebase on top of the fk-refactor branch in case there are more 2.16 fixes to be done; better safe than sorry.)

@scarlehoff changed the base branch from master to fk-refactor on March 5, 2024 13:53
@scarlehoff changed the title from "Fix multidense compatibility with tf 2.16" to "Fix multidense compatibility with tf 2.16 and keras 3" on Mar 5, 2024
@APJansen (Collaborator, Author) left a comment

Yes, ok by me, apart from the one comment.

Review thread (outdated, resolved) on: n3fit/src/n3fit/stopping.py
@Cmurilochem (Collaborator) commented Mar 5, 2024

> Please @APJansen, when you have time, could you check that the changes I made are sensible?
>
> @Cmurilochem I had to change the scope of the hyperopt pickle test to check that, when starting from a pickle, the first new trial does give you the same parameters as if all trials had been run sequentially... however, the results after that first trial change. It's a sub-% change, but that's enough to change the following trials. I've noticed that tf (or maybe keras) has changed the behaviour of the special case seed=0 (which now basically means "get some random seed" and broke some of the multidense tests); I wonder whether this is related.
>
> How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here tbh).

I agree with @goord. But remember that at the time we added some constraints to guarantee this, namely:

rngs = [np.random.default_rng(seed=seed) for seed in seeds]
seeds = [generator.integers(1, pow(2, 30)) * k for generator in rngs]

But please feel free to change the scope of the test and go over this.

@scarlehoff (Member)

This problem seems to only appear for python 3.12 / tensorflow 2.16, so it might be safer to run hyperopt with <3.12 for the time being, just in case...

Base automatically changed from fk-refactor to master March 5, 2024 16:20
@scarlehoff mentioned this pull request on Mar 5, 2024
@scarlehoff (Member) left a comment

I don't think we can have the conda package yet, but at least we can start using the code in py3.12 (and maybe find other issues!)

APJansen and others added 4 commits March 5, 2024 20:08
kicking down the can

recover previous behaviour

try-except for 3.11

deal with type missmatch

make sure units are int

remove pdb

fix change in how weights are named

Update n3fit/src/n3fit/tests/test_hyperopt.py

0 is understood as None by initializer

change scope of hyperopt test

bugfix

312
@scarlehoff merged commit 0a5fc61 into master on Mar 5, 2024
9 checks passed
@scarlehoff deleted the fix-multidense-tf216 branch on March 5, 2024 23:12