
@scarlehoff (Member) commented Nov 14, 2025

Apparently the "new math" is just sloppiness at the level of the hardware instructions: https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/

Note that the above applies to GPUs (or TPUs), but newer CPUs might be affected by the same issue, in case we ever see it again:
https://www.intel.com/content/dam/develop/external/us/en/documents/lower-numerical-precision-deep-learning-jan2018-754765.pdf
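
For intuition on what "loss of precision" means here: TF32 keeps float32's 8-bit exponent but cuts the mantissa from 23 explicit bits down to 10 (the float16 mantissa width). Below is a minimal sketch that emulates this by masking mantissa bits in NumPy; real hardware rounds to nearest rather than truncating, so this only illustrates the order of magnitude of the error, roughly 1e-4 relative instead of float32's ~1e-7.

```python
import numpy as np

def emulate_tf32(x):
    """Emulate TF32 by zeroing the low 13 of float32's 23 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.float32(1.2345678)
print(x, emulate_tf32(x))                   # 1.2345678 -> 1.234375
print(float(abs(x - emulate_tf32(x)) / x))  # relative error of order 1e-4
```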

For context, here are the results. 251105-jcm-nnpdf41-mhou-001 is the new baseline, and the following are a few comparisons run on a machine affected by this problem:

a) Default options: TensorFloat-32 is used for some operations, with a resulting loss of precision (a quick way to check whether TF32 is actually active on a given machine is sketched after this list).
[Image: comparison report, 2025-11-14 14:14:30]

b) double_precision: true in n3fit. With the default type set to float64, only some operations (it is not clear to me which) are downgraded to these tensorfloat types, but everything looks ok.

[Image: comparison report, 2025-11-14 14:15:40]

c) And finally, a comparison to a fit in which TensorFloat-32 is disabled and the entire fit is done in single precision; results are ok.
[Image: comparison report, 2025-11-14 14:09:56]
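
As an aside on case a): a quick way to check whether a given machine is exposed to this is to ask TensorFlow directly. A minimal sketch using TensorFlow's public config API (the compute-capability loop is only a heuristic, since TF32 kicks in on Ampere-or-newer GPUs, i.e. compute capability 8.0+):

```python
import tensorflow as tf

# Whether TensorFlow is currently allowed to use TF32 for float32 matmuls
# and convolutions (on recent TF versions this defaults to True).
print("TF32 enabled:", tf.config.experimental.tensor_float_32_execution_enabled())

# TF32 is only used on GPUs with compute capability >= 8.0 (Ampere and newer),
# so it also helps to check what hardware is actually visible.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("compute_capability"))
```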

I'll benchmark cases b) and c) in CINECA to check which is faster before merging this branch. Or, if you prefer, we can merge it asap and, if during the benchmark we find that option b) is better (run by default in double precision but keep these special features on), we can change the default.

NB: I'm only disabling it in TensorFlow, although in principle this is a hardware thing. However, PyTorch is apparently more conservative about which operations have it on by default (perhaps that's why it is much slower...) and its results are actually ok. I haven't tested JAX.
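
For reference, the framework-level switches look roughly like the sketch below; the TensorFlow call is the public API for turning TF32 off (presumably equivalent to what this branch does), and the PyTorch lines are shown only as the analogue, not something this PR touches.

```python
import tensorflow as tf

# TensorFlow: disallow TF32 for float32 matmuls/convolutions on Ampere+ GPUs.
tf.config.experimental.enable_tensor_float_32_execution(False)

# PyTorch analogue, for comparison only:
# import torch
# torch.backends.cuda.matmul.allow_tf32 = False
# torch.backends.cudnn.allow_tf32 = False
```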

@Radonirinaunimi (Member) left a comment:

Tested and it works fine for me as well! Everything looks good.

@scarlehoff (Member, Author) commented:

Sadly it does make the runs about 40% slower in the cluster :(

I'll see whether I can find which operations need to be run with extra precision, but I'm afraid it might be the optimisation step itself
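
If only a few operations turn out to be sensitive, one option would be to leave TF32 on for the bulk of the network and promote just those operations to float64, since TF32 only replaces float32 computations. A toy sketch with made-up layer sizes (not n3fit's actual architecture), mixing per-layer dtypes in Keras:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,), dtype="float32")
# Bulk of the network stays in float32 and may run with TF32 on the GPU.
hidden = tf.keras.layers.Dense(25, activation="tanh")(inputs)
# A single sensitive operation promoted to float64 (Keras casts its input up).
output = tf.keras.layers.Dense(1, dtype="float64")(hidden)
model = tf.keras.Model(inputs, output)
```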

@tgiani (Contributor) commented Nov 17, 2025

@scarlehoff do you mean that b) makes runs 40% slower?

@scarlehoff (Member, Author) commented:

Running in Leonardo, 110 replicas, without MHOU:

a) 4180s
b) 5695s
c) 5364s

So the best option is strict float32, which seems to be as precise as float64 for us. Sadly, the nice NVIDIA magic by which they get to announce 8x speed-ups for machine learning in their presentations is, indeed, much faster, in exchange for the numbers being wrong 😅

But you get them wrong much faster!!!

@scarlehoff (Member, Author) commented:

I'm running a few more tests in case I can isolate some operations for which we can lower the precision, but I think I will merge this as is.

@scarlehoff force-pushed the bugfix_float32 branch 2 times, most recently from 22a3cf7 to 149d410 on November 17, 2025 at 14:01
@scarlehoff merged commit 9ffff73 into master on Nov 17, 2025
12 of 13 checks passed
@scarlehoff deleted the bugfix_float32 branch on November 17, 2025 at 16:33