
@scarlehoff (Member) commented Nov 14, 2025

Apparently the "new math" is just sloppiness at the level of the hardware instructions: https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/

Note that the above applies to GPUs (or TPUs), but newer CPUs might be affected by the same issue, in case we ever see it again:
https://www.intel.com/content/dam/develop/external/us/en/documents/lower-numerical-precision-deep-learning-jan2018-754765.pdf
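
For intuition on what "loss of precision" means here: TF32 keeps float32's 8-bit exponent but cuts the mantissa from 23 explicit bits down to 10 (the float16 mantissa width). Below is a minimal sketch that emulates this by masking mantissa bits in NumPy; real hardware rounds to nearest rather than truncating, so this only illustrates the order of magnitude of the error, roughly 1e-4 relative instead of float32's ~1e-7.

```python
import numpy as np

def emulate_tf32(x):
    """Emulate TF32 by zeroing the low 13 of float32's 23 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.float32(1.2345678)
print(x, emulate_tf32(x))                   # 1.2345678 -> 1.234375
print(float(abs(x - emulate_tf32(x)) / x))  # relative error of order 1e-4
```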

For context, here are the results. 251105-jcm-nnpdf41-mhou-001 is the new baseline, and the following are a few comparisons run on a machine affected by this problem:

a) Default options: TensorFloat-32 is used for some operations, with a resulting loss of precision (a quick way to check whether TF32 is actually active on a given machine is sketched after this list).
[Image: comparison report, 2025-11-14 14:14:30]

b) double_precision: true in n3fit. With the default type set to float64, only some operations (it is not clear to me which) are downgraded to these tensorfloat types, but everything looks ok.

[Image: comparison report, 2025-11-14 14:15:40]

c) And finally, a comparison to a fit in which TensorFloat-32 is disabled and the entire fit is done in single precision; results are ok.
[Image: comparison report, 2025-11-14 14:09:56]
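
As an aside on case a): a quick way to check whether a given machine is exposed to this is to ask TensorFlow directly. A minimal sketch using TensorFlow's public config API (the compute-capability loop is only a heuristic, since TF32 kicks in on Ampere-or-newer GPUs, i.e. compute capability 8.0+):

```python
import tensorflow as tf

# Whether TensorFlow is currently allowed to use TF32 for float32 matmuls
# and convolutions (on recent TF versions this defaults to True).
print("TF32 enabled:", tf.config.experimental.tensor_float_32_execution_enabled())

# TF32 is only used on GPUs with compute capability >= 8.0 (Ampere and newer),
# so it also helps to check what hardware is actually visible.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("compute_capability"))
```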

I'll benchmark cases b) and c) in CINECA to check which is faster before merging this branch. Or, if you prefer, we can merge it asap and, if during the benchmark we find that option b) is better (run by default in double precision but keep these special features on), we can change the default.

NB: I'm only disabling it in TensorFlow, although in principle this is a hardware thing. However, PyTorch is apparently more conservative about which operations have it on by default (perhaps that's why it is much slower...) and its results are actually ok. I haven't tested JAX.
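
For reference, the framework-level switches look roughly like the sketch below; the TensorFlow call is the public API for turning TF32 off (presumably equivalent to what this branch does), and the PyTorch lines are shown only as the analogue, not something this PR touches.

```python
import tensorflow as tf

# TensorFlow: disallow TF32 for float32 matmuls/convolutions on Ampere+ GPUs.
tf.config.experimental.enable_tensor_float_32_execution(False)

# PyTorch analogue, for comparison only:
# import torch
# torch.backends.cuda.matmul.allow_tf32 = False
# torch.backends.cudnn.allow_tf32 = False
```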

@Radonirinaunimi (Member) left a comment:

Tested and it works fine for me as well! Everything looks good.

@scarlehoff (Member, Author) commented:

Sadly it does make the runs about 40% slower in the cluster :(

I'll see whether I can find which operations need to be run with extra precision, but I'm afraid it might be the optimisation step itself
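
If only a few operations turn out to be sensitive, one option would be to leave TF32 on for the bulk of the network and promote just those operations to float64, since TF32 only replaces float32 computations. A toy sketch with made-up layer sizes (not n3fit's actual architecture), mixing per-layer dtypes in Keras:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,), dtype="float32")
# Bulk of the network stays in float32 and may run with TF32 on the GPU.
hidden = tf.keras.layers.Dense(25, activation="tanh")(inputs)
# A single sensitive operation promoted to float64 (Keras casts its input up).
output = tf.keras.layers.Dense(1, dtype="float64")(hidden)
model = tf.keras.Model(inputs, output)
```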

@tgiani (Contributor) commented Nov 17, 2025

@scarlehoff do you mean that b) makes runs 40% slower?

@scarlehoff (Member, Author) commented:

Running in Leonardo, 110 replicas, without MHOU:

a) 4180s
b) 5695s
c) 5364s

So the best option is strict float32, which seems to be as precise as float64 for us. Sadly, the nice NVIDIA magic by which they get to announce 8x speed-ups for machine learning in their presentations is, indeed, much faster, in exchange for the numbers being wrong 😅

But you get them wrong much faster!!!

@scarlehoff (Member, Author) commented:

I'm running a few more tests in case I can isolate some operations for which we can lower the precision, but I think I will merge this as is.

@scarlehoff force-pushed the bugfix_float32 branch 2 times, most recently from 22a3cf7 to 149d410 on November 17, 2025 at 14:01
@scarlehoff merged commit 9ffff73 into master on Nov 17, 2025
12 of 13 checks passed
@scarlehoff deleted the bugfix_float32 branch on November 17, 2025 at 16:33