🐞 Describe the Bug
Model comparison tests (aka run_test_script) have recently started failing at random with an excessive diff on the word_embeddings_weight gradients (>>>> [train_2] Excessive diff for tensor Global gradient: layers.0.word_embeddings_weight). In each case the diff is only slightly above the threshold. We need to investigate whether an actual bug or regression is behind this, or whether it is just random noise.
Example:
>>>> [train_2] Excessive diff for tensor Global gradient: layers.0.word_embeddings_weight:
* Max diff scaled = 0.15082430839538574 > 0.15 (scale=0.001214031595736742, unregularized=0.0006883841124363244)
Ref samples: 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 2.9449e-03
Test samples: 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 2.9182e-03
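
To help judge how marginal these failures are, here is a minimal sketch that parses the "Max diff scaled" line and reports how far the scaled diff exceeds the threshold. The helper name `parse_excess` and the regular expression are hypothetical and only assume the log format shown in the example above; they are not part of the test harness.

```python
import re

# Log line copied from the example above.
LOG_LINE = (
    "* Max diff scaled = 0.15082430839538574 > 0.15 "
    "(scale=0.001214031595736742, unregularized=0.0006883841124363244)"
)

# Hypothetical pattern, assuming the log format stays as shown above.
PATTERN = re.compile(
    r"Max diff scaled = (?P<scaled>[0-9.eE+-]+) > (?P<threshold>[0-9.eE+-]+) "
    r"\(scale=(?P<scale>[0-9.eE+-]+), unregularized=(?P<unregularized>[0-9.eE+-]+)\)"
)


def parse_excess(line):
    """Extract the logged values and compute the relative excess over the threshold."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a 'Max diff scaled' log entry")
    values = {name: float(text) for name, text in match.groupdict().items()}
    # Fraction by which the scaled diff exceeds the threshold (0.01 means 1% over).
    values["relative_excess"] = values["scaled"] / values["threshold"] - 1.0
    return values


if __name__ == "__main__":
    result = parse_excess(LOG_LINE)
    print(f"relative excess over threshold: {result['relative_excess']:.4%}")
```

For the example above, the scaled diff is roughly 0.55% over the threshold. Collecting this value across several failing runs could indicate whether the failures cluster just above the threshold (suggesting noise) or drift further away from it (suggesting a regression).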