Description
🐞 Describe the Bug
The generated tokens from Fast-LLM occasionally differ completely from the Hugging Face (HF) counterpart. HF consistently generates the same output, so the issue likely lies in Fast-LLM’s training or inference.
The test uses a small untrained model, trains it for several iterations, and saves it in HF format. The same model is then loaded by both HF Transformers and Fast-LLM. A generate call is made on random input data. Occasionally, Fast-LLM produces a different first token compared to HF.
So far, it has not happened on the officially supported image, but since the behavior appears to be random, we cannot be certain.
🔄 Steps to Reproduce
test_gpt_generate_and_forward.py:test_small_generate
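For reference, a minimal sketch of the kind of comparison the test performs. The Fast-LLM loading call is a hypothetical placeholder, not the actual API, and the checkpoint path and tensor sizes are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

torch.manual_seed(0)  # same seed for the dummy input data
input_ids = torch.randint(0, 1000, (1, 16))  # random prompt, batch size 1

# Load the same trained checkpoint in both frameworks.
hf_model = AutoModelForCausalLM.from_pretrained("/tmp/tiny_checkpoint", torch_dtype=torch.bfloat16)
fast_llm_model = load_fast_llm_model("/tmp/tiny_checkpoint")  # hypothetical loader, not the real Fast-LLM API

hf_out = hf_model.generate(input_ids, max_new_tokens=1, do_sample=False)
fl_out = fast_llm_model.generate(input_ids, max_new_tokens=1, do_sample=False)

# Occasionally the first generated token differs on the Fast-LLM side.
assert torch.equal(hf_out[:, -1], fl_out[:, -1]), "first generated token differs"
```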
Discussion about the issue:
The generate tests are passing most of the time on the dummy model, but occasionally one of them produces a completely different output. I'm using the same seed to generate the dummy input data, so input randomness shouldn't be the cause.
The issue is non-reproducible — for example, I can run the tests 20 times in a row without any failures, and then on the next run, one of the tests fails. After that, it usually passes again.
This happens only with fast_llm; the Hugging Face model produces consistent outputs in the same scenarios. It also occurs with batch size 1 and when running in BF16 mode. It might also be something related to my development environment — I currently have three unrelated Megatron tests failing as well.
What do you think — is it worth investigating the source of this non-determinism now, or should we leave it for later and just file an issue to track it?
Then I looked into the hidden states. The embeddings are exactly the same up to the default epsilon, but starting from the first transformer layer, the outputs begin to diverge slightly. Interestingly, the final normalized logits are almost identical again, which results in the same predicted token ID.
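One way to do this kind of layer-by-layer comparison, assuming Fast-LLM's HF-compatible wrapper accepts `output_hidden_states` like a regular Transformers model (if it doesn't, the same data can be captured with hooks instead), and reusing the names from the sketch above:

```python
import torch

# hf_model, fast_llm_model and input_ids as in the earlier sketch.
with torch.no_grad():
    hf_states = hf_model(input_ids, output_hidden_states=True).hidden_states
    fl_states = fast_llm_model(input_ids, output_hidden_states=True).hidden_states

for i, (a, b) in enumerate(zip(hf_states, fl_states)):
    diff = (a.float() - b.float()).abs().max().item()
    print(f"hidden state {i}: max abs diff = {diff:.3e}")  # index 0 is the embedding output
```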
We most certainly have different kernels for the same operations — that’s expected between fast_llm and HF implementations. But I’m unsure how to decide which differences are acceptable, and which ones might cause these occasional spikes in output divergence.
Right now, I don’t have a clear idea of how to systematically pinpoint the source of this instability.
The only reliable way I see right now is to save each intermediate tensor from both the HF implementation and ours, then run the code multiple times until we hit a failed test. Once we capture a divergence case, we can compare tensors step by step to see where the differences start to increase abruptly — that would likely point to the faulting operation or kernel.
What do you think? Would this be a reasonable next step, or is there a more efficient debugging approach we could try first?
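To make the tensor-dumping idea concrete, here is one hedged sketch of how intermediate outputs could be captured on both sides with forward hooks (module filtering and file paths are illustrative):

```python
import torch

def attach_dump_hooks(model, store):
    """Record every module's tensor output under its qualified name."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().float().cpu().clone()
        handles.append(module.register_forward_hook(hook))
    return handles

captured = {}
handles = attach_dump_hooks(hf_model, captured)  # repeat for the Fast-LLM model
with torch.no_grad():
    hf_model(input_ids)
for handle in handles:
    handle.remove()
torch.save(captured, "hf_intermediates.pt")  # diff against a dump from a failing run
```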
As we already discussed, a small difference between Fast-LLM and Hugging Face is perfectly normal up to the logits. Once we get to the token IDs there is no concept of small, so a small difference in the logits can change the predicted token, and then the whole sequence afterwards. So you're right to look at intermediate states; that's the only thing we can compare.
By the way, that means you can't really test the match between predicted sequences. All that's relevant is the predicted logits being close enough for the same input sequence (including generated tokens).
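In other words, a simplified check of that kind, here on the prompt only and with arbitrary placeholder tolerances that would need tuning for BF16:

```python
import torch

with torch.no_grad():
    hf_logits = hf_model(input_ids).logits
    fl_logits = fast_llm_model(input_ids).logits

# Compare logits for the same input sequence instead of comparing generated token IDs.
torch.testing.assert_close(fl_logits.float(), hf_logits.float(), rtol=1e-2, atol=1e-2)
```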
If I understand what you're describing, we should also be able to take Hugging Face out of the equation. The problem seems to be Fast-LLM generating different outputs for the exact same model and input, which would be a bug. So I recommend checking whether this happens by running the same thing multiple times and making sure the logits and hidden states stay exactly the same. The debug tensor logs might be useful for that.
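For example, a determinism check that never touches the HF model (again reusing the names from the first sketch):

```python
import torch

with torch.no_grad():
    logits_1 = fast_llm_model(input_ids).logits
    logits_2 = fast_llm_model(input_ids).logits

# If this fails, Fast-LLM produces different outputs for the same model and input,
# which matches the suspected bug.
assert torch.equal(logits_1, logits_2), "Fast-LLM forward pass is non-deterministic"
```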