Add RoPE to ConformerEncoder #15714
Open
MahmoudAshraf97 wants to merge 1 commit into
Co-authored-by: Copilot <copilot@github.com> Signed-off-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Adds RoPE as a positional encoding method for ConformerEncoder. This enables the use of torch SDPA while maintaining accuracy.
This is a least-friction implementation. There is still room for further improvements (such as flex attention and a fused QKV projection), but I preferred not to diverge from the other Conformer submodules.
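To illustrate why RoPE needs no additive attention bias, here is a minimal plain-Python sketch of the rotation. It is not the code in this PR: the function names, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. The point it shows is that once queries and keys are rotated by their positions, their dot product depends only on the position offset, so position information reaches the attention scores without an extra bias term.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of vec by pos * theta_i.

    Hypothetical helper: interleaved pairing and base=10000 are assumptions,
    not necessarily the convention used in this PR.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(score_a, score_b)  # the two scores agree up to floating-point error
```

Because the rotation is applied to the query and key tensors themselves, the attention call can run without a positional bias argument, which is the property the description above relies on.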
Collection: asr
Motivation
Since #9590, SDPA has caused loss explosion in training when relative positional embeddings (RelPos) are used. Investigation showed that SDPA has gradient issues when additive attention biases are used.
According to this paper, RoPE matches or exceeds the baseline performance of RelPos and trains faster because it can use SDPA.
Results
I tested two FastConformer medium models. Everything was held constant except the positional encoding method (RoPE vs. RelPos) and the attention implementation (SDPA vs. NeMo). Both models were randomly initialized, although I forgot to fix the seed.
One takeaway: when the two models were initialized from a pretrained checkpoint, except for the positional encoding parameters, RelPos converged faster. This suggests that the attention geometry of RoPE differs from that of RelPos and is not necessarily transferable.
Improvements
RoPE was around 6% faster end to end and used less memory, which allowed a higher fused batch size for the RNNT loss and better compute utilization. Training time was dominated by the RNNT loss calculation, which diminishes any benefit from SDPA or `torch.compile`.
RNNT loss compute time is data-dependent, which is why the step time is much lower at the beginning of the run, when the model has not learned anything yet. Once it starts to generate tokens, the time goes up.
Changelog
Usage
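As a rough illustration of the config change for this PR, the sketch below switches an encoder config to RoPE using a plain Python dict. Only the `self_attention_model` key and the `rope` value come from this PR; the surrounding keys, their values, the previous `rel_pos` setting, and the `with_rope` helper are assumptions for illustration, and a real NeMo config would normally live in YAML.

```python
import copy

# Hypothetical, trimmed-down encoder config. Only `self_attention_model`
# is taken from this PR; every other key and value is an assumption.
base_cfg = {
    "encoder": {
        "_target_": "nemo.collections.asr.modules.ConformerEncoder",
        "n_layers": 16,
        "d_model": 256,
        "self_attention_model": "rel_pos",
    }
}

def with_rope(cfg):
    """Return a copy of cfg whose encoder uses RoPE positional encoding."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg["encoder"]["self_attention_model"] = "rope"
    return new_cfg

rope_cfg = with_rope(base_cfg)
print(rope_cfg["encoder"]["self_attention_model"])
```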
Add `self_attention_model: rope` to the encoder config.
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
cc @pzelasko
Additional Information