base: main
Tongliu/router fusion #1883
Conversation
Force-pushed 60d0142 to b3c3633 (signed off by tongliu <tongliu@nvidia.com>; includes pre-commit.ci fixes)
Force-pushed 4377f92 to 752c351 (signed off by tongliu <tongliu@nvidia.com>)
tests/pytorch/test_fused_router.py (outdated)
```python
        expert_bias=expert_bias_clone,
    )

    assert torch.allclose(probs, probs_fused, atol=atol, rtol=rtol)
```
It would be nicer to replace `torch.allclose` with `torch.testing.assert_close`. It has a helpful error message and it automatically chooses the tolerances based on the dtype.
Suggested change:
```diff
-assert torch.allclose(probs, probs_fused, atol=atol, rtol=rtol)
+torch.testing.assert_close(probs, probs_fused)
```
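For reference, `torch.testing.assert_close` picks its default tolerances from the dtype and reports exactly where the tensors diverge. A minimal standalone sketch (hypothetical values, not taken from this PR's tests):

```python
import torch

actual = torch.tensor([1.0, 2.0, 3.001], dtype=torch.float32)
expected = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)

try:
    # Tolerances default per dtype (float32: rtol=1.3e-6, atol=1e-5),
    # so callers do not need to hand-pick atol/rtol for each test.
    torch.testing.assert_close(actual, expected)
except AssertionError as err:
    # Unlike a bare `assert torch.allclose(...)`, the error message lists
    # the number of mismatched elements and the greatest abs/rel difference.
    print(err)
```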
```python
@pytest.mark.parametrize("dtype", [torch.float32])
@pytest.mark.parametrize("num_tokens", [2048, 7168, 32111])
@pytest.mark.parametrize("num_experts", [128, 32])
@pytest.mark.parametrize("topk", [4, 8])
@pytest.mark.parametrize("group_topk", [None, 4])
@pytest.mark.parametrize("scaling_factor", [None, 1.2])
@pytest.mark.parametrize("enable_bias", [True, False])
def test_topk_sigmoid(
```
How long does this test suite take to run? The number of test cases grows very quickly if you have one test with many parameters (O(2^n) cases), so it may be better to split it up into multiple tests with only a few parameters (O(2*n) cases).
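To make the reviewer's point concrete, a hypothetical restructuring (the helper name and default values are illustrative, not from this PR): pin a default configuration and vary one axis per test, so case counts add instead of multiply.

```python
import pytest
import torch

# Hypothetical shared defaults; one axis varies per test, so this split runs
# 3 + 4 + 2 = 9 cases instead of the 1*3*2*2*2*2*2 = 96 combinations above.
DEFAULTS = dict(dtype=torch.float32, num_tokens=2048, num_experts=128,
                topk=4, group_topk=None, scaling_factor=None, enable_bias=False)

def run_topk_sigmoid_case(**kwargs):
    """Hypothetical helper holding the fused-vs-reference comparison that
    currently lives in the body of test_topk_sigmoid."""

@pytest.mark.parametrize("num_tokens", [2048, 7168, 32111])
def test_topk_sigmoid_num_tokens(num_tokens):
    run_topk_sigmoid_case(**{**DEFAULTS, "num_tokens": num_tokens})

@pytest.mark.parametrize("topk, group_topk", [(4, None), (4, 4), (8, None), (8, 4)])
def test_topk_sigmoid_grouping(topk, group_topk):
    run_topk_sigmoid_case(**{**DEFAULTS, "topk": topk, "group_topk": group_topk})

@pytest.mark.parametrize("enable_bias", [True, False])
def test_topk_sigmoid_bias(enable_bias):
    run_topk_sigmoid_case(**{**DEFAULTS, "enable_bias": enable_bias})
```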
```cpp
 *  \param[in]  intermediate_output  Intermediate output from the forward pass (softmax/sigmoid output).
 *  \param[in]  stream               CUDA stream used for the operation.
 */
void nvte_fused_scores_for_aux_loss_forward(const NVTETensor logits, int num_tokens,
```
What is the naming convention for this loss function in other MoE implementations? It feels like `aux_loss` is too general and it might get confusing if some other multi-objective training method becomes popular in the future. Maybe something like `moe_aux_loss` would be more specific.
/te-ci pytorch
Force-pushed d3197e0 to baed11c (signed off by tongliu <tongliu@nvidia.com>; includes pre-commit.ci fixes)
Description
Provides the functions used in the router fusion and the corresponding unit tests. All three parts include both the forward and backward passes.
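For orientation, a minimal unfused PyTorch sketch of the topk-sigmoid routing path that `test_topk_sigmoid` exercises. The exact semantics are defined by the fused kernels in this PR; the bias/scaling handling below is an assumption inferred from the test parameters, and some routers additionally renormalize the selected probabilities.

```python
import torch

def topk_sigmoid_reference(logits, topk, expert_bias=None, scaling_factor=None):
    # Assumed semantics, not the fused kernel: sigmoid scores, an optional
    # additive expert bias used only for top-k selection, and an optional
    # rescaling of the selected probabilities.
    scores = torch.sigmoid(logits)                     # [num_tokens, num_experts]
    select = scores + expert_bias if expert_bias is not None else scores
    _, idx = torch.topk(select, topk, dim=-1)
    # Keep the original (un-biased) scores of the selected experts; zero the rest.
    probs = torch.zeros_like(scores).scatter_(-1, idx, scores.gather(-1, idx))
    if scaling_factor is not None:
        probs = probs * scaling_factor
    return probs
```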
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: