feat: add support for fp32 kernels #56
Conversation
…eir output tensor
Opened an issue with reproduction code here: triton-lang/triton#674
# Conflicts:
#	optimizer/linear.py
#	test/test_linear_layer.py
Measurements done from this branch (with autocast):

Measurements done on main (no autocast, full fp16):

For the record, model in full fp16 on the autocast branch (so autocast not called):

This shows that the code is as fast as on main when inference is not run under the autocast context manager.

@gaetansnl there was a bug in the ONNX Runtime part: on main it takes the baseline model without setting fp16 to False, so it was running in full fp16, which doesn't work. In this branch there is no such flag and the ONNX model is in mixed precision, with the weights in fp16 and the model under autocast.
Mostly minor, but there are a few things I'm not sure I understand.
return out


def layer_norm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float, implementation: JITFunction = _layer_norm_fwd_fused_single_pass):
Why doesn't this one work like the other? "output" is missing.
Maybe it's the opposite? Why does attention have an output field? IMO it should be removed, it was just forgotten.
For attention we create the output outside the function and provide the tensor. The kernel is marked to convert all provided tensors to fp16, which in mixed precision includes the output. We should move it inside the function to avoid this unneeded casting.
Layernorm and linear layer don't have this issue: they create the output tensor with the right type from the beginning.
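A minimal sketch of the two patterns being discussed (the function bodies are plain PyTorch stand-ins, not the actual Triton kernels in this repo):

```python
import torch

def layer_norm_fwd(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float) -> torch.Tensor:
    # Output allocated inside the wrapper with the input's dtype from the start,
    # so no extra cast is ever needed (the layernorm / linear layer pattern).
    out = torch.empty_like(x)
    out.copy_(torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias, eps))
    return out

def attention_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
    # Output allocated by the caller and passed in; if the kernel is marked to
    # cast every provided tensor to fp16, this pre-allocated output gets cast
    # too, which is the unneeded conversion mentioned above.
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    output.copy_(scores @ v)
    return output
```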
output is outside because it lets the calling code control allocations; not sure if that's still useful.
return outputs


def linear_layer(x: torch.Tensor,
Same here? Why don't we have "output"?
see layernorm answer
test/test_attention.py
Outdated
@pytest.mark.parametrize("batch", [1, 8, 32, 64])
@pytest.mark.parametrize("implementation", ["torch", "triton_original", "triton"])
def test_benchmark(benchmark, batch, implementation):
    torch.manual_seed(0)
Should be moved to the beginning of the function to avoid mistakes, IMO.
Not sure I understand what you're referring to?
torch.manual_seed(0)
Moved it into an annotation.
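For reference, a sketch of what such an annotation could look like (the decorator name is hypothetical, the actual one in the PR may differ):

```python
import functools
import torch

def set_seed(func):
    # Hypothetical decorator: seeds the RNG right before the test body runs,
    # so torch.manual_seed(0) no longer sits at an arbitrary point inside it.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        torch.manual_seed(0)
        return func(*args, **kwargs)
    return wrapper
```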
test/test_linear_layer.py
Outdated
@@ -61,7 +53,7 @@ def test_benchmark(benchmark, shape: Shape, bias: bool, activation: str, contigu
     else:
         raise ValueError(f"Unknown activation: {activation}")

-    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=torch.float16)
+    torch_linear_layer = torch.nn.Linear(K, N, bias=bias, device="cuda", dtype=dtype)
     torch_linear_layer.weight.data = layer_weight

     def torch_linear_activation(x):
Same here, you don't compare to fp32?
changed, ref is now fp32
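Roughly what the fp32 reference comparison can look like (a sketch; shapes and tolerances are illustrative and a CUDA device is assumed):

```python
import torch

x = torch.randn(8, 64, device="cuda")
layer_fp32 = torch.nn.Linear(64, 64, device="cuda", dtype=torch.float32)
layer_fp16 = torch.nn.Linear(64, 64, device="cuda", dtype=torch.float16)
# Share the same weights so the only difference is the compute precision.
layer_fp16.weight.data = layer_fp32.weight.data.half()
layer_fp16.bias.data = layer_fp32.bias.data.half()

expected = layer_fp32(x)               # fp32 reference
actual = layer_fp16(x.half()).float()  # fp16 candidate, upcast for comparison
torch.testing.assert_close(actual, expected, atol=1e-2, rtol=1e-2)
```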
test/test_torchdynamo_bert.py
Outdated
        (32, 16), (32, 128), (32, 256),
    ], ids=lambda x: f"{x[0]}x{x[1]}")
@pytest.mark.parametrize("shape", [(bs, seq_l) for bs in [1, 8, 32] for seq_l in [16, 128, 256, 384, 512]
                                   if bs * seq_l < 10000], ids=lambda x: f"{x[0]}x{x[1]}")
@pytest.mark.parametrize("implementation", implementations.keys())
def test_benchmark_implementations(benchmark, model_reference_fp32, shape: (int, int), implementation: str):
    torch.manual_seed(0)
Why do we need the assert below?
which assert?
assert implementation in implementations, f"unknown implementation: {implementation}"
it has been removed
FYI, tests pass
return attention_forward_original(*args, **kwargs)


implementations = {
@pommedeterresautee Maybe we could use another config style to remove this from the global scope (https://docs.pytest.org/en/6.2.x/example/parametrize.html#paramexamples). It makes things really hard to read IMO, and it will get worse as we add tests.
Not sure it's harder to read (it's local and the same pattern is used in all tests except one not yet refactored, batched matmul), but I share your point that we want to control the number of implementations to test (probably at least light and full flags), and doing it through the pytest command line is certainly the best way. The same is true for the number of shapes/batch sizes to test.
For that reason, that part of the code would be moved outside the global context, but it's not yet clear to me what it should look like.
It makes me think this should be done in a dedicated PR, are you ok with that? If so, I'll write the issue.
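A sketch of one possible shape for that refactor, following the pytest docs linked above (option and implementation names are illustrative, not what the dedicated PR will necessarily do):

```python
# conftest.py
def pytest_addoption(parser):
    parser.addoption("--implementations", action="store", default="all",
                     help="comma-separated list of implementations to benchmark, or 'all'")

def pytest_generate_tests(metafunc):
    # Parametrize any test that declares an "implementation" argument,
    # so the dict no longer has to live in the test module's global scope.
    if "implementation" in metafunc.fixturenames:
        available = ["torch", "triton_original", "triton"]
        selected = metafunc.config.getoption("implementations")
        names = available if selected == "all" else selected.split(",")
        metafunc.parametrize("implementation", names)
```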
fix #39
fix #44
behavior of autocast: https://h-huang.github.io/tutorials/advanced/dispatcher.html#autocast + https://pytorch.org/docs/stable/amp.html
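For illustration, a minimal example of the autocast behavior described in those links (shapes are arbitrary, a CUDA device is assumed):

```python
import torch

model = torch.nn.Linear(64, 64, device="cuda")  # weights stay fp32
x = torch.randn(8, 64, device="cuda")           # input stays fp32

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)

# Matmul-heavy ops are dispatched to their fp16 kernels inside the context,
# so the output comes back in fp16 even though weights and input are fp32.
print(out.dtype)  # torch.float16
```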