
FP8 mixed precision via NVIDIA's Transformer Engine #17172

Closed
carmocca opened this issue Mar 22, 2023 · 6 comments · Fixed by #17597 or #18459
Labels: fabric, feature, performance, pl, plugin
Milestone: future

carmocca (Member) commented Mar 22, 2023

Description & Motivation

Support FP8 mixed precision via NVIDIA's Transformer Engine: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

Pitch

Write a precision plugin using the library above that is enabled via:

  • `precision="transformer-engine"` — see the sketch below
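
For reference, the core of what such a plugin would wrap is small. A minimal sketch against Transformer Engine's documented API (the recipe settings and layer sizes here are illustrative, not a proposed default):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 "delayed scaling" recipe from the Transformer Engine user guide;
# HYBRID uses E4M3 for forward tensors and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

# Transformer Engine ships its own layer implementations; FP8 execution
# requires dimensions that are multiples of 16.
layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(32, 768, device="cuda")

# Forward passes inside this context run the layer in FP8.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```

The plugin would own the recipe and the context manager, so user code only passes the `precision` flag.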

Alternatives

Don't implement this until it's vendored by PyTorch, if that ever happens.


cc @Borda @carmocca @justusschock @awaelchli

carmocca added the feature, plugin, and performance labels on Mar 22, 2023
carmocca added this to the future milestone on Mar 22, 2023
carmocca added the fabric and pl labels on Mar 22, 2023
awaelchli (Member) commented May 10, 2023

> The library only requires enabling an autocast context manager

There is one more thing: the user needs to replace their layers with the custom ones from the library. What's the plan here? Will the plugin implement the `module_init_context()` manager? On the other hand, one might not want to replace all layers. If this is left to the user, there is a lot less value in adding the plugin.
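
To make the concern concrete, this is roughly the manual rewrite a user would otherwise have to do by hand (a sketch; `te.Linear` and `te.LayerNorm` are Transformer Engine's drop-in counterparts to the `torch.nn` layers):

```python
import torch.nn as nn
import transformer_engine.pytorch as te

# Plain PyTorch definition ...
mlp = nn.Sequential(
    nn.LayerNorm(768),
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

# ... versus what the user must write today to benefit from FP8:
te_mlp = nn.Sequential(
    te.LayerNorm(768),
    te.Linear(768, 3072),
    nn.GELU(),
    te.Linear(3072, 768),
)
```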

carmocca (Member Author) commented May 10, 2023

Yes, we'll need to implement a replacement mechanism. The plugin can have a flag to disable it if necessary.

This also means that we'll have it in Fabric first, as these APIs do not exist in the trainer yet.

carmocca (Member Author) commented:

Actually, `convert_module` might be a better fit than `init_context` if we prefer replacing existing layers over patching the `torch.nn` classes.
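
A `convert_module`-style hook could do the swap recursively on an already-instantiated model. A hypothetical sketch (the helper name and weight-copy details are assumptions, not the merged implementation):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def _replace_linears(module: nn.Module) -> nn.Module:
    # Recursively swap nn.Linear for te.Linear, carrying weights over.
    # A real implementation would also cover LayerNorm, respect FP8's
    # dims-divisible-by-16 constraint, and handle device/dtype placement.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            replacement = te.Linear(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            with torch.no_grad():
                replacement.weight.copy_(child.weight)
                if child.bias is not None:
                    replacement.bias.copy_(child.bias)
            setattr(module, name, replacement)
        else:
            _replace_linears(child)
    return module
```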

nanand2 commented Jun 19, 2023

Any update on support for this?

carmocca (Member Author) commented:

@nanand2 Our access to H100s is very limited, so we haven't merged this yet. However, the branch https://github.com/Lightning-AI/lightning/tree/carmocca/transformer-engine should be usable if you want to play with it right now.

nanand2 commented Jun 19, 2023

Great, thanks!
