Add plugins documentation #2207

Conversation
t-vi left a comment
Supergood to have those documented. Thank you!
> This plugin applies the necessary transforms to bucket and synchronize gradients across
> multiple processes, using a specified process group for communication.
> See https://github.com/pytorch/pytorch/blob/v2.7.0/torch/nn/parallel/distributed.py#L326 for more details.
How about cross-referencing DDP instead of referencing the PyTorch docs of a specific version?
e.g. `` :class:`~torch.nn.parallel.distributed.DistributedDataParallel` `` would work
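
For illustration, the suggested cross-reference would make the docstring read roughly like this (a sketch; the class name and surrounding wording are paraphrased from the quoted snippet, not copied from the PR):

```python
class DDP:
    """Applies the transforms needed to bucket and synchronize gradients across
    multiple processes, using a specified process group for communication.

    See :class:`~torch.nn.parallel.distributed.DistributedDataParallel` for the
    corresponding PyTorch implementation.
    """
```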
> Args:
>   bucket_size_in_mb: float, default 25.0
>     Size in megabytes of the gradient bucket in DDP.
>   broadcast_from: int | None, default None
>     Global rank ID to broadcast model parameters from at initialization. If None, no explicit broadcast is performed.
>   process_group: Optional[ProcessGroup], default is the current default process group
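
To make the arguments concrete, here is a minimal usage sketch. The `thunder.plugins` import path and the `plugins=` argument to `thunder.jit` are assumptions rather than something stated in this thread; the parameter names come from the docstring above.

```python
import torch
import torch.distributed as dist

import thunder
from thunder.plugins import DDP  # assumed import path

# Launched via torchrun, so the default process group can be initialized.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()

# Bucket gradients in 25 MB chunks and broadcast initial parameters from rank 0.
jitted_model = thunder.jit(
    model,
    plugins=[DDP(bucket_size_in_mb=25.0, broadcast_from=0)],
)
```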
> Args:
>   device: torch.device | None, default None
>     Device on which to place sharded modules. If None, modules remain on their existing devices.
>   broadcast_from: int | None, default None
>     Global rank ID to broadcast parameters from before sharding. If None, no broadcast is performed.
>   sharding_strategy: FSDPType, default FSDPType.ZERO2
>     Strategy for parameter sharding (e.g., ZERO2 for sharding both parameters and optimizer state).
>   bucketing_strategy: FSDPBucketingStrategy, default FSDPBucketingStrategy.NONE
>     Bucketing strategy to use when saving or loading FSDP checkpoints.
>   move_state_dict_to_cpu: bool, default False
>     Whether to move the state dict parameters to CPU after serialization to reduce GPU memory usage.
>   ddp_bucket_size_in_mb: float, default 25.0
>     Bucket size in megabytes for the DDP transform when used in a combined mesh with FSDP.
>   process_group: Optional[ProcessGroup or DeviceMesh], default is the current default process group
>     The process group or device mesh to use for distributed communication. If None, uses the default process group.
same. I think we can skip type annotations and default values
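
For comparison, a hedged sketch of the FSDP plugin's arguments in use. `FSDPType` and `FSDPBucketingStrategy` are taken from the docstring above, but their import path, the `thunder.plugins` module, and the `plugins=` argument to `thunder.jit` are assumptions:

```python
import torch
import torch.distributed as dist

import thunder
from thunder.distributed import FSDPBucketingStrategy, FSDPType  # assumed import path
from thunder.plugins import FSDP  # assumed import path

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
device = torch.device("cuda", local_rank)

model = torch.nn.Linear(4096, 4096)

# ZeRO-2 style sharding, no bucketing, parameters broadcast from rank 0 before sharding.
jitted_model = thunder.jit(
    model,
    plugins=[
        FSDP(
            device=device,
            broadcast_from=0,
            sharding_strategy=FSDPType.ZERO2,
            bucketing_strategy=FSDPBucketingStrategy.NONE,
        )
    ],
)
```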
| """ | ||
| Plugin for enabling FP8 precision via NVIDIA Transformer Engine, enabling higher throughput of matrix operations in FP8. | ||
| See `lightning-thunder/thunder/executors/transformer_engineex.py` for implementation details. |
Can we reference this file with a relative path?
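
A short sketch of enabling the FP8 plugin; the import path and the `plugins=` argument are again assumed, and NVIDIA Transformer Engine plus FP8-capable hardware are required for the plugin to have any effect:

```python
import torch

import thunder
from thunder.plugins import FP8  # assumed import path

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).to(device="cuda", dtype=torch.bfloat16)

# Matmul-heavy layers are lowered to Transformer Engine FP8 kernels where supported.
jitted_model = thunder.jit(model, plugins=[FP8()])
out = jitted_model(torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16))
```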
> model weights, reducing memory footprint and improving
> throughput for both training and inference.
> See https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py#L889 for more details.
it'd be better if this link were a permalink
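
And a matching sketch for the int4 quantization plugin. The class name is taken from the PR description, but the import path, the no-argument constructor, and the `plugins=` usage are assumptions; bitsandbytes must be installed for the quantization to apply:

```python
import torch

import thunder
from thunder.plugins import QuantizeInt4  # assumed import path

model = torch.nn.Linear(4096, 4096).to(device="cuda", dtype=torch.bfloat16)

# Linear weights are quantized to 4 bits via bitsandbytes, shrinking the memory footprint.
jitted_model = thunder.jit(model, plugins=[QuantizeInt4()])
out = jitted_model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
```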

Before submitting
What does this PR do?
This PR helps first-time users understand plugins better by adding documentation for the DDP, FSDP, QuantizeInt4, FP8 and ReduceOverhead plugins.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @lantiga