
Conversation


@KaelanDt KaelanDt commented Jun 10, 2025

Before submitting
  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This PR helps first-time users understand plugins better by adding documentation for the DDP, FSDP, QuantizeInt4, FP8, and ReduceOverhead plugins.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @lantiga

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Jun 10, 2025

@t-vi t-vi left a comment

Supergood to have those documented. Thank you!

@t-vi t-vi merged commit 9656af4 into main Jun 10, 2025
49 checks passed
@t-vi t-vi deleted the kaelan/plugins-docstrings branch June 10, 2025 14:03
This plugin applies the necessary transforms to bucket and synchronize gradients across
multiple processes, using a specified process group for communication.
See https://github.com/pytorch/pytorch/blob/v2.7.0/torch/nn/parallel/distributed.py#L326 for more details.

How about cross-referencing DDP instead of referencing the PyTorch docs for a specific version?

e.g. ``:class:`~torch.nn.parallel.distributed.DistributedDataParallel` `` would work
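
For illustration, here is how the quoted description might read with the versioned URL replaced by the suggested cross-reference (a sketch only, not the wording that was merged):

```python
"""
This plugin applies the necessary transforms to bucket and synchronize gradients across
multiple processes, using a specified process group for communication.

See :class:`~torch.nn.parallel.distributed.DistributedDataParallel` for more details.
"""
```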

Comment on lines +22 to +27
Args:
bucket_size_in_mb: float, default 25.0
Size in megabytes of the gradient bucket in DDP.
broadcast_from: int | None, default None
Global rank ID to broadcast model parameters from at initialization. If None, no explicit broadcast is performed.
process_group: Optional[ProcessGroup], default is the current default process group

Type annotations and default values can be omitted; see the screenshot:
[screenshot omitted]
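
Concretely, the suggestion would leave descriptions only in the Args section, roughly like the sketch below (wording taken from the quoted snippet; the description of process_group falls outside the quoted range and is not shown):

```python
"""
Args:
    bucket_size_in_mb: Size in megabytes of the gradient bucket in DDP.
    broadcast_from: Global rank ID to broadcast model parameters from at
        initialization. If None, no explicit broadcast is performed.
"""
```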

Comment on lines +69 to +83
Args:
device: torch.device | None, default None
Device on which to place sharded modules. If None, modules remain on their existing devices.
broadcast_from: int | None, default None
Global rank ID to broadcast parameters from before sharding. If None, no broadcast is performed.
sharding_strategy: FSDPType, default FSDPType.ZERO2
Strategy for parameter sharding (e.g., ZERO2 for sharding both parameters and optimizer state).
bucketing_strategy: FSDPBucketingStrategy, default FSDPBucketingStrategy.NONE
Bucketing strategy to use when saving or loading FSDP checkpoints.
move_state_dict_to_cpu: bool, default False
Whether to move the state dict parameters to CPU after serialization to reduce GPU memory usage.
ddp_bucket_size_in_mb: float, default 25.0
Bucket size in megabytes for the DDP transform when used in a combined mesh with FSDP.
process_group: Optional[ProcessGroup or DeviceMesh], default is the current default process group
The process group or device mesh to use for distributed communication. If None, uses the default process group.

same. I think we can skip type annotations and default values
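
As a quick orientation for first-time users, a hypothetical usage sketch of the FSDP plugin follows. The argument names come from the quoted docstring; the `thunder.plugins.FSDP` import path and the `plugins=` argument of `thunder.jit` are assumptions, not something this PR confirms:

```python
import torch
import thunder

# Hypothetical import path; the plugin class may live elsewhere in the package.
from thunder.plugins import FSDP

model = torch.nn.Linear(1024, 1024)

# Argument names follow the quoted docstring; the values shown are its documented defaults.
plugin = FSDP(
    broadcast_from=None,         # no parameter broadcast before sharding
    move_state_dict_to_cpu=False,
    ddp_bucket_size_in_mb=25.0,  # DDP bucket size when FSDP and DDP share a device mesh
)

# Assumed entry point: passing plugin instances to thunder.jit via `plugins=`.
jitted_model = thunder.jit(model, plugins=[plugin])
```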

"""
Plugin for enabling FP8 precision via NVIDIA Transformer Engine, enabling higher throughput of matrix operations in FP8.
See `lightning-thunder/thunder/executors/transformer_engineex.py` for implementation details.

Can we reference this file with a relative path?
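
For context, the Transformer Engine mechanism this executor builds on looks roughly like the snippet below; this is plain TE usage for illustration, not the plugin's implementation:

```python
import torch
import transformer_engine.pytorch as te

# A TE module whose matmuls can run in FP8 (dimensions should be FP8-friendly, e.g. multiples of 16).
layer = te.Linear(768, 768).cuda()
x = torch.randn(16, 768, device="cuda")

# Inside this context, supported matrix operations execute in FP8 with TE's default scaling recipe.
with te.fp8_autocast(enabled=True):
    y = layer(x)
```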

model weights, reducing memory footprint and improving
throughput for both training and inference.
See https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py#L889 for more details.

It'd be better if this link were a permalink.
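
For orientation, the linked bitsandbytes code is presumably the 4-bit block-wise quantization path; here is a minimal, illustrative sketch (the exact function and quant_type used by the plugin are assumptions):

```python
import torch
import bitsandbytes.functional as F

# Quantize a weight matrix to packed 4-bit blocks; quant_state holds what is
# needed to dequantize (per-block absmax, block size, quant type).
weight = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
quantized, quant_state = F.quantize_4bit(weight, quant_type="nf4")

# Dequantize back to fp16 for computation; the packed tensor stores two 4-bit
# values per uint8 element, so it has half as many elements as the original.
restored = F.dequantize_4bit(quantized, quant_state)
```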


Labels

documentation (Improvements or additions to documentation), lightning-l1


6 participants