
[PyTorch] Recipe heuristics for initializing quantized weights#1827

Closed
timmoon10 wants to merge 5 commits into NVIDIA:main from timmoon10:quantized-param-heuristics

Conversation

@timmoon10
Collaborator

Description

When initializing a model with quantized weights, the required data differs between training and inference: training requires row-wise data for the forward GEMM and column-wise data for the dgrad GEMM, while inference only requires row-wise data. This PR adds a heuristic option to the quantization recipes, with support for "performance" and "inference". In the future, we may want to consider a "memory" heuristic that prioritizes memory usage over performance.

This is similar to the heuristic API in #1300.
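As a rough sketch of the intended behavior (the `Recipe` dataclass and `initial_usages` helper below are illustrative assumptions, not the actual Transformer Engine API), the heuristic would drive which quantized-data layouts get allocated when a weight is first created:

```python
from dataclasses import dataclass


# Hypothetical sketch: a recipe carries a heuristic hint that controls
# which usages are allocated when a quantized weight is initialized.
@dataclass
class Recipe:
    heuristic: str = "performance"  # options: "performance", "inference"


def initial_usages(recipe: Recipe, requires_grad: bool = True) -> dict:
    """Decide which quantized-data layouts to allocate up front."""
    if recipe.heuristic == "inference" or not requires_grad:
        # Forward GEMM only: row-wise data suffices.
        return {"rowwise": True, "columnwise": False}
    # Training: forward GEMM needs row-wise, dgrad GEMM needs column-wise.
    return {"rowwise": True, "columnwise": True}
```

A future "memory" heuristic would slot in as another branch here, trading re-quantization cost for a smaller footprint.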

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add recipe heuristics for initializing quantized weights

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 requested review from ksivaman and ptrendx May 27, 2025 22:16
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10 timmoon10 mentioned this pull request May 29, 2025
Comment on lines +1262 to +1267

```python
update_rowwise_usage = True if quantizer.rowwise_usage else None
update_columnwise_usage = True if quantizer.columnwise_usage else None
tensor.update_usage(
    rowwise_usage=update_rowwise_usage,
    columnwise_usage=update_columnwise_usage,
)
```
Collaborator Author

@timmoon10 timmoon10 May 29, 2025
Destroying unnecessary usages was causing problems when alternating between training steps (column-wise data needed) and validation steps (column-wise data not needed). See #1832 (comment).
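A toy model of the failure mode (the class below is a hypothetical stand-in for TE's quantized tensors, not the real implementation): if the validation step destroyed the column-wise data it does not need, the next training step would have to re-quantize. Treating `None` as "leave unchanged" avoids that:

```python
class ToyQuantizedTensor:
    """Hypothetical stand-in for a quantized weight with two data layouts."""

    def __init__(self):
        self.rowwise_data = "rowwise-bytes"
        self.columnwise_data = "columnwise-bytes"

    def update_usage(self, rowwise_usage=None, columnwise_usage=None):
        # True requests the layout; None leaves it untouched.
        # Crucially, nothing is freed when a usage is simply not requested.
        if rowwise_usage and self.rowwise_data is None:
            self.rowwise_data = "rowwise-bytes"  # would re-quantize here
        if columnwise_usage and self.columnwise_data is None:
            self.columnwise_data = "columnwise-bytes"


tensor = ToyQuantizedTensor()
tensor.update_usage(rowwise_usage=True, columnwise_usage=True)  # training step
tensor.update_usage(rowwise_usage=True, columnwise_usage=None)  # validation step
assert tensor.columnwise_data is not None  # survives for the next training step
```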

Member
TBH, this issue exists only because the optimizer is not doing the right job with quantizing. If we made it use the quantizer, we would not need this part at all.

Collaborator Author

@timmoon10 timmoon10 May 30, 2025

The layers will configure the quantizer to avoid unnecessary allocations:

```python
# Configure quantizer
if weight_quantizer is not None:
    columnwise_usage = is_grad_enabled and inp.requires_grad
    if not columnwise_usage:
        columnwise_usage = (
            is_fp8_activation_recompute_enabled()
            and not in_fp8_activation_recompute_phase()
        )
    weight_quantizer.set_usage(rowwise=True, columnwise=columnwise_usage)
```

This is what we want when allocating new buffers, but is overly aggressive when dealing with an existing QuantizedTensor. We could remove this logic from get_weight_workspace, but I don't like how it would ignore the configuration within the quantizer.
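The distinction can be sketched with hypothetical stand-ins (`ToyQuantizer`, `ToyTensor`, and `prepare_weight` below are illustrative, not TE's actual `get_weight_workspace`): the quantizer is configured for exactly what the current step needs, while an existing tensor only ever gains usages and never loses them:

```python
class ToyQuantizer:
    """Hypothetical quantizer carrying usage hints."""

    def __init__(self):
        self.rowwise_usage = True
        self.columnwise_usage = True

    def set_usage(self, rowwise, columnwise):
        self.rowwise_usage, self.columnwise_usage = rowwise, columnwise


class ToyTensor:
    """Hypothetical quantized weight tracking which layouts it holds."""

    def __init__(self):
        self.usages = {"rowwise": True, "columnwise": True}

    def update_usage(self, rowwise_usage=None, columnwise_usage=None):
        # None means "leave this usage unchanged".
        if rowwise_usage is not None:
            self.usages["rowwise"] = rowwise_usage
        if columnwise_usage is not None:
            self.usages["columnwise"] = columnwise_usage


def prepare_weight(quantizer, tensor, is_grad_enabled, requires_grad):
    # Configure what *this* step needs (fine for freshly allocated buffers).
    quantizer.set_usage(rowwise=True,
                        columnwise=is_grad_enabled and requires_grad)
    # For an existing tensor, only enable usages, never drop them.
    tensor.update_usage(
        rowwise_usage=True if quantizer.rowwise_usage else None,
        columnwise_usage=True if quantizer.columnwise_usage else None,
    )


q, t = ToyQuantizer(), ToyTensor()
prepare_weight(q, t, is_grad_enabled=False, requires_grad=True)  # validation step
# The existing column-wise data survives even though this step did not request it.
```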

@timmoon10
Collaborator Author

/te-ci pytorch

```python
"""Configuration for quantization scheme."""

# Recipe-specific heuristics (options: "performance", "inference")
heuristic: str = "performance"
```
Member

The name is not the best - wouldn't you want performance during inference?

Collaborator Author

@timmoon10 timmoon10 May 30, 2025

Not necessarily if you're memory constrained.

Perhaps a naming scheme like "training_performance", "inference_performance", "training_memory", "inference_memory" would be more precise?


```python
# Quantize to FP8
assert self._quantizer is not None, "Can't quantize without a quantizer"
self._quantizer.internal = False
```
Member

This makes me think that internal should maybe be an option to tex.quantize rather than a member of the quantizer.

Collaborator Author

I have mixed opinions.

  • The layers know which tensors can be internal tensors and which must be PyTorch tensors. internal seems like a usage hint just like whether it needs row-wise/column-wise data.
  • We override internal multiple times, enough to make it feel redundant. These are usually in special cases outside a layer's normal operation (when primary weights are quantized, when setting tensor.data).

Maybe tex.quantize should have an option to force internal=False, but otherwise respect the quantizer's config? This seems a little overcomplicated though.

@timmoon10
Collaborator Author

#1847 handles the main issue this PR was addressing: unnecessary memory usage when initializing quantized weights for use in inference. This API is more general though, and we may revisit it in the future for specialized use cases, e.g. when it is worth sacrificing performance for reduced memory usage.

@timmoon10 timmoon10 closed this Jun 13, 2025
@timmoon10 timmoon10 mentioned this pull request Jun 16, 2025