[PyTorch] Recipe heuristics for initializing quantized weights #1827
timmoon10 wants to merge 5 commits into NVIDIA:main from
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
Signed-off-by: Tim Moon <tmoon@nvidia.com>
```python
update_rowwise_usage = True if quantizer.rowwise_usage else None
update_columnwise_usage = True if quantizer.columnwise_usage else None
tensor.update_usage(
    rowwise_usage=update_rowwise_usage,
    columnwise_usage=update_columnwise_usage,
)
```
Destroying unnecessary usages was causing problems when alternating between training steps (column-wise data needed) and validation steps (column-wise data not needed). See #1832 (comment).
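To make the failure mode concrete, here is a minimal toy sketch (not the TE implementation; all class and function names are illustrative) of why eagerly destroying an unneeded usage breaks a later training step:

```python
# Toy model of a quantized tensor that tracks row-wise data (forward GEMM)
# and column-wise data (dgrad GEMM). Illustrative only, not TE's API.
class ToyQuantizedTensor:
    def __init__(self):
        self.rowwise_data = "rowwise"        # needed by forward GEMM
        self.columnwise_data = "columnwise"  # needed by dgrad GEMM

    def update_usage(self, rowwise_usage=None, columnwise_usage=None):
        # None means "leave this usage alone"; False destroys the data.
        if rowwise_usage is False:
            self.rowwise_data = None
        if columnwise_usage is False:
            self.columnwise_data = None

def validation_step(tensor, destroy_unneeded):
    # Validation only runs the forward GEMM.
    if destroy_unneeded:
        tensor.update_usage(columnwise_usage=False)  # too aggressive
    return tensor.rowwise_data is not None

def training_step(tensor):
    # Training needs both forward and dgrad GEMMs.
    return tensor.rowwise_data is not None and tensor.columnwise_data is not None

t = ToyQuantizedTensor()
assert training_step(t)                           # training works
assert validation_step(t, destroy_unneeded=True)  # validation works, but...
assert not training_step(t)                       # ...column-wise data is gone
```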
TBH this issue only exists because the optimizer is not doing the right job with quantizing. If we made it use the quantizer, we would not need this part at all.
The layers will configure the quantizer to avoid unnecessary allocations:
TransformerEngine/transformer_engine/pytorch/module/linear.py, lines 225 to 233 in 855fa65
This is what we want when allocating new buffers, but it is overly aggressive when dealing with an existing QuantizedTensor. We could remove this logic from get_weight_workspace, but I don't like how it would ignore the configuration within the quantizer.
/te-ci pytorch
```python
"""Configuration for quantization scheme."""

# Recipe-specific heuristics (options: "performance", "inference")
heuristic: str = "performance"
```
The name is not the best - wouldn't you want performance during inference?
Not necessarily if you're memory constrained.
Perhaps a naming scheme like `"training_performance"`, `"inference_performance"`, `"training_memory"`, `"inference_memory"` would be more precise?
```python
# Quantize to FP8
assert self._quantizer is not None, "Can't quantize without a quantizer"
self._quantizer.internal = False
```
This makes me think that `internal` should maybe be an option to `tex.quantize` rather than a member of the quantizer.
I have mixed opinions.
- The layers know which tensors can be internal tensors and which must be PyTorch tensors. `internal` seems like a usage hint, just like whether a tensor needs row-wise/column-wise data.
- We override `internal` multiple times, enough to make it feel redundant. These are usually special cases outside a layer's normal operation (when primary weights are quantized, when setting `tensor.data`).

Maybe `tex.quantize` should have an option to force `internal=False`, but otherwise respect the quantizer's config? This seems a little overcomplicated though.
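A minimal sketch of the semantics being proposed here, with toy stand-ins for the quantizer and `tex.quantize` (all names are illustrative, not TE's actual API):

```python
# Hypothetical sketch: a quantize() call that defers to the quantizer's
# `internal` hint unless the caller forces an external (PyTorch-visible)
# tensor, e.g. when setting tensor.data. Not TE's real interface.
from dataclasses import dataclass

@dataclass
class ToyQuantizer:
    internal: bool = True  # usage hint, like row-wise/column-wise usage

def quantize(data, quantizer, force_external=False):
    # Respect the quantizer's config unless the caller needs a
    # non-internal tensor for this specific call.
    internal = quantizer.internal and not force_external
    return {"data": data, "internal": internal}

q = ToyQuantizer()
assert quantize([1.0], q)["internal"] is True
assert quantize([1.0], q, force_external=True)["internal"] is False
```

The point of the sketch is that the per-call override leaves the quantizer's own configuration untouched, instead of layers repeatedly mutating `quantizer.internal`.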
#1847 handles the main issue this PR was addressing: unnecessary memory usage when initializing quantized weights for use in inference. This API is more general though, and we may revisit it in the future for specialized use-cases, e.g. when it is worth sacrificing performance for reduced memory usage.
Description
When initializing a model with quantized weights, the required data differs between training and inference: training requires row-wise data for the forward GEMM and column-wise data for the dgrad GEMM, while inference only requires row-wise data. This PR adds a `heuristic` option to the quantization recipes, with support for `"performance"` and `"inference"`. In the future, we may want to consider a `"memory"` heuristic that prioritizes memory usage over performance. This is similar to the heuristic API in #1300.
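As a rough sketch of how such a recipe-level option could drive weight allocation (the recipe class and helper below are illustrative; only the `"performance"`/`"inference"` values come from this PR):

```python
# Toy recipe carrying the `heuristic` option, and a helper that picks
# which weight usages to allocate. Illustrative names, not TE's API.
from dataclasses import dataclass

@dataclass
class ToyRecipe:
    heuristic: str = "performance"

def weight_usages(recipe):
    if recipe.heuristic == "inference":
        # Only the forward GEMM runs, so skip the dgrad (column-wise) data.
        return {"rowwise": True, "columnwise": False}
    # "performance": allocate everything training needs up front.
    return {"rowwise": True, "columnwise": True}

assert weight_usages(ToyRecipe())["columnwise"] is True
assert weight_usages(ToyRecipe(heuristic="inference"))["columnwise"] is False
```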
Type of change
Changes
Checklist: