Background
In the current wrapper-based PTQ implementation, the attention mask fill value (used for masked positions before softmax) is hardcoded as -120 across multiple modules.
This value is applied in:
- Causal mask template construction
- Conversion from boolean/int masks to additive masks
- Clamp operations for numerical stability
The same constant appears in multiple places:
- QuantLlamaDecoderLayer
- QuantLlamaModel
- QuantLlamaAttention
- ...
While -120 has empirically shown reasonable performance, it is still a heuristic and may not be optimal under quantization.
The attention mask fill value directly affects:
- Softmax suppression strength for masked positions
- Numerical range before quantization / fake-quant
- Interaction with observer statistics in PTQ
Therefore, this should be treated as a tunable hyperparameter, not a fixed constant.
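To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes visible logits sit at 0, an 8-bit uniform quantizer, and a min/max range that actually spans the fill value; whether the observers in the wrappers see the fill value depends on where they are placed, so treat the numbers as illustrative only. Both helper names are made up for this example.

```python
import math

def masked_softmax_weight(fill_value: float, n_visible: int = 1, n_masked: int = 1) -> float:
    """Total softmax probability leaking to masked positions (visible logits assumed to be 0)."""
    masked = n_masked * math.exp(fill_value)
    visible = n_visible * math.exp(0.0)
    return masked / (masked + visible)

def uniform_quant_step(fill_value: float, max_logit: float = 0.0, bits: int = 8) -> float:
    """Step size of a uniform quantizer whose range spans [fill_value, max_logit]."""
    return (max_logit - fill_value) / (2 ** bits - 1)

# A more negative fill value suppresses masked positions harder,
# but also stretches the range the quantizer has to cover.
for fill in (-40.0, -120.0, -240.0):
    print(fill, masked_softmax_weight(fill), uniform_quant_step(fill))
```

Under these assumptions the leakage is already negligible at -40, while the quantization step grows linearly with the magnitude of the fill value, which is one reason a smaller magnitude might behave better under PTQ.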
Related: #642 (comment)
Proposal
1. Make Mask Fill Value Configurable
Introduce a global configuration parameter:
attention_mask_fill_value: float = -120.0
This should be added to:
- PTQConfig (preferred), or
- a shared config used across Llama wrappers
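A minimal sketch of what the field could look like. The class below is a hypothetical stand-in, not the real PTQConfig (whose other fields are omitted here), and the sign check is an added assumption rather than part of the proposal.

```python
from dataclasses import dataclass

@dataclass
class PTQConfig:  # hypothetical stand-in; the real PTQConfig has other fields
    # Additive value written to masked attention positions before softmax.
    attention_mask_fill_value: float = -120.0

    def __post_init__(self) -> None:
        # Assumption: a non-negative fill value would not mask anything,
        # so reject it early instead of failing silently later.
        if not self.attention_mask_fill_value < 0:
            raise ValueError("attention_mask_fill_value must be negative")
```

Wrappers would then read `config.attention_mask_fill_value` instead of a literal, so a single override such as `PTQConfig(attention_mask_fill_value=-80.0)` reaches every module.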
2. Apply Consistently Across Codebase
Replace all hardcoded -120 usages with the configurable value in:
- Causal mask template creation
- Boolean/int mask → additive mask conversion
- Clamp operations (torch.clamp(..., min=...))
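The three call sites above could look roughly like the following. This is a sketch, not the existing wrapper code: the function names are hypothetical, and it assumes the boolean/int mask uses the "nonzero means attend" convention.

```python
import torch

def build_causal_mask(seq_len: int, fill_value: float, dtype=torch.float32) -> torch.Tensor:
    """Additive causal mask: 0 on and below the diagonal, fill_value above it."""
    mask = torch.full((seq_len, seq_len), fill_value, dtype=dtype)
    return torch.triu(mask, diagonal=1)

def to_additive_mask(keep: torch.Tensor, fill_value: float, dtype=torch.float32) -> torch.Tensor:
    """Convert a boolean/int mask (assumed: nonzero = attend) to an additive mask."""
    zeros = torch.zeros(keep.shape, dtype=dtype)
    return zeros.masked_fill(keep == 0, fill_value)

def combine_masks(causal: torch.Tensor, padding: torch.Tensor, fill_value: float) -> torch.Tensor:
    """Sum the masks, then clamp so doubly masked positions do not drop below fill_value."""
    return torch.clamp(causal + padding, min=fill_value)
```

Passing `fill_value` explicitly (taken from the config) keeps all three sites in sync; a position masked by both the causal and the padding mask would otherwise reach twice the fill value before the clamp.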
Ablation Study Plan
1. Candidate Values (coarse search)
[-40, -60, -80, -100, -120, -160, -240]
2. Evaluation Metrics
- Perplexity (PPL)
- lm-eval tasks (e.g., truthfulqa)
- Generation sanity check
- (Optional) Separate evaluation for:
3. Follow-up
- Narrow down promising range
- Run finer-grained search if needed
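The coarse-then-fine search could be driven by a small harness like the one below. It is only a sketch: `evaluate` is a placeholder for whatever metric run is chosen (PPL, an lm-eval task score, ...), lower is assumed to be better, and `refine_around` is a hypothetical helper that simply interpolates between the coarse neighbours of the best value.

```python
from typing import Callable, Dict, List

COARSE_CANDIDATES: List[float] = [-40.0, -60.0, -80.0, -100.0, -120.0, -160.0, -240.0]

def sweep(candidates: List[float], evaluate: Callable[[float], float]) -> Dict[float, float]:
    """Run the (user-supplied) evaluation once per candidate fill value."""
    return {value: evaluate(value) for value in candidates}

def refine_around(best: float, candidates: List[float], points: int = 5) -> List[float]:
    """Evenly spaced values between the coarse neighbours of the best candidate."""
    ordered = sorted(candidates)
    i = ordered.index(best)
    lo = ordered[max(i - 1, 0)]
    hi = ordered[min(i + 1, len(ordered) - 1)]
    step = (hi - lo) / (points - 1)
    return [lo + k * step for k in range(points)]

# Usage (evaluate_ppl is a placeholder to be provided by the experiment):
#   results = sweep(COARSE_CANDIDATES, evaluate_ppl)
#   best = min(results, key=results.get)
#   fine_candidates = refine_around(best, COARSE_CANDIDATES)
```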
Expected Outcome
- Identify optimal mask fill value for PTQ setting
- Improve robustness against quantization artifacts
- Remove hidden dependency on hardcoded constants
- Enable future tuning per model / backend
Notes
- Default value should remain -120.0 for backward compatibility
- Ensure export paths (if any) also use the same configuration
- This parameter should be treated as a system-level PTQ hyperparameter
TODO
- Add attention_mask_fill_value to config
- Replace hardcoded -120 usages