[quantization] Make Attention Mask Fill Value Configurable & Run Ablation Study #643

@mhs4670go

Background

In the current wrapper-based PTQ implementation, the attention mask fill value (used for masked positions before softmax) is hardcoded as -120 across multiple modules.

This value is applied in:

  • Causal mask template construction
  • Conversion from boolean/int masks to additive masks
  • Clamp operations for numerical stability
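The three uses listed above can be sketched without a tensor library so the arithmetic stays visible. This is an illustrative sketch only: the function names are hypothetical, plain Python lists stand in for tensors, and the "truthy means attend" convention is an assumption rather than something confirmed by the wrappers.

```python
from typing import List

MASK_FILL_VALUE = -120.0  # the constant the issue says is currently hardcoded


def causal_mask_template(seq_len: int, fill: float = MASK_FILL_VALUE) -> List[List[float]]:
    """Additive causal mask: 0.0 where key j <= query i, `fill` elsewhere."""
    return [[0.0 if j <= i else fill for j in range(seq_len)] for i in range(seq_len)]


def to_additive_mask(keep: List[List[int]], fill: float = MASK_FILL_VALUE) -> List[List[float]]:
    """Convert a boolean/int mask (assumed: truthy = attend) to an additive mask."""
    return [[0.0 if k else fill for k in row] for row in keep]


def clamp_min(mask: List[List[float]], fill: float = MASK_FILL_VALUE) -> List[List[float]]:
    """Floor every entry at `fill`, mirroring torch.clamp(..., min=fill)."""
    return [[max(v, fill) for v in row] for row in mask]


# Adding a causal mask and a padding mask can push an entry to 2 * fill;
# the clamp brings it back to the configured floor.
causal = causal_mask_template(2)
padding = to_additive_mask([[1, 0], [1, 0]])
combined = [[c + p for c, p in zip(cr, pr)] for cr, pr in zip(causal, padding)]
clamped = clamp_min(combined)
```

If the masks really are summed somewhere in the wrappers, this may be one reason the clamp exists, since it keeps the pre-softmax range bounded by a single value.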

The same constant appears in multiple places:

  • QuantLlamaDecoderLayer
  • QuantLlamaModel
  • QuantLlamaAttention
  • ...

While -120 has empirically shown reasonable performance, it is still a heuristic and may not be optimal under quantization.

The attention mask fill value directly affects:

  • Softmax suppression strength for masked positions
  • Numerical range before quantization / fake-quant
  • Interaction with observer statistics in PTQ

Therefore, this should be treated as a tunable hyperparameter, not a fixed constant.
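The trade-off can be made concrete with a small calculation. The sketch below assumes a masked position's pre-softmax score equals the fill value and that the range `[fill, max_score]` is covered by a uniform quantizer. The `max_score=10.0` and `bits=8` defaults are illustrative assumptions, not values taken from the implementation, and the real observers may behave differently.

```python
import math
from typing import Sequence


def masked_softmax_weight(fill: float, visible_scores: Sequence[float] = (0.0,)) -> float:
    """Softmax probability assigned to one masked position whose score is `fill`."""
    top = max(max(visible_scores), fill)  # subtract the max for numerical stability
    visible = sum(math.exp(s - top) for s in visible_scores)
    masked = math.exp(fill - top)
    return masked / (visible + masked)


def quant_step(fill: float, max_score: float = 10.0, bits: int = 8) -> float:
    """Step size of a uniform quantizer covering [fill, max_score]."""
    return (max_score - fill) / (2 ** bits - 1)


for fill in (-40.0, -120.0, -240.0):
    # A more negative fill suppresses masked positions harder,
    # but also widens the range and coarsens the quantization step.
    print(fill, masked_softmax_weight(fill), quant_step(fill))
```

Under these assumptions, the leaked probability is already negligible at -40, while the step size keeps growing as the fill value becomes more negative. That tension is what the ablation is meant to measure.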

Related: #642 (comment)

Proposal

1. Make Mask Fill Value Configurable

Introduce a global configuration parameter:

attention_mask_fill_value: float = -120.0

This should be added to:

  • PTQConfig (preferred), or
  • a shared config used across Llama wrappers
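A minimal sketch of how the field might be threaded through, assuming a dataclass-style config. `SketchPTQConfig` and `SketchQuantAttention` are hypothetical stand-ins, not the real `PTQConfig` or `QuantLlamaAttention`, and the negativity check is an added assumption rather than part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class SketchPTQConfig:
    """Hypothetical stand-in for the real config; only the new field is shown."""

    attention_mask_fill_value: float = -120.0  # default matches the current constant

    def __post_init__(self) -> None:
        # Assumption: a non-negative (or NaN) fill value cannot suppress masked positions.
        if not self.attention_mask_fill_value < 0:
            raise ValueError("attention_mask_fill_value must be negative")


class SketchQuantAttention:
    """Hypothetical wrapper that reads the fill value from config instead of a literal."""

    def __init__(self, config: SketchPTQConfig) -> None:
        self.mask_fill_value = float(config.attention_mask_fill_value)


default_layer = SketchQuantAttention(SketchPTQConfig())
tuned_layer = SketchQuantAttention(SketchPTQConfig(attention_mask_fill_value=-60.0))
```

Reading the value once in each wrapper's constructor keeps the decoder layer, model, and attention modules in agreement, provided they all receive the same config object.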

2. Apply Consistently Across Codebase

Replace all hardcoded -120 usages with the configurable value in:

  • Causal mask template creation
  • Boolean/int mask → additive mask conversion
  • Clamp operations (torch.clamp(..., min=...))

Ablation Study Plan

1. Candidate Values (coarse search)

[-40, -60, -80, -100, -120, -160, -240]

2. Evaluation Metrics

  • Perplexity (PPL)
  • lm-eval tasks (e.g., truthfulqa)
  • Generation sanity check
  • (Optional) Separate evaluation for:
    • Prefill
    • Decode

3. Follow-up

  • Narrow down promising range
  • Run finer-grained search if needed
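The coarse-then-fine plan could be driven by a small harness like the one below. It is a sketch under assumptions: `evaluate` is a caller-supplied function (hypothetical) that quantizes the model with the given fill value and returns a lower-is-better metric such as perplexity, and the refinement rule (evenly spaced points between the best candidate's neighbors) is one possible choice, not something the plan specifies.

```python
from typing import Callable, Dict, List, Sequence

COARSE_CANDIDATES: List[float] = [-40.0, -60.0, -80.0, -100.0, -120.0, -160.0, -240.0]


def run_coarse_ablation(
    evaluate: Callable[[float], float],
    candidates: Sequence[float] = COARSE_CANDIDATES,
) -> Dict[float, float]:
    """Evaluate every candidate fill value; returns {fill_value: metric}."""
    return {float(v): evaluate(float(v)) for v in candidates}


def refine_range(results: Dict[float, float], points: int = 5) -> List[float]:
    """Evenly spaced values between the neighbors of the best coarse candidate.

    Assumes a lower metric is better and points >= 2.
    """
    ordered = sorted(results)  # ascending fill values
    best = min(ordered, key=lambda v: results[v])
    i = ordered.index(best)
    lo = ordered[max(i - 1, 0)]
    hi = ordered[min(i + 1, len(ordered) - 1)]
    step = (hi - lo) / (points - 1)
    return [lo + k * step for k in range(points)]
```

Separate prefill and decode evaluation would fit the same shape by passing two different `evaluate` callables and comparing the resulting dictionaries.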

Expected Outcome

  • Identify the best-performing mask fill value for the PTQ setting
  • Improve robustness against quantization artifacts
  • Remove hidden dependency on hardcoded constants
  • Enable future tuning per model / backend

Notes

  • Default value should remain -120.0 for backward compatibility
  • Ensure export paths (if any) also use the same configuration
  • This parameter should be treated as a system-level PTQ hyperparameter

TODO

  • Add attention_mask_fill_value to config
  • Refactor all hardcoded -120 usages
  • Verify no leftover constants exist
  • Run coarse ablation
  • Analyze results and refine range
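For the "verify no leftover constants" item, a simple text scan may be enough. The sketch below is a heuristic, not a complete check: the regular expression and the function name are hypothetical, it strips everything after `#` (so a `#` inside a string literal would confuse it), and it will not catch the value if it is spelled some other way, such as `-1.2e2`.

```python
import re
from typing import List, Tuple

# Matches -120, -120. and -120.0, but not -1200, -120.5, x-120 or 1e-120.
HARDCODED_FILL = re.compile(r"(?<![\w.])-120(?:\.0*)?(?![\w.])")


def find_hardcoded_fill(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for each line containing a literal -120."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        if HARDCODED_FILL.search(code):
            hits.append((lineno, line.strip()))
    return hits
```

To scan a source tree, the function can be applied to the text of each file found with `pathlib.Path.rglob("*.py")`. The config default itself is an expected hit and can be ignored.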
