[quantization] Make Attention Mask Fill Value Configurable & Run Ablation Study #643

## Background

In the current wrapper-based PTQ implementation, the attention mask fill value (used for masked positions before softmax) is **hardcoded as `-120`** across multiple modules.

This value is applied in:
- Causal mask template construction
- Conversion from boolean/int masks to additive masks
- Clamp operations for numerical stability

The same constant appears in multiple places:
- `QuantLlamaDecoderLayer`
- `QuantLlamaModel`
- `QuantLlamaAttention`
- ...

While `-120` has empirically shown reasonable performance, it is still a **heuristic** and may not be optimal under quantization.

The attention mask fill value directly affects:
- Softmax suppression strength for masked positions
- Numerical range before quantization / fake-quant
- Interaction with observer statistics in PTQ

Therefore, this should be treated as a **tunable hyperparameter**, not a fixed constant.
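
To make the trade-off in the list above concrete, here is a minimal, framework-free sketch. It assumes that masked logits end up roughly equal to the fill value, and that the pre-softmax tensor is observed by a plain min/max observer with 8-bit affine quantization. Both are simplifying assumptions for illustration; the observers actually used in this codebase may behave differently, and the helper names are hypothetical.

```python
import math


def masked_softmax_mass(visible_scores, fill_value, num_masked):
    """Fraction of softmax probability that leaks onto masked positions.

    Assumption: every masked logit equals `fill_value`.
    """
    logits = list(visible_scores) + [fill_value] * num_masked
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return sum(exps[len(visible_scores):]) / sum(exps)


def minmax_scale(range_min, range_max, num_bits=8):
    """Quantization step size of a hypothetical min/max affine observer."""
    return (range_max - range_min) / (2 ** num_bits - 1)


if __name__ == "__main__":
    visible = [2.0, 0.5, -1.0]
    for fill in (-40.0, -120.0, -240.0):
        leak = masked_softmax_mass(visible, fill, num_masked=4)
        # The observed range must stretch down to the fill value.
        step = minmax_scale(fill, max(visible))
        print(f"fill={fill:>7}: leaked mass={leak:.3e}, step size={step:.4f}")
```

A more negative fill value suppresses masked positions more strongly, but it also widens the range the observer has to cover, which coarsens the step size available to the unmasked scores.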

Related: https://github.com/Samsung/TICO/pull/642#discussion_r3109837986

## Proposal

### 1. Make Mask Fill Value Configurable

Introduce a global configuration parameter:

```python
attention_mask_fill_value: float = -120.0
```

This should be added to:
- `PTQConfig` (preferred), or
- a shared config used across Llama wrappers
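
As a rough sketch of what the config field could look like, the dataclass below is a hypothetical stand-in, not the real `PTQConfig`, which has other fields and may be structured differently. The negativity check is an added assumption, not something this issue requires.

```python
from dataclasses import dataclass


@dataclass
class PTQConfigSketch:
    """Hypothetical stand-in for PTQConfig, showing only the new field."""

    # Additive value written to masked attention positions before softmax.
    # The default keeps the current hardcoded behavior.
    attention_mask_fill_value: float = -120.0

    def __post_init__(self):
        # Assumed sanity check: a non-negative fill value would not mask anything.
        if not self.attention_mask_fill_value < 0:
            raise ValueError("attention_mask_fill_value must be negative")


default_cfg = PTQConfigSketch()
ablation_cfg = PTQConfigSketch(attention_mask_fill_value=-80.0)
```

Wrappers would then read the value from the config they already receive instead of referencing a module-level constant.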

### 2. Apply Consistently Across Codebase

Replace all hardcoded `-120` usages with the configurable value in:

- Causal mask template creation
- Boolean/int mask → additive mask conversion
- Clamp operations (`torch.clamp(..., min=...)`)
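
The three usage sites above could share a small set of helpers that take the fill value as an argument. The sketch below is illustrative only: the function names are hypothetical, and it assumes the common convention that a nonzero / `True` mask entry means "attend".

```python
import torch


def causal_mask_template(seq_len: int, fill_value: float,
                         dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Additive causal mask: 0 on and below the diagonal, fill_value above it."""
    mask = torch.full((seq_len, seq_len), fill_value, dtype=dtype)
    return torch.triu(mask, diagonal=1)


def to_additive_mask(mask: torch.Tensor, fill_value: float,
                     dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Convert a boolean/int mask (nonzero = attend, assumed) to an additive mask."""
    keep = mask.to(torch.bool)
    additive = torch.zeros(mask.shape, dtype=dtype)
    return additive.masked_fill(~keep, fill_value)


def clamp_to_fill(additive_mask: torch.Tensor, fill_value: float) -> torch.Tensor:
    """Keep summed masks from going below the configured fill value."""
    return torch.clamp(additive_mask, min=fill_value)


fill = -120.0  # would come from the config in practice
causal = causal_mask_template(3, fill)
padding = to_additive_mask(torch.tensor([[1, 1, 0]]), fill)
# A position that is both future and padded sums to 2 * fill before clamping.
combined = clamp_to_fill(causal + padding, fill)
```

Routing all three sites through one argument makes it harder for the template, the conversion, and the clamp to drift apart during the ablation.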

## Ablation Study Plan

### 1. Candidate Values (coarse search)

```
[-40, -60, -80, -100, -120, -160, -240]
```

### 2. Evaluation Metrics

- Perplexity (PPL)
- lm-eval tasks (e.g., truthfulqa)
- Generation sanity check
- (Optional) Separate evaluation for:
  - Prefill
  - Decode

### 3. Follow-up

- Narrow down promising range
- Run finer-grained search if needed
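
The coarse-then-fine procedure could be driven by a small harness like the sketch below. The `evaluate` callable is a placeholder for whatever quantizes the model with a given fill value and returns a lower-is-better metric such as PPL; it and the function names are assumptions, not existing APIs.

```python
from typing import Callable, Dict, List, Sequence

COARSE_CANDIDATES = [-40, -60, -80, -100, -120, -160, -240]


def coarse_sweep(candidates: Sequence[float],
                 evaluate: Callable[[float], float]) -> Dict[float, float]:
    """Evaluate every candidate fill value; returns {fill_value: metric}."""
    return {float(v): evaluate(float(v)) for v in candidates}


def refine_candidates(results: Dict[float, float], num: int = 5) -> List[float]:
    """Evenly spaced grid between the neighbours of the best coarse value.

    Assumes a lower metric is better (e.g. perplexity).
    """
    values = sorted(results)
    best = min(values, key=lambda v: results[v])
    i = values.index(best)
    lo = values[max(i - 1, 0)]
    hi = values[min(i + 1, len(values) - 1)]
    step = (hi - lo) / (num - 1)
    return [lo + k * step for k in range(num)]
```

For example, if the coarse sweep favoured `-100`, the finer grid would span `-120` to `-80`. Tasks where higher is better (most lm-eval accuracies) would need the metric negated or the selection flipped.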

## Expected Outcome

- Identify optimal mask fill value for PTQ setting
- Improve robustness against quantization artifacts
- Remove hidden dependency on hardcoded constants
- Enable future tuning per model / backend

## Notes

- Default value should remain `-120.0` for backward compatibility
- Ensure export paths (if any) also use the same configuration
- This parameter should be treated as a **system-level PTQ hyperparameter**

## TODO

- [x] Add `attention_mask_fill_value` to config
- [x] Refactor all hardcoded `-120` usages
- [ ] Verify no leftover constants exist
- [ ] Run coarse ablation
- [ ] Analyze results and refine range
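
For the "verify no leftover constants" item, one possible approach is an AST scan for the literal, sketched below with the standard library only. The function name is hypothetical, and the scan only catches numeric literals written as `-120` / `-120.0`, not values built up indirectly.

```python
import ast
from typing import List


def find_hardcoded_fill(source: str, magnitude: float = 120.0) -> List[int]:
    """Return line numbers where the literal -magnitude appears in the source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Python parses `-120` as UnaryOp(USub, Constant(120)).
        if (isinstance(node, ast.UnaryOp)
                and isinstance(node.op, ast.USub)
                and isinstance(node.operand, ast.Constant)
                and isinstance(node.operand.value, (int, float))
                and node.operand.value == magnitude):
            hits.append(node.lineno)
    return sorted(hits)


sample = "x = -120\ny = torch.clamp(m, min=-120.0)\nz = 120\n"
leftovers = find_hardcoded_fill(sample)
```

Run over each wrapper file, the only expected remaining hit would be the default value in the config definition itself.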
