[quantization] Calibration Dataset Selection & Sampling Strategy

## 1. Dataset Candidates

We limit the candidates to datasets that are **practical, widely used, and easy to integrate**.

### 1.1 LLM (LLaMA)

#### (A) Baseline
- WikiText-2
  - Purpose: baseline comparison
  - Characteristics: plain natural language

#### (B) Instruction / Chat
- Alpaca (Stanford Alpaca)
  - Instruction-following format
  - Single-turn prompt-response pairs

- ShareGPT (filtered subset)
  - Multi-turn conversations
  - Closer to real chat usage

#### (C) Structured Tasks
- FLAN-style mixtures (optional subset)
  - Includes QA, reasoning, summarization
  - Good for diverse activation patterns

---

### 1.2 VLM (Qwen3-VL)

#### (A) Text-only baseline
- Same as the LLM candidates (WikiText-2 / Alpaca subset)

#### (B) Vision-Language
- COCO Captions
  - Image → caption mapping
  - Simple, stable baseline

- VQAv2
  - Image + question → answer
  - Introduces cross-modal attention

- Instruction-style VLM data (if available)
  - Image-grounded instructions
  - Closest to real usage

## 2. Sampling Strategy

### 2.1 Number of Samples

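Recommended:
- 512 to 2048 samples in total

Guideline:
- Too few samples → unstable calibration (the estimated ranges depend heavily on which samples happen to be drawn)
- Too many samples → diminishing returns relative to the added calibration time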

### 2.2 Sequence Length

We should match **deployment characteristics**:

| Scenario            | Strategy                          |
|--------------------|----------------------------------|
| Chat / QA          | 128 ~ 512 tokens                 |
| Long context       | include some 1K+ samples         |
| Mixed              | stratified by length             |
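
To stratify by length, each tokenized sample first needs a bucket label. The following is a minimal sketch: the bucket edges (128, 512, 1024 tokens) are an assumption loosely derived from the table above, and `length_bucket` / `bucket_histogram` are hypothetical helper names, not part of any existing pipeline.

```python
from bisect import bisect_right
from collections import Counter

# Assumed bucket edges in tokens (not fixed by this issue; tune per deployment).
BUCKET_EDGES = [128, 512, 1024]
BUCKET_NAMES = ["<128", "128-511", "512-1023", "1K+"]


def length_bucket(num_tokens: int) -> str:
    """Return the name of the length bucket that num_tokens falls into."""
    return BUCKET_NAMES[bisect_right(BUCKET_EDGES, num_tokens)]


def bucket_histogram(token_counts):
    """Count how many samples land in each length bucket."""
    return Counter(length_bucket(n) for n in token_counts)
```

Running `bucket_histogram` over the token counts of a candidate calibration set gives a quick check of whether long-context samples are represented at all.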

### 2.3 Sampling Method

#### Option A (Simple Random)
- Uniform random sampling
- Fast and easy baseline

#### Option B (Recommended: Stratified)
Split by:
- Sequence length buckets
- Prompt types (instruction / plain / QA)

Example:
- 50% instruction/chat
- 30% general text
- 20% long-context samples
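
A mix like the one above could be drawn as follows. This is a sketch under assumptions: `stratified_sample`, the pool names, and the fixed seed are illustrative, and each pool is assumed to be an in-memory list that has already been filtered into its stratum.

```python
import random


def stratified_sample(pools, mix, total, seed=0):
    """Draw `total` samples from `pools` according to the fractions in `mix`.

    pools: dict mapping stratum name -> list of samples
    mix:   dict mapping stratum name -> fraction (must sum to 1.0)
    """
    if abs(sum(mix.values()) - 1.0) > 1e-6:
        raise ValueError("mix fractions must sum to 1.0")
    rng = random.Random(seed)  # fixed seed so the calibration set is reproducible
    names = list(mix)
    # Round each quota down, then hand out the leftover samples round-robin.
    quotas = {name: int(total * mix[name]) for name in names}
    leftover = total - sum(quotas.values())
    for i in range(leftover):
        quotas[names[i % len(names)]] += 1
    picked = []
    for name in names:
        if quotas[name] > len(pools[name]):
            raise ValueError(f"stratum {name!r} has too few samples")
        # Sample without replacement inside each stratum.
        picked.extend(rng.sample(pools[name], quotas[name]))
    rng.shuffle(picked)  # avoid feeding the calibrator one stratum at a time
    return picked
```

For example, `stratified_sample(pools, {"chat": 0.5, "text": 0.3, "long": 0.2}, total=1024)` would return 512 chat samples, 307 or 308 text samples, and 204 or 205 long-context samples, depending on how the leftover is distributed.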

### 2.4 Prompt Formatting (Important)

For instruction models, use the **actual inference format**:

Example:

```
### Instruction:
<instruction>

### Response:
<response>
```

or a chat template:

```
<|system|>
...
<|user|>
...
<|assistant|>
...
```

A mismatch between the calibration format and the inference format can significantly shift the activation distributions that calibration observes.
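
The two formats shown above can be rendered with small helpers like these. This is only a sketch: `format_alpaca` and `format_chat` are hypothetical names, and the `<|role|>` tags are the placeholders from the example, not the real special tokens of any particular model.

```python
ALPACA_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_alpaca(instruction: str, response: str = "") -> str:
    """Render one single-turn sample in the Alpaca-style layout shown above."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction.strip(), response=response.strip()
    )


def format_chat(messages) -> str:
    """Render a list of {"role", "content"} dicts with placeholder role tags."""
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)
```

In practice the model's own chat template should take precedence over a hand-written one (for Hugging Face tokenizers this is typically `tokenizer.apply_chat_template`), so that calibration sees exactly the tokens used in deployment.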

## 3. VLM Input Construction

For Qwen3-VL:

Each sample should include:
- Image (or dummy image if needed)
- Text prompt

Example:

```
User: What is happening in this image?
<image>
```

Important:
- Apply the **same preprocessing as real inference**
- Use the model's actual tokenizer and image processor
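
One way to represent such a sample before it reaches the processor is a chat-style message list. The sketch below assumes the content-parts layout (`"type": "image"` / `"type": "text"`) commonly used with Hugging Face chat templates for Qwen-VL models; it should be checked against the Qwen3-VL processor documentation. `build_vlm_sample` and the `dummy.png` fallback path are hypothetical.

```python
def build_vlm_sample(question, image_path=None, dummy_image_path="dummy.png"):
    """Build one user turn holding a text prompt and an image reference.

    The part order (text, then image) follows the example above; it is an
    assumption and should mirror what real inference requests send.
    """
    # Fall back to a placeholder image for text-only samples if one is required.
    image = image_path if image_path is not None else dummy_image_path
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image", "image": image},
            ],
        }
    ]
```

The returned structure is meant to be passed through the model's real chat template and image processor, so that resizing, normalization, and image-token expansion match deployment.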

## 4. Calibration Execution Details

### 4.1 Prefill + Decode Coverage

Ensure calibration includes:
- Prefill (full sequence)
- Short decode steps (important for KV cache behavior)

Example:
- Run 1 full forward pass (prefill)
- Run 2 to 4 decode steps
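
The prefill-then-decode pattern can be sketched independently of any framework. Here `step_fn(token_ids, cache) -> (next_token_id, cache)` is a hypothetical wrapper around the model being calibrated (with observers attached); it is an assumption, not an existing API.

```python
def run_prefill_and_decode(step_fn, prompt_ids, num_decode_steps=3):
    """Run one prefill pass followed by a few single-token decode steps.

    Returns the generated token IDs (one from prefill plus one per decode step).
    """
    # Prefill: the whole prompt goes through in one forward pass, with no cache yet.
    next_id, cache = step_fn(list(prompt_ids), None)
    generated = [next_id]
    # Decode: feed only the newest token so the KV-cache path is exercised.
    for _ in range(num_decode_steps):
        next_id, cache = step_fn([generated[-1]], cache)
        generated.append(next_id)
    return generated
```

Calling this once per calibration sample means the observers see both full-sequence activations and single-token activations, rather than prefill statistics only.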

### 4.2 Token Distribution Coverage

We want to expose:
- Special tokens (BOS, EOS, role tokens)
- Punctuation-heavy inputs
- Rare tokens (optional but helpful)
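
A simple coverage check can confirm that the tokens of interest actually appear in the calibration set. This is a minimal sketch: `token_coverage` is a hypothetical helper, and the required IDs (BOS, EOS, role tokens) would have to come from the real tokenizer.

```python
from collections import Counter


def token_coverage(samples, required_ids):
    """Count occurrences of each required token ID across tokenized samples.

    samples:      iterable of token-ID lists
    required_ids: token IDs that calibration should see at least once
    Returns (counts, missing), where missing lists the IDs that never appear.
    """
    required = set(required_ids)
    counts = Counter()
    for ids in samples:
        counts.update(t for t in ids if t in required)
    missing = sorted(required - set(counts))
    return dict(counts), missing
```

A non-empty `missing` list suggests that the prompt formatting step is dropping special tokens or that the sample mix needs adjusting.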

## 5. Ablation Plan

| Dataset         | Expected Outcome |
|----------------|----------------|
| WikiText       | Baseline        |
| Alpaca         | Better instruction alignment |
| ShareGPT       | Better chat realism |
| COCO (VLM)     | Basic multimodal alignment |
| VQAv2 (VLM)    | Coverage of complex cross-modal attention |

## 6. Evaluation Focus

We prioritize:

1. Perplexity delta vs. the FP (unquantized) baseline
2. lm-eval tasks (subset)
3. Representative prompts
4. Qualitative output stability
5. (Optional) Layer-wise activation error
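
For the first item, the perplexity delta can be computed from per-token negative log-likelihoods of the FP and quantized models on the same evaluation text. The sketch below assumes NLLs in nats; the function names are illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token NLL")
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_delta(fp_nlls, quant_nlls):
    """Return (absolute, relative) perplexity increase of quantized vs. FP."""
    ppl_fp = perplexity(fp_nlls)
    ppl_q = perplexity(quant_nlls)
    return ppl_q - ppl_fp, (ppl_q - ppl_fp) / ppl_fp
```

Reporting the relative delta alongside the absolute one makes results comparable across evaluation sets whose baseline perplexities differ.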

## 7. Recommended Default (Initial Guess)

If we had to pick a strong default:

### LLM
- 70% Alpaca
- 30% WikiText

### VLM
- 50% COCO
- 30% Alpaca-style text
- 20% VQAv2

## 8. Risks & Considerations

- Overfitting calibration to specific formats
- Dataset preprocessing mismatch
- Ignoring decode-phase behavior
- Too homogeneous sampling

## 9. Next Actions

- [ ] Implement dataset loaders
- [ ] Add sampling pipeline (random + stratified)
- [ ] Integrate prompt formatting
- [ ] Run PTQ calibration across datasets
- [ ] Compare results and finalize default

