[quantization] Calibration Dataset Selection & Sampling Strategy #645

@mhs4670go

1. Dataset Candidates

We limit the candidates to practical, widely-used, and easy-to-integrate datasets.

1.1 LLM (LLaMA)

(A) Baseline

  • WikiText-2
    • Purpose: baseline comparison
    • Characteristics: plain natural language

(B) Instruction / Chat

  • Alpaca (Stanford Alpaca)

    • Instruction-following format
    • Single-turn prompt-response pairs
  • ShareGPT (filtered subset)

    • Multi-turn conversations
    • Closer to real chat usage

(C) Structured Tasks

  • FLAN-style mixtures (optional subset)
    • Includes QA, reasoning, summarization
    • Good for diverse activation patterns

1.2 VLM (Qwen3-VL)

(A) Text-only baseline

  • Same as LLM (WikiText / Alpaca subset)

(B) Vision-Language

  • COCO Captions

    • Image → caption mapping
    • Simple, stable baseline
  • VQAv2

    • Image + question → answer
    • Introduces cross-modal attention
  • Instruction-style VLM data (if available)

    • Image-grounded instructions
    • Closest to real usage

2. Sampling Strategy

2.1 Number of Samples

Recommended:

  • 512 to 2048 samples total

Guideline:

  • Too small → unstable calibration
  • Too large → diminishing returns

2.2 Sequence Length

We should match deployment characteristics:

| Scenario     | Strategy                 |
|--------------|--------------------------|
| Chat / QA    | 128 ~ 512 tokens         |
| Long context | include some 1K+ samples |
| Mixed        | stratified by length     |
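
The length guidance above can be turned into a simple bucketing step before sampling. This is a minimal sketch that assumes token counts are already available per sample; the bucket names and the 128 / 512 / 1024 boundaries are illustrative choices read off the table, not settled values.

```python
from collections import Counter

# Boundaries are assumptions derived from the table above
# (128 ~ 512 tokens for chat/QA, 1K+ for long context).
def length_bucket(num_tokens: int) -> str:
    """Map a token count to a coarse length bucket."""
    if num_tokens < 128:
        return "short"
    if num_tokens <= 512:
        return "chat"
    if num_tokens < 1024:
        return "medium"
    return "long"


def bucket_histogram(token_counts):
    """Count how many samples fall into each length bucket."""
    return Counter(length_bucket(n) for n in token_counts)
```

Checking `bucket_histogram` on a candidate calibration set is a quick way to see whether the long-context bucket is populated at all before running calibration.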

2.3 Sampling Method

Option A (Simple Random)

  • Uniform random sampling
  • Fast and easy baseline

Option B (Recommended: Stratified)

Split by:

  • Sequence length buckets
  • Prompt types (instruction / plain / QA)

Example:

  • 50% instruction/chat
  • 30% general text
  • 20% long-context samples
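
A stratified draw with the example proportions could be sketched as follows. The function name, the `pools` / `weights` layout, and the largest-remainder rounding are assumptions made for illustration; the issue does not prescribe an implementation.

```python
import random


def stratified_sample(pools, weights, total, seed=0):
    """Draw `total` samples across named strata according to `weights`.

    pools:   dict mapping stratum name -> list of samples
    weights: dict mapping stratum name -> fraction (must sum to 1.0)
    Returns a shuffled list of (stratum_name, sample) pairs.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")
    rng = random.Random(seed)

    # Largest-remainder allocation so the per-stratum counts add up to `total`.
    exact = {name: w * total for name, w in weights.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1

    picked = []
    for name, k in counts.items():
        pool = pools[name]
        if k > len(pool):
            raise ValueError(f"stratum {name!r} has {len(pool)} samples, need {k}")
        # Sample without replacement inside each stratum.
        picked.extend((name, s) for s in rng.sample(pool, k))
    rng.shuffle(picked)
    return picked
```

With `weights = {"instruction": 0.5, "general": 0.3, "long": 0.2}` and `total=512`, this allocates 256, 154, and 102 samples respectively. Fixing `seed` keeps the calibration set reproducible across ablation runs.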

2.4 Prompt Formatting (Important)

For instruction-tuned models, use the actual inference format:

Example:

### Instruction:
<instruction>

### Response:
<response>

or chat template:

<|system|>
...
<|user|>
...
<|assistant|>
...

A mismatch between the calibration format and the inference format can significantly shift the activation distributions observed during calibration.
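
As an illustration, the two layouts above can be rendered with small helpers like these. The helper names are hypothetical, and the strings simply mirror the examples in this issue; for a real model, the template shipped with its tokenizer should take precedence over anything hand-written.

```python
# Template text copied from the instruction-style example above.
ALPACA_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_alpaca(instruction: str, response: str = "") -> str:
    """Render one single-turn sample in the instruction/response layout."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction.strip(), response=response.strip()
    )


def format_chat(messages) -> str:
    """Render role-tagged messages in the <|role|> layout shown above.

    messages: list of {"role": ..., "content": ...} dicts.
    """
    parts = [f"<|{m['role']}|>\n{m['content'].strip()}" for m in messages]
    return "\n".join(parts)
```

Routing every calibration sample through the same formatter used at inference time is what keeps special and role tokens in the positions the quantizer will actually see.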

3. VLM Input Construction

For Qwen3-VL:

Each sample should include:

  • Image (or dummy image if needed)
  • Text prompt

Example:

User: What is happening in this image?
<image>

Important:

  • Maintain real inference preprocessing
  • Use actual tokenizer + image processor
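
One way to represent such a sample before it reaches the processor is sketched below. `VLMSample`, `to_messages`, and the `dummy.png` placeholder are hypothetical names, and the content-list message layout is an assumption modeled on the format commonly used with Hugging Face chat templates; it should be checked against the actual Qwen3-VL processor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLMSample:
    prompt: str
    image_path: Optional[str] = None  # None means "substitute a dummy image"


def to_messages(sample: VLMSample, dummy_image: str = "dummy.png"):
    """Build a single-turn user message with one text part and one image part."""
    image = sample.image_path if sample.image_path is not None else dummy_image
    return [
        {
            "role": "user",
            "content": [
                # Text first, then image, following the example above.
                {"type": "text", "text": sample.prompt},
                {"type": "image", "image": image},
            ],
        }
    ]
```

The resulting messages would then be passed through the model's real chat template and image processor, so that image token placement and resizing match inference.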

4. Calibration Execution Details

4.1 Prefill + Decode Coverage

Ensure calibration includes:

  • Prefill (full sequence)
  • Short decode steps (important for KV cache behavior)

Example:

  • Run 1 full forward pass (prefill)
  • Run 2 to 4 decode steps
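
The prefill-then-decode pattern can be sketched independently of any framework. The `step(tokens, cache) -> (next_token, cache)` interface is a hypothetical stand-in for whatever the quantization wrapper exposes, not an existing API.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical interface: step(tokens, cache) -> (next_token, new_cache).
StepFn = Callable[[List[int], Any], Tuple[int, Any]]


def run_calibration_pass(
    step: StepFn, prompt_ids: List[int], decode_steps: int = 3
) -> List[int]:
    """Run one prefill over the whole prompt, then a few single-token decode steps."""
    # Prefill: the full sequence is processed at once, with no cache yet.
    next_token, cache = step(prompt_ids, None)
    generated = [next_token]
    # Decode: feed one token at a time so observers also see
    # activations produced with a populated KV cache.
    for _ in range(decode_steps):
        next_token, cache = step([next_token], cache)
        generated.append(next_token)
    return generated
```

Keeping `decode_steps` small (for example 2 to 4) bounds the extra calibration cost while still exercising the single-token path.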

4.2 Token Distribution Coverage

We want to expose:

  • Special tokens (BOS, EOS, role tokens)
  • Punctuation-heavy inputs
  • Rare tokens (optional but helpful)
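
A quick coverage check over the tokenized calibration set might look like this. The function name is hypothetical, and the set of required token ids (BOS, EOS, role tokens) has to come from the model's tokenizer.

```python
def missing_token_ids(tokenized_samples, required_ids):
    """Return the required token ids that never occur in the calibration set."""
    seen = set()
    for ids in tokenized_samples:
        seen.update(ids)
    return sorted(set(required_ids) - seen)
```

An empty result means every required token appears at least once; any id that is returned points to a formatting or sampling gap worth fixing before calibration.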

5. Ablation Plan

| Dataset     | Expected Outcome                |
|-------------|---------------------------------|
| WikiText    | Baseline                        |
| Alpaca      | Better instruction alignment    |
| ShareGPT    | Better chat realism             |
| COCO (VLM)  | Basic multimodal alignment      |
| VQAv2 (VLM) | Complex cross-modal interactions |

6. Evaluation Focus

We prioritize:

  1. Perplexity delta vs. the FP baseline
  2. lm-eval tasks (subset)
  3. Representative prompts
  4. Qualitative output stability
  5. (Optional) Layer-wise activation error
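
Two of these metrics are simple enough to sketch directly. The code assumes per-token negative log-likelihoods (natural log) and flattened activation values are already collected; the function names and the choice of relative MSE as the layer-wise error are illustrative, not something the issue fixes.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_delta(fp_nlls, quant_nlls):
    """Positive values mean the quantized model is worse than the FP baseline."""
    return perplexity(quant_nlls) - perplexity(fp_nlls)


def relative_mse(fp_acts, quant_acts):
    """Squared error between activations, normalized by the FP activation energy."""
    num = sum((a - b) ** 2 for a, b in zip(fp_acts, quant_acts))
    den = sum(a ** 2 for a in fp_acts)
    if den == 0:
        raise ValueError("reference activations are all zero")
    return num / den
```

Reporting the delta rather than the raw perplexity makes results comparable across calibration datasets that share the same FP baseline.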

7. Recommended Default (Initial Guess)

If we had to pick a strong default:

LLM

  • 70% Alpaca
  • 30% WikiText

VLM

  • 50% COCO
  • 30% Alpaca-style text
  • 20% VQAv2

8. Risks & Considerations

  • Overfitting calibration to specific formats
  • Dataset preprocessing mismatch
  • Ignoring decode-phase behavior
  • Too homogeneous sampling

9. Next Actions

  • Implement dataset loaders
  • Add sampling pipeline (random + stratified)
  • Integrate prompt formatting
  • Run PTQ calibration across datasets
  • Compare results and finalize default
