fix: PTQ 1GPU, export PP divisibility, hidden states conversations key#1293

Merged
ChenhanYu merged 1 commit into main from chenhany/fix-ptq-export-conversations
Apr 20, 2026
Conversation

@ChenhanYu ChenhanYu (Collaborator) commented Apr 18, 2026

Summary

  • megatron_lm_ptq.yaml: run Qwen3-8B PTQ on a single GPU for L40 clusters (TP=1 for all tasks)
  • quantize.sh: auto-find the largest PP that divides the model's num_hidden_layers for the export step. Qwen3-8B has 36 layers, which isn't divisible by 8, causing an AssertionError on 8-GPU nodes
  • compute_hidden_states_trtllm.py: use messages with a conversations fallback, matching the HF version. Fixes KeyError: 'conversations' when data uses the OpenAI messages format
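
In sketch form, the fallback described in the last bullet looks like this (the helper name and surrounding structure are illustrative, not the script's actual code; only the `messages`/`conversations` key handling comes from the PR):

```python
def get_conversation(entry: dict) -> list:
    """Prefer the OpenAI-style 'messages' key, fall back to 'conversations'."""
    conv = entry.get("messages") or entry.get("conversations")
    if conv is None:
        raise KeyError("entry has neither 'messages' nor 'conversations'")
    return conv

# OpenAI-style entry (previously raised KeyError: 'conversations'):
openai_entry = {"messages": [{"role": "user", "content": "hello"}]}
# Legacy ShareGPT-style entry keeps working via the fallback:
legacy_entry = {"conversations": [{"from": "human", "value": "hello"}]}

print(get_conversation(openai_entry)[0]["role"])  # user
print(get_conversation(legacy_entry)[0]["from"])  # human
```

Because `dict.get` returns `None` on a missing key, the `or` chain checks `messages` first and only then consults `conversations`.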

Test plan

  • Qwen3-8B PTQ runs on single L40 GPU
  • Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU)
  • EAGLE3 offline pipeline reads data with messages field
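
The divisor search the export step performs can be sketched as a standalone Python function (a minimal version of the logic quantize.sh embeds in a one-liner; the function name is illustrative):

```python
def largest_valid_pp(num_hidden_layers: int, total_gpus: int) -> int:
    """Largest pipeline-parallel size <= total_gpus that evenly divides the layer count."""
    pp = max(total_gpus, 1)
    # Walk down from the GPU count until the layers split evenly.
    while pp > 1 and num_hidden_layers % pp != 0:
        pp -= 1
    return pp

# Qwen3-8B has 36 layers:
print(largest_valid_pp(36, 8))  # 6
print(largest_valid_pp(36, 4))  # 4
print(largest_valid_pp(36, 1))  # 1
```

This reproduces the test-plan cases above: 36 is not divisible by 8 or 7, so the search lands on PP=6 for an 8-GPU node.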

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Dataset input handling now supports multiple field formats for enhanced compatibility.
  • Bug Fixes

    • Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation.
    • Updated quantization configuration for more efficient resource utilization.

- megatron_lm_ptq.yaml: Qwen3-8B to single GPU for L40 clusters
- quantize.sh: auto-find largest PP dividing model num_hidden_layers
  for export (Qwen3-8B has 36 layers, not divisible by 8)
- compute_hidden_states_trtllm.py: use messages with conversations
  fallback (matching the HF version)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
copy-pr-bot Bot commented Apr 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai Bot (Contributor) commented Apr 18, 2026

📝 Walkthrough

Walkthrough

This PR modifies three areas: a Python dataset processing script to check the messages field before conversations when extracting chat prompts, a quantization shell script to dynamically compute pipeline parallel size based on model config rather than using all GPUs, and a YAML configuration that reduces tensor-parallel settings from 8 to 1 across quantization and evaluation tasks.

Changes

Cohort / File(s) Summary
Data Processing
examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py
Modified dataset entry field extraction to check entry.get("messages") first before falling back to entry.get("conversations") when building chat prompts.
Quantization Pipeline
tools/launcher/common/megatron_lm/quantize/quantize.sh
Added dynamic pipeline parallel (PP) size computation during checkpoint export; reads num_hidden_layers from model config JSON and selects largest pp value where num_hidden_layers % pp == 0, falling back to TOTAL_GPUS if read fails.
Configuration
tools/launcher/examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
Reduced tensor-parallel (TP) settings from 8 to 1 across all pipeline steps with corresponding Slurm resource adjustments (ntasks_per_node and gpus_per_node reduced from 8 to 1).
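
The configuration change amounts to a fragment of roughly this shape (key names are illustrative; the actual structure of megatron_lm_ptq.yaml may differ):

```yaml
# Hypothetical sketch of the per-task change; real key names may differ.
quantize:
  tensor_model_parallel_size: 1   # was 8
  slurm:
    ntasks_per_node: 1            # was 8
    gpus_per_node: 1              # was 8
```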

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title references all three main changes: PTQ 1GPU configuration, export PP divisibility logic, and hidden states conversations key handling.
  • Security Anti-Patterns: ✅ Passed. The PR introduces no security anti-patterns from SECURITY.md; the Python changes use a safe dict.get() fallback, the shell script uses safe subprocess practices, the YAML contains only configuration adjustments, and no new dependencies or nosec comments were added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


github-actions Bot (Contributor) commented Apr 18, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-20 19:44 UTC

codecov Bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.33%. Comparing base (e9a4989) to head (6e456b0).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1293      +/-   ##
==========================================
+ Coverage   76.61%   77.33%   +0.71%     
==========================================
  Files         459      461       +2     
  Lines       49153    49524     +371     
==========================================
+ Hits        37661    38300     +639     
+ Misses      11492    11224     -268     
Flag Coverage Δ
examples 41.88% <ø> (-0.15%) ⬇️
regression 14.98% <ø> (+0.07%) ⬆️
unit 52.95% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@ChenhanYu ChenhanYu marked this pull request as ready for review April 20, 2026 15:53
@ChenhanYu ChenhanYu requested a review from a team as a code owner April 20, 2026 15:53
@ChenhanYu ChenhanYu requested a review from yeyu-nvidia April 20, 2026 15:53
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/launcher/common/megatron_lm/quantize/quantize.sh`:
- Around line 46-59: TOTAL_GPUS and EXPORT_PP can be zero/unvalidated causing an
invalid PP=0 during the export invocation; before using them (the TOTAL_GPUS
assignment and the EXPORT_PP calculation plus the final export invocation that
sets TP/PP/EP/ETP), validate and clamp both to >=1 (and optionally to NUM_GPUS
if provided) and fallback to 1 on errors; update the python one-liners or add a
small shell check after them to coerce non-numeric or zero values to 1 and log
the corrected values so the final export invocation always receives a valid PP
(and TOTAL_GPUS) >=1.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6f03ef3f-5620-445e-9c40-3500d0b3c23a

📥 Commits

Reviewing files that changed from the base of the PR and between e9a4989 and 6e456b0.

📒 Files selected for processing (3)
  • examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py
  • tools/launcher/common/megatron_lm/quantize/quantize.sh
  • tools/launcher/examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml

Comment on lines 46 to +59 of tools/launcher/common/megatron_lm/quantize/quantize.sh:

 TOTAL_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())" 2>/dev/null || echo ${NUM_GPUS:-1})
-echo "=== Exporting ${MLM_MODEL_CFG} ${QUANT_CFG} (PP=${TOTAL_GPUS}) ==="
+EXPORT_PP=$(python3 -c "
+import json, os
+cfg = os.path.join('${HF_MODEL_CKPT}', 'config.json')
+n_layers = json.load(open(cfg)).get('num_hidden_layers', 1) if os.path.exists(cfg) else 1
+gpus = ${TOTAL_GPUS}
+pp = gpus
+while pp > 1 and n_layers % pp != 0:
+    pp -= 1
+print(pp)
+" 2>/dev/null || echo ${TOTAL_GPUS})
+echo "=== Exporting ${MLM_MODEL_CFG} ${QUANT_CFG} (PP=${EXPORT_PP}, ${TOTAL_GPUS} GPUs) ==="
 export MLM_EXTRA_ARGS=
-TP=1 PP=${TOTAL_GPUS} EP=1 ETP=1 MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${EXPORT_EXE} ${MLM_MODEL_CFG}
+TP=1 PP=${EXPORT_PP} EP=1 ETP=1 MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${EXPORT_EXE} ${MLM_MODEL_CFG}

⚠️ Potential issue | 🟠 Major

Validate and clamp GPU/PP values before export invocation.

At Line 46, torch.cuda.device_count() can return 0; then Line 55 can emit EXPORT_PP=0, and Line 59 runs export with invalid PP=0. Also, the values are used without numeric validation.

🔧 Suggested hardening patch
-TOTAL_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())" 2>/dev/null || echo ${NUM_GPUS:-1})
+TOTAL_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())" 2>/dev/null || echo "${NUM_GPUS:-1}")
+if ! [[ "${TOTAL_GPUS}" =~ ^[0-9]+$ ]] || [[ "${TOTAL_GPUS}" -lt 1 ]]; then
+    TOTAL_GPUS=1
+fi
 EXPORT_PP=$(python3 -c "
 import json, os
 cfg = os.path.join('${HF_MODEL_CKPT}', 'config.json')
 n_layers = json.load(open(cfg)).get('num_hidden_layers', 1) if os.path.exists(cfg) else 1
-gpus = ${TOTAL_GPUS}
+gpus = int('${TOTAL_GPUS}')
 pp = gpus
 while pp > 1 and n_layers % pp != 0:
     pp -= 1
 print(pp)
-" 2>/dev/null || echo ${TOTAL_GPUS})
+" 2>/dev/null || echo "${TOTAL_GPUS}")
 echo "=== Exporting ${MLM_MODEL_CFG} ${QUANT_CFG} (PP=${EXPORT_PP}, ${TOTAL_GPUS} GPUs) ==="
 export MLM_EXTRA_ARGS=
-TP=1 PP=${EXPORT_PP} EP=1 ETP=1 MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${EXPORT_EXE} ${MLM_MODEL_CFG}
+TP=1 PP="${EXPORT_PP}" EP=1 ETP=1 MLM_MODEL_CKPT="${MLM_MODEL_SAVE}" ${EXPORT_EXE} "${MLM_MODEL_CFG}"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 46-46: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 56-56: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 59-59: Double quote to prevent globbing and word splitting.

(SC2086)


@ChenhanYu ChenhanYu merged commit 355c6b7 into main Apr 20, 2026
44 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/fix-ptq-export-conversations branch April 20, 2026 19:44
@kevalmorabia97 kevalmorabia97 added the cherry-pick-0.44.0 label Apr 21, 2026
kevalmorabia97 pushed a commit that referenced this pull request Apr 21, 2026
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key (#1293)

## Summary
- **megatron_lm_ptq.yaml**: Qwen3-8B PTQ to single GPU for L40 clusters
(TP=1, all tasks)
- **quantize.sh**: Auto-find largest PP dividing model's
`num_hidden_layers` for export step. Qwen3-8B has 36 layers which isn't
divisible by 8, causing `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with
`conversations` fallback, matching the HF version. Fixes `KeyError:
'conversations'` when data uses OpenAI `messages` format

## Test plan
- [x] Qwen3-8B PTQ runs on single L40 GPU
- [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs,
PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with `messages` field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Dataset input handling now supports multiple field formats for
enhanced compatibility.

* **Bug Fixes**
* Optimized GPU resource allocation during model quantization with
improved pipeline parallelism computation.
* Updated quantization configuration for more efficient resource
utilization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done label Apr 21, 2026

Labels

  • cherry-pick-0.44.0: After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc
  • cherry-pick-done: Added by bot once PR is cherry-picked to the release branch
