Paper: Qwen3.5-27B results #574
Pull request overview
Updates the Experiential Plasticity paper to include Qwen3.5-27B forging results alongside the existing Qwen3.5-4B entry, with added narrative about hardware targets and published model links.
Changes:
- Adds Qwen3.5-27B baseline/final PPL and device/quantization details to the Qwen3.5 family results table.
- Replaces the prior “queued” narrative with summarized findings, target hardware guidance, and Hugging Face links for both models.
- Updates device annotations for the existing Qwen3.5-4B entry.
Comments suppressed due to low confidence (1)
docs/papers/EXPERIENTIAL-PLASTICITY.md:104
- This section now lists two published models, but there is still a separate "Published model" entry immediately below that only links the 4B. Please remove the duplicate line or update it to match the new plural list so readers don’t miss the 27B link.
**Published models**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) | [continuum-ai/qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged)
**Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering.
**Published model**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)
| | Qwen3.5-9B | 8.2B | Code | — | — | — | — | queued | | ||
| | Qwen3.5-27B | 23.6B | Code | — | — | — | — | queued | | ||
| | **Qwen3.5-4B** | 3.4B | Code | CodeFeedback (156K) | 3.04 | **2.31** | **+24.0%** | RTX 5090 (fp16) | | ||
| | **Qwen3.5-27B** | 23.6B | Code | CodeFeedback (156K) | 3.07 | **2.96** | **+3.5%** | RTX 5090 (4-bit) | |
The improvement percentage for Qwen3.5-27B looks inconsistent with the PPL values shown. Going from 3.07 → 2.96 is ~3.6% (0.11/3.07), not 3.5%; please align the percentage or adjust the underlying numbers/rounding rule so the table is internally consistent.
| | **Qwen3.5-27B** | 23.6B | Code | CodeFeedback (156K) | 3.07 | **2.96** | **+3.5%** | RTX 5090 (4-bit) | | ||
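The percentage check above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the table's improvement column means relative perplexity reduction against the baseline; the function name `ppl_improvement_pct` is hypothetical and not taken from the paper or its tooling.

```python
def ppl_improvement_pct(baseline: float, final: float) -> float:
    """Relative perplexity reduction, as a percentage of the baseline.

    Assumed definition: (baseline - final) / baseline * 100.
    """
    if baseline <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (baseline - final) / baseline * 100.0


# (baseline PPL, final PPL) as listed in the results table.
rows = {
    "Qwen3.5-4B": (3.04, 2.31),
    "Qwen3.5-27B": (3.07, 2.96),
}

for name, (baseline, final) in rows.items():
    print(f"{name}: +{ppl_improvement_pct(baseline, final):.1f}%")
# Qwen3.5-4B: +24.0%
# Qwen3.5-27B: +3.6%
```

Under this definition the 4B row matches the table, while the 27B row rounds to +3.6%. If the published +3.5% was computed from unrounded perplexities, stating that rounding rule in the paper would resolve the mismatch.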
| **Key advance over Qwen2.5 results**: the Qwen3.5-4B improvement (+24%) exceeds the Qwen2.5-7B improvement (+14.6%) despite being a smaller model. This is because domain-specific training data (CodeFeedback: real coding Q&A) is dramatically more effective than generic text (wikitext) at driving head specialization. The heads that survive pruning are the ones that matter for **code**, not for Wikipedia — producing a model that is architecturally optimized for its target domain. | ||
| **Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at 36% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further. |
The VRAM reduction percentage appears miscomputed: 17GB vs 28GB is ~39% less VRAM (11/28), not 36%. Please recalc or clarify what baseline (e.g., measured allocator peak vs model weights only) you’re comparing against.
| **Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at 36% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further. | |
| **Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at ~39% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further. |
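The VRAM figure can be checked the same way. This is a rough sketch, assuming both numbers in the narrative (17GB and 28GB) measure the same quantity; the helper names `reduction_pct` and `implied_baseline_gb` are hypothetical.

```python
def reduction_pct(before_gb: float, after_gb: float) -> float:
    """Percentage reduction going from before_gb to after_gb."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return (before_gb - after_gb) / before_gb * 100.0


def implied_baseline_gb(after_gb: float, claimed_reduction_pct: float) -> float:
    """Baseline size that would make a claimed reduction exact."""
    if not 0 <= claimed_reduction_pct < 100:
        raise ValueError("claimed_reduction_pct must be in [0, 100)")
    return after_gb / (1.0 - claimed_reduction_pct / 100.0)


print(f"{reduction_pct(28.0, 17.0):.1f}% less VRAM")          # 39.3% less VRAM
print(f"{implied_baseline_gb(17.0, 36.0):.1f} GB baseline")   # 26.6 GB baseline
```

With 28GB as the baseline the reduction is about 39%. The original 36% would only hold against a baseline of roughly 26.6GB, which is why naming the measurement (allocator peak versus weights only) matters here.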
| **Published models**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) | [continuum-ai/qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged) | ||
| **Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering. |
The text says the 27B run used "only 2 cycles" before early-stopping, but the "Training configuration" summary below states "3 cycles × 1000 steps" without qualification. Please clarify whether the config differs per model (4B vs 27B) or update the cycle count so the narrative and config are consistent.
| **Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering. | |
| **Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing; Qwen3.5-4B: 3 cycles × 1000 steps; Qwen3.5-27B: early-stopped after 2 of 3 planned cycles × 1000 steps; train-then-prune ordering. |
27B forged: 3.07→2.96 ppl (+3.5%), 17GB 4-bit, targets MacBook Pro M1/M2/M3 32GB.