Paper: Qwen3.5-27B results #574

Merged
joelteply merged 1 commit into main from docs/27b-results on Mar 28, 2026

@joelteply
Contributor

27B forged: 3.07→2.96 ppl (+3.5%), 17GB 4-bit, targets MacBook Pro M1/M2/M3 32GB.

Copilot AI review requested due to automatic review settings March 28, 2026 03:05
@joelteply joelteply merged commit 7960e53 into main Mar 28, 2026
@joelteply joelteply deleted the docs/27b-results branch March 28, 2026 03:05
Contributor

Copilot AI left a comment

Pull request overview

Updates the Experiential Plasticity paper to include Qwen3.5-27B forging results alongside the existing Qwen3.5-4B entry, with added narrative about hardware targets and published model links.

Changes:

  • Adds Qwen3.5-27B baseline/final PPL and device/quantization details to the Qwen3.5 family results table.
  • Replaces the prior “queued” narrative with summarized findings, target hardware guidance, and Hugging Face links for both models.
  • Updates device annotations for the existing Qwen3.5-4B entry.
Comments suppressed due to low confidence (1)

docs/papers/EXPERIENTIAL-PLASTICITY.md:104

  • This section now lists two published models, but there is still a separate "Published model" entry immediately below that only links the 4B. Please remove the duplicate line or update it to match the new plural list so readers don’t miss the 27B link.
**Published models**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) | [continuum-ai/qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged)

**Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering.

**Published model**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged)

| Qwen3.5-9B | 8.2B | Code | — | — | — | — | queued |
| Qwen3.5-27B | 23.6B | Code | — | — | — | — | queued |
| **Qwen3.5-4B** | 3.4B | Code | CodeFeedback (156K) | 3.04 | **2.31** | **+24.0%** | RTX 5090 (fp16) |
| **Qwen3.5-27B** | 23.6B | Code | CodeFeedback (156K) | 3.07 | **2.96** | **+3.5%** | RTX 5090 (4-bit) |

Copilot AI Mar 28, 2026

The improvement percentage for Qwen3.5-27B looks inconsistent with the PPL values shown. Going from 3.07 → 2.96 is ~3.6% (0.11/3.07), not 3.5%; please align the percentage or adjust the underlying numbers/rounding rule so the table is internally consistent.
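
The arithmetic behind this comment can be checked with a minimal sketch. The function name `relative_improvement` is illustrative, and rounding to one decimal place is an assumption about the table's convention rather than something the paper states. The reported +3.5% could still be correct if it was computed from unrounded perplexities, which this sketch cannot see.

```python
def relative_improvement(baseline_ppl: float, final_ppl: float) -> float:
    """Percentage reduction in perplexity relative to the baseline."""
    return (baseline_ppl - final_ppl) / baseline_ppl * 100.0


# Baseline and final PPL as printed in the results table (already rounded).
rows = {
    "Qwen3.5-4B": (3.04, 2.31),
    "Qwen3.5-27B": (3.07, 2.96),
}

for name, (baseline, final) in rows.items():
    print(f"{name}: +{relative_improvement(baseline, final):.1f}%")
```

With the table's rounded values, the 4B row reproduces +24.0% and the 27B row comes out at +3.6%.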

| **Qwen3.5-27B** | 23.6B | Code | CodeFeedback (156K) | 3.07 | **2.96** | **+3.5%** | RTX 5090 (4-bit) |

**Key advance over Qwen2.5 results**: the Qwen3.5-4B improvement (+24%) exceeds the Qwen2.5-7B improvement (+14.6%) despite being a smaller model. This is because domain-specific training data (CodeFeedback: real coding Q&A) is dramatically more effective than generic text (wikitext) at driving head specialization. The heads that survive pruning are the ones that matter for **code**, not for Wikipedia — producing a model that is architecturally optimized for its target domain.
**Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at 36% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further.

Copilot AI Mar 28, 2026

The VRAM reduction percentage appears miscomputed: 17GB vs 28GB is ~39% less VRAM (11/28), not 36%. Please recalc or clarify what baseline (e.g., measured allocator peak vs model weights only) you’re comparing against.

Suggested change
**Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at 36% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further.
**Key findings**: Both models improve over baseline. The 4B shows dramatic +24% improvement — domain-specific data (CodeFeedback: real coding Q&A) drives far more head specialization than generic text. The 27B improves +3.5% while running in 17GB (4-bit NF4) instead of 28GB (fp16) — better quality at ~39% less VRAM. The 27B was forged with only 2 cycles before early-stopping; more cycles and continuous defrag (§8) should improve further.
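
As a quick check of the VRAM figure, here is a small sketch using the 17GB and 28GB values quoted in the paragraph. The helper names are hypothetical, and the second function only shows which fp16 baseline a 36% claim would imply; it does not say how the author measured memory.

```python
def vram_reduction(fp16_gb: float, quantized_gb: float) -> float:
    """Percentage of VRAM saved relative to the fp16 footprint."""
    return (fp16_gb - quantized_gb) / fp16_gb * 100.0


def implied_baseline(quantized_gb: float, claimed_reduction_pct: float) -> float:
    """fp16 footprint that would make the claimed reduction exact."""
    return quantized_gb / (1.0 - claimed_reduction_pct / 100.0)


print(f"17GB vs 28GB: {vram_reduction(28, 17):.1f}% less VRAM")
print(f"A 36% reduction implies a baseline of {implied_baseline(17, 36):.1f}GB")
```

If the 36% came from a measured peak rather than the 28GB weights-only figure, stating that baseline in the text would resolve the mismatch.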


**Published models**: [continuum-ai/qwen3.5-4b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-4b-code-forged) | [continuum-ai/qwen3.5-27b-code-forged](https://huggingface.co/continuum-ai/qwen3.5-27b-code-forged)

**Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering.

Copilot AI Mar 28, 2026

The text says the 27B run used "only 2 cycles" before early-stopping, but the "Training configuration" summary below states "3 cycles × 1000 steps" without qualification. Please clarify whether the config differs per model (4B vs 27B) or update the cycle count so the narrative and config are consistent.

Suggested change
**Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing, 3 cycles × 1000 steps, train-then-prune ordering.
**Training configuration**: LoRA (r=16, α=32) with AMP GradScaler for fp16 stability, gradient checkpointing; Qwen3.5-4B: 3 cycles × 1000 steps; Qwen3.5-27B: early-stopped after 2 of 3 planned cycles × 1000 steps; train-then-prune ordering.
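
One way to keep the narrative and the configuration summary consistent is to record planned and completed cycles separately for each model. The sketch below is purely illustrative: `ForgeRun` and its fields are hypothetical names, not part of the project's code, and the per-model values follow the suggested wording above (which itself assumes the 4B run completed all 3 planned cycles).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForgeRun:
    """Hypothetical record of one forging run; not the project's actual schema."""
    model: str
    completed_cycles: int
    planned_cycles: int = 3      # "3 cycles x 1000 steps" from the paper
    steps_per_cycle: int = 1000
    lora_r: int = 16             # LoRA rank from the paper
    lora_alpha: int = 32         # LoRA alpha from the paper

    @property
    def early_stopped(self) -> bool:
        # A run counts as early-stopped if it finished fewer cycles than planned.
        return self.completed_cycles < self.planned_cycles


runs = [
    ForgeRun(model="Qwen3.5-4B", completed_cycles=3),   # assumed: all cycles ran
    ForgeRun(model="Qwen3.5-27B", completed_cycles=2),  # "only 2 cycles" per the text
]

for run in runs:
    status = "early-stopped" if run.early_stopped else "completed"
    print(f"{run.model}: {run.completed_cycles}/{run.planned_cycles} cycles ({status})")
```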

2 participants