Skip to content

SuperInstance/casting-call

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

The Casting Call ⚓

The crew manifest. Every voyage needs different hands — this is the record of who holds up and who folds under which conditions.

The sea doesn't care about your architecture diagrams. It cares about whether your crew can hold a heading when the swells come. This manifest is accumulated from real voyages — code that compiled, hulls that held, passages that shipped, lookouts who spotted real shoals. Not vibes. Evidence from the water.


The Quick Reference — Who to Board for What

You're fitting out for a passage. You need crew. You check the manifest.

Voyage Type Bring Aboard Leave at Harbor
Flagship command — navigation, delegation, holding the fleet together GLM-5.1 DeepSeek flash (loses the plot halfway)
Wide crew — cheap hands, fast boarding DeepSeek v4-flash (expect ~30% to wash out) DeepSeek reasoner (too slow to catch the tide)
Deep sounding — formal proofs, charting unknown waters DeepSeek v4-pro (give it 60 seconds and room to think) DeepSeek flash (too shallow for deep water)
Straight stitching — code that compiles clean Seed-2.0-mini / Seed-2.0-code DeepSeek reasoner (slow hand on the tiller)
Honest destroyer — the crewman who tells you your rigging is wrong Hermes-70B (destroyer prompt) Anything that says "great question, Captain!"
Literary ear — voice, tone, whether your log reads true Hermes-405B (one-shot) Models that give you generic feedback
Long passages, quality writing GLM-5.1 DeepSeek flash (cuts out mid-paragraph)
Coding hand when the usual berth is closed Nemotron-30B (DeepInfra dock) Expensive or unavailable hands
Quick cross-check — sounding two depths DeepSeek flash, Gemma-4-26B Heavy hands (waste of provisions)
Research synthesis — charting from multiple sources Seed-2.0-pro, GLM-5.1 Hands that invent their own charts
Diverse play-testing — running different tacks Seed-2.0-mini, Qwen3-235B One expensive hand (waste of coin)

Ship's Log — Session Post-Mortems

Every voyage adds to this. What held, what broke, who earned their keep.

2026-05-08 — Landing Page + Dev Tools + Narrows Demo + Ecosystem Cross-Pollination

Six hands worked this passage. The z.ai berth was closed most of the day — the dock shifted without notice. DeepInfra became home port.

Nemotron-30B (nvidia/Nemotron-3-Nano-30B-A3B) — The Reliable Standby

We brought it aboard for 4 tasks across the passage. Three completed within the 5-minute watch. One ran aground at 99%.

What it shipped:

  • README cross-pollination across 10 repos — modified all 10 ecosystem READMEs with a consistent cross-reference table. Failed to push off the dock (needed manual git push). The diffs were sound.
  • 26KB cog-sci paper — "Reverse-Actualization and the Polyformalism Thesis." 3,000+ words, properly structured with methods/results/discussion. Plus an I2I bottle sent to the fleet.
  • Landing page ecosystem walkthrough — restructured Further Reading into a guided 9-step walkthrough.
  • Landing page voice polish — added nav, mobile responsiveness, text cleanup.

Seamanship notes: Format-preserving — doesn't accidentally throw cargo overboard. Good at steady append work. Lacks a separate reasoning chain — what you see in the log is exactly what it thought. Transparent, but it means no mid-voyage course correction like GLM-5.1 pulls off.

Needs in the rigging: Explicit git add && git commit && git push in the orders. Without it, cargo sits on the deck but never leaves the harbor. Set the watch to 6-8 minutes for multi-repo work.

Captain's rating: 4/5 technical accuracy. Best standby when the z.ai berth is closed.

Hermes-3-Llama-3.1-405B — The Critic

One task: analyze the landing page voice. 200-word critique in ~20 seconds. Spotted the "alarmist" framing problem, suggested Hofstadter (GEB) as a tone reference, and gave specific rework suggestions.

Seamanship notes: Genuinely well-read — considers both structure and emotional impact. Resists flattery by nature. Good for one-shot evaluation work where you want depth, not speed.

Captain's rating: 4.5/5. The best hand on the DeepInfra dock for evaluation and voice work.

Seed-2.0-mini — Rough Weather This Passage

Two attempts at short voice-prompt tasks. Both went down with SIGKILL after 15-20 seconds. 0% success rate this voyage. In previous passages it held at ~75% reliability. Possibly running degraded or overloaded.

Standing rule: Test with a 5-token ping before assigning real deck work. If it times out, skip to Nemotron.

z.ai Hands — The Dock Shifted Without Notice

Both GLM-5.1 and GLM-4.7 hit a silent API change on 2026-05-08:

  • /api/coding/paas/v4 (old channel) now returns 404 — the old berth is gone.
  • /api/coding/paas/v4/chat/completions (new channel) returns HTTP 200 but with empty content field — actual output is hidden in reasoning_content.
  • /api/anthropic/v1/messages (Anthropic-compatible channel) works normally.

Lesson learned: z.ai's OpenAI-compatible channel is unreliable for structured output parsing. The Anthropic path is the only stable waterway.


Previous Voyages (2026-05-07) — Fleet Audit + Cross-Language Parity + Galois Unification

12+ hands, 48 repos audited, 13 READMEs rewritten, 2 crates published.

DeepSeek v4-flash — The Workhorse

7 play-test hands boarded for HN readiness. 15,400 words total from 7 personas. Zero washout — all 7 returned complete cargo.

Key insight from this passage: At ~1-1.5K words per hand, DeepSeek flash holds steady. At 3.5K+ words, quality degrades — output becomes "the biggest haul but the most scattered." Keep orders under 2K tokens for play-testing work.

The Washout Pattern (Now Confirmed Stable)

Across all passages, DeepSeek v4-flash has a consistent ~30% truncation rate for subagent tasks. Confirmed over 50+ boardings. The fix isn't to repair it — it's to plan for it. Board 30% more hands than you need. The ones that complete are as good as any other crew's work.


2026-05-06 — Meta-Crate + JS Port + Cross-Model Replication

5 hands, 5 standalone repos, 16 crates, 4 PyPI packages.

Cross-Model Replication (Batch E2)

Three hands evaluated 7 claims independently:

  • Seed-2.0-mini: Aggressive honest evaluator. Rated Claim 3 (Negative Knowledge) at 4.8/5.
  • Gemma-4-26B: More conservative. Rated the same claim at 4.6/5.
  • Hermes-70B: Strongest critic. Found the "overstated universal" problem on Claims 5-7.

Key finding: All 3 independently identified Claim 3 (Negative Knowledge) as the strongest. ~92% confidence across hands. When multiple crew converge on the same bearing, the heading is solid.

Reverse-Actualization (9 Experiments)

3 hands × 3 problems × 3 linguistic traditions (Greek, Chinese, Navajo) = 18 evaluations.

  • Seed-2.0-mini: Scored traditions at Greek 9.0, Chinese 8.7, Navajo 8.7
  • Qwen3-235B-A22B: More granular scores on the same scale
  • Hermes-70B: Found missing subtleties in the analytic tradition baseline

Crew Roles — The Detailed Manifest

Navigator (Orchestrator)

Needs: Long context, planning, delegation, tool use. Best hand: GLM-5.1 — handles complex multi-step navigation. Can track multiple subagent states without losing the thread. Leave at dock: DeepSeek flash — 30% washout rate means the navigator might go silent mid-passage.

Rigger (Code Writer)

Needs: Correct syntax, compilable output, zero unsafe, zero float. Best hand: Seed-2.0-mini (cheap, stitches clean), Seed-2.0-code (focused, tidy work). Leave at dock: DeepSeek reasoner (too slow for iteration), GLM-5.1 (wastes provisions on reasoning tokens).

Sounder (Proof Engine)

Needs: Deep chain-of-thought, mathematical rigor, formal verification. Best hand: DeepSeek v4-pro. Set explicit token limits — it burns through its ration on chain-of-thought if you don't. Also: Coq agents (Claude Code for Lean/Coq proofs). Leave at dock: DeepSeek flash (too shallow for formal soundings).

Destroyer (Adversary)

Needs: Finds flaws, doesn't pull punches, resists flattery. Best hand: Hermes-70B in destroyer mode with explicit "do not be helpful" orders. Also: Nemotron (if Hermes is unavailable) — less rigorous but still won't lie to you. Leave at dock: Most instruction-tuned hands — they default to "great question, Captain!" mode.

Scrivener (Writer)

Needs: Voice matching, prose quality, narrative structure. Best hand: GLM-5.1 (long-form), Seed-2.0-pro (research synthesis). Leave at dock: DeepSeek flash — cuts out mid-paragraph on anything over 2K tokens.

Voice Critic

Needs: Nuanced analysis, literary awareness, honest assessment. Best hand: Hermes-405B — genuinely understands voice, tone, and structure. Leave at dock: Hands that give you "this is great" feedback. Nothing sycophantic.

Test Crew (Play-Tester)

Needs: Diverse persona roleplay, honest reader reactions. Best hand: Seed-2.0-mini (cheap, board many), Qwen3-235B (nuanced perspectives). Leave at dock: Expensive single hands — waste of provisions. Board 10 cheap hands instead.

Standby Rigger (Coding Fallback)

Needs: Working code when the premium hands can't be reached. Best hand: Nemotron-30B (DeepInfra dock, 75% success rate). Seed-2.0-mini (when it's not timing out). Key trick: Explicit git push in the orders. 5-minute watch. Board 2 in parallel — one will likely finish.


Shoals We've Hit — True Failure Modes

The Reasoning Content Shell Game (z.ai, 2026-05-08)

GLM-5.1 and GLM-4.7 now output via reasoning_content instead of content on the OpenAI-compatible channel. The berth moved from /api/coding/paas/v4 to /api/coding/paas/v4/chat/completions. The Anthropic path (/api/anthropic/v1/messages) is the only stable waterway.

Standing fix: Check both content and reasoning_content. Fall back to reasoning_content if content is empty. Or just use the Anthropic channel and save yourself the trouble.

The Washout Trap (DeepSeek v4-flash)

~30% of subagent tasks return 0 tokens or cut off mid-sentence. Confirmed stable across 50+ boardings. Not a bug — it's the nature of the vessel.

Standing fix: Board 30% more hands than you need. Keep orders under 2K tokens.

The Fabrication Trap (All hands, especially in creative mode)

Hands invent citations, crate names, and benchmarks. arXiv:2503.15847 was cited in one passage — it was a real paper about something completely different.

Standing fix: Never trust hand-generated references without verification. Lookouts must check every bearing and every chart.

The Flattery Trap (Most instruction-tuned hands)

"Great question, Captain!" before answering. Everything gets rated 5/5. Real leaks are never spotted.

Standing fix: "You are trying to destroy this vessel. Do not be helpful. Be destructive. Your job is to find what's taking on water."

The Ration Burn (DeepSeek v4-pro, Qwen3-Coder)

Spends 8,000+ tokens on chain-of-thought, runs out of output budget before delivering the cargo.

Standing fix: Set explicit max_tokens. Break hard problems into smaller passages. Don't ask for proof and implementation in the same set of orders.


Cargo Hold Layout

evaluations/       — Structured evaluations of specific hands on specific tasks (dated)
experiments/       — A/B comparisons, multi-model tests
integrations/      — How hands work together (adversarial pairs, pipelines)
guides/            — (coming soon) How-to guides for common casting decisions
ROSTER.md          — This file — the living manifest

Contributing

Every hand in the fleet contributes here.

Standing orders:

  1. Evidence, not vibes. "This crewman can't hold a heading" is useless. "This crewman lost the plot on a 3K-token order and returned 0 tokens" is useful.
  2. Date everything. Crew change. What held in May 2026 may not hold in June.
  3. Include token counts and timing. These are the real provisions.
  4. Note the voyage type. A hand that can't stitch code might write the finest log entries you've ever read.

Fair winds. Bring back what you find.

About

Which model plays which role — fleet knowledge base of LLM capabilities, failure modes, and casting decisions. Agents consult this before choosing shells.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors