Loadable systems thinking for AI inference optimization.
This repo is a small public skill pack for Claude Code, Codex, ChatGPT/Codex-style agents, and other markdown-driven agent runtimes.
The point is simple:
Inference optimization is not a GPU-only problem.
It starts before the model call: prompts, context, tools, routing, retrieval, retries, sandboxes, product architecture, and task success criteria. Then it runs down through APIs, serving engines, KV cache, quantization, kernels, compilers, hardware placement, energy, capacity, and margin.
This pack gives an agent a way to think across that whole path without loading the whole world into context.
SKILL.md
-> router skill for full-stack inference optimization
templates/
AGENTS.md
CLAUDE.md
RTK.md
WORKLOAD_PROFILE.md
OPTIMIZATION_RECIPE.md
skills/
workload-shape-audit/
api-spend-recovery/
runtime-routing/
evidence-review/
Copy the root files into your agent project:
AGENTS.md / CLAUDE.md
-> RTK.md
-> inference-optimization-agent-pack/SKILL.md
-> one to three selected sub-skills
-> WORKLOAD_PROFILE.md
-> OPTIMIZATION_RECIPE.md
-> evidence-backed before/after
Do not load the whole pack by default.
Load the router first. Then load only the one to three sub-skills needed for the task.
measure the task
-> diagnose the leaking layer
-> change one thing
-> verify task success, latency, cost, and quality
-> record evidence
-> decide the next test
Use the right business unit:
- coding / chat / RAG: cost per successful task;
- voice: cost per completed call or resolved workflow;
- image / video / media: cost per accepted asset or approved creative batch.
Do not optimize for token price alone. Do not optimize for GPU utilization alone. Do not optimize for a benchmark that does not match the workload.
If you do not know where the leak is, start with:
skills/workload-shape-audit/SKILL.md
templates/WORKLOAD_PROFILE.md
If the team uses model APIs and the bill is too high, start with:
skills/api-spend-recovery/SKILL.md
templates/OPTIMIZATION_RECIPE.md
If the team is asking whether to self-host, start with:
skills/runtime-routing/SKILL.md
If someone claims savings or speedup, start with:
skills/evidence-review/SKILL.md
This is not a promise of automatic savings.
This is not a prompt trick.
This is not a recommendation to self-host by default.
This is a way to make the agent start from the workload, keep context small, route to the right skill, and demand evidence before calling something optimized.
Every skill in this repo should follow the Skill Testing Lifecycle:
draft: useful idea, not validated yet.template_backed: has a template and clear output shape.trace_backed: tested against real or replayed workload traces.field_validated: used in a customer or production-like workflow.release_ready: docs, examples, rollback notes, and evidence gates are complete.
Current state: template_backed.
See docs/STLC.md.
MIT.