Inference Optimization Agent Pack

Loadable systems thinking for AI inference optimization.

This repo is a small public skill pack for Claude Code, Codex, ChatGPT-style agents, and other markdown-driven agent runtimes.

The point is simple:

Inference optimization is not a GPU-only problem.

It starts before the model call: prompts, context, tools, routing, retrieval, retries, sandboxes, product architecture, and task success criteria. Then it runs down through APIs, serving engines, KV cache, quantization, kernels, compilers, hardware placement, energy, capacity, and margin.

This pack gives an agent a way to think across that whole path without loading the whole world into context.

What Is In Here

SKILL.md
  -> router skill for full-stack inference optimization

templates/
  AGENTS.md
  CLAUDE.md
  RTK.md
  WORKLOAD_PROFILE.md
  OPTIMIZATION_RECIPE.md

skills/
  workload-shape-audit/
  api-spend-recovery/
  runtime-routing/
  evidence-review/

How To Use It

Copy the root files into your agent project. The chain below runs from the agent's entry file to the final output:

AGENTS.md / CLAUDE.md
  -> RTK.md
  -> inference-optimization-agent-pack/SKILL.md
  -> one to three selected sub-skills
  -> WORKLOAD_PROFILE.md
  -> OPTIMIZATION_RECIPE.md
  -> evidence-backed before/after

Do not load the whole pack by default.

Load the router first. Then load only the one to three sub-skills needed for the task.
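
The loading rule can be sketched in Python. This is a minimal illustration, not part of the pack: `build_load_plan`, `ROUTER`, and `MAX_SUB_SKILLS` are hypothetical names, and the only things taken from this README are the file paths and the one-to-three limit.

```python
# Hypothetical helper: the names below are illustrative, not shipped with the pack.
ROUTER = "SKILL.md"
SUB_SKILLS = {
    "workload-shape-audit",
    "api-spend-recovery",
    "runtime-routing",
    "evidence-review",
}
MAX_SUB_SKILLS = 3


def build_load_plan(selected):
    """Return the files to load: the router first, then one to three sub-skills."""
    if not 1 <= len(selected) <= MAX_SUB_SKILLS:
        raise ValueError("select between one and three sub-skills")
    unknown = [name for name in selected if name not in SUB_SKILLS]
    if unknown:
        raise ValueError(f"unknown sub-skills: {unknown}")
    # The router always comes first; sub-skills follow in the order given.
    return [ROUTER] + [f"skills/{name}/SKILL.md" for name in selected]
```

Calling `build_load_plan(["api-spend-recovery"])` returns the router plus one sub-skill file. Passing all four sub-skills raises `ValueError`, which matches the rule against loading the whole pack.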

The Default Loop

measure the task
-> diagnose the leaking layer
-> change one thing
-> verify task success, latency, cost, and quality
-> record evidence
-> decide the next test
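
The verify step of the loop can be sketched as a before/after comparison. This is a sketch under stated assumptions: the `Measurement` fields, the choice of p95 latency, and the `tolerance` parameter are all hypothetical, and the pack does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One measured run. Field names are assumptions for illustration."""
    success_rate: float   # fraction of tasks that met the success criteria
    p95_latency_s: float  # 95th percentile latency in seconds
    cost_usd: float       # total spend for the measured run
    quality: float        # whatever quality score the team already tracks


def verify(before, after, tolerance=0.0):
    """Check the four dimensions from the loop. True means no regression."""
    return {
        "task_success": after.success_rate >= before.success_rate - tolerance,
        "latency": after.p95_latency_s <= before.p95_latency_s * (1 + tolerance),
        "cost": after.cost_usd <= before.cost_usd * (1 + tolerance),
        "quality": after.quality >= before.quality - tolerance,
    }


# Hypothetical numbers: the change saves money and time but costs quality.
before = Measurement(success_rate=0.90, p95_latency_s=4.0, cost_usd=50.0, quality=0.80)
after = Measurement(success_rate=0.91, p95_latency_s=3.5, cost_usd=42.0, quality=0.78)
result = verify(before, after)
```

Here `result["quality"]` is `False` while the other three checks pass, so the change should be recorded as evidence but not called optimized yet.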

The Unit

Use the right business unit:

coding / chat / RAG: cost per successful task;
voice: cost per completed call or resolved workflow;
image / video / media: cost per accepted asset or approved creative batch.

Do not optimize for token price alone. Do not optimize for GPU utilization alone. Do not optimize for a benchmark that does not match the workload.
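
A small worked example shows why token price alone misleads. The numbers are invented for illustration, and `cost_per_success` is a hypothetical helper, not something the pack ships.

```python
def cost_per_success(total_cost_usd, successful_tasks):
    """Cost per successful task: total spend divided by tasks that succeeded."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: the unit cost is undefined")
    return total_cost_usd / successful_tasks


# Hypothetical run of 1000 tasks on two setups.
baseline = cost_per_success(total_cost_usd=50.0, successful_tasks=900)
cheaper_model = cost_per_success(total_cost_usd=40.0, successful_tasks=600)
```

The cheaper setup spends 20% less in total, yet its cost per successful task is higher (about $0.067 versus about $0.056) because fewer tasks succeed.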

Where To Start

If you do not know where the leak is, start with:

skills/workload-shape-audit/SKILL.md
templates/WORKLOAD_PROFILE.md

If the team uses model APIs and the bill is too high, start with:

skills/api-spend-recovery/SKILL.md
templates/OPTIMIZATION_RECIPE.md

If the team is asking whether to self-host, start with:

skills/runtime-routing/SKILL.md

If someone claims savings or speedup, start with:

skills/evidence-review/SKILL.md
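
The starting points above can be written as a lookup table. This is a sketch only: the situation keys and the `where_to_start` function are hypothetical, and falling back to the workload audit for unrecognized situations is an assumption based on the "do not know where the leak is" case.

```python
# Situation keys are illustrative labels; the file paths come from this README.
STARTING_POINTS = {
    "unknown_leak": [
        "skills/workload-shape-audit/SKILL.md",
        "templates/WORKLOAD_PROFILE.md",
    ],
    "api_bill_too_high": [
        "skills/api-spend-recovery/SKILL.md",
        "templates/OPTIMIZATION_RECIPE.md",
    ],
    "considering_self_host": [
        "skills/runtime-routing/SKILL.md",
    ],
    "claimed_savings_or_speedup": [
        "skills/evidence-review/SKILL.md",
    ],
}


def where_to_start(situation):
    """Return the files to load first; unknown situations get the audit."""
    return STARTING_POINTS.get(situation, STARTING_POINTS["unknown_leak"])
```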

What This Is Not

This is not a promise of automatic savings.

This is not a prompt trick.

This is not a recommendation to self-host by default.

This is a way to make the agent start from the workload, keep context small, route to the right skill, and demand evidence before calling something optimized.

STLC

Every skill in this repo should follow the Skill Testing Lifecycle:

draft: useful idea, not validated yet.
template_backed: has a template and clear output shape.
trace_backed: tested against real or replayed workload traces.
field_validated: used in a customer or production-like workflow.
release_ready: docs, examples, rollback notes, and evidence gates are complete.

Current state: template_backed.
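
The lifecycle is an ordered sequence, which could be modeled as an enum. This is a sketch, assuming stages advance one step at a time in the order listed; `Stage` and `next_stage` are hypothetical names, and docs/STLC.md remains the reference for the actual rules.

```python
from enum import IntEnum


class Stage(IntEnum):
    """STLC stages in the order this README lists them."""
    DRAFT = 0
    TEMPLATE_BACKED = 1
    TRACE_BACKED = 2
    FIELD_VALIDATED = 3
    RELEASE_READY = 4


def next_stage(current):
    """Return the stage after `current`, or None at the end of the lifecycle."""
    if current is Stage.RELEASE_READY:
        return None
    return Stage(current + 1)


# The pack's stated current state, per this README.
current = Stage.TEMPLATE_BACKED
upcoming = next_stage(current)
```

With the current state at `template_backed`, `upcoming` is `Stage.TRACE_BACKED`, meaning the next gate is testing against real or replayed workload traces.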

See docs/STLC.md.

License

MIT.
