Touchdown-Labs/inference-optimization-agent-pack

Inference Optimization Agent Pack

Loadable systems thinking for AI inference optimization.

This repo is a small public skill pack for Claude Code, Codex, ChatGPT-style agents, and other markdown-driven agent runtimes.

The point is simple:

Inference optimization is not a GPU-only problem.

It starts before the model call: prompts, context, tools, routing, retrieval, retries, sandboxes, product architecture, and task success criteria. Then it runs down through APIs, serving engines, KV cache, quantization, kernels, compilers, hardware placement, energy, capacity, and margin.

This pack gives an agent a way to think across that whole path without loading the whole world into context.

What Is In Here

SKILL.md
  -> router skill for full-stack inference optimization

templates/
  AGENTS.md
  CLAUDE.md
  RTK.md
  WORKLOAD_PROFILE.md
  OPTIMIZATION_RECIPE.md

skills/
  workload-shape-audit/
  api-spend-recovery/
  runtime-routing/
  evidence-review/

How To Use It

Copy the root files into your agent project. The agent then works through this chain:

AGENTS.md / CLAUDE.md
  -> RTK.md
  -> inference-optimization-agent-pack/SKILL.md
  -> one to three selected sub-skills
  -> WORKLOAD_PROFILE.md
  -> OPTIMIZATION_RECIPE.md
  -> evidence-backed before/after

Do not load the whole pack by default.

Load the router first. Then load only the one to three sub-skills needed for the task.
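The loading rule above can be sketched as a small helper. This is a minimal illustration, not part of the pack: the function name `build_load_list` is hypothetical, and the paths assume they are relative to wherever the pack was copied.

```python
# Hypothetical helper: paths are assumed to be relative to the pack root.
ROUTER = "SKILL.md"
KNOWN_SUB_SKILLS = (
    "workload-shape-audit",
    "api-spend-recovery",
    "runtime-routing",
    "evidence-review",
)


def build_load_list(sub_skills):
    """Return the files to load: router first, then one to three sub-skills."""
    # Drop duplicates while keeping the caller's order.
    selected = list(dict.fromkeys(sub_skills))
    if not 1 <= len(selected) <= 3:
        raise ValueError("select one to three sub-skills, not the whole pack")
    unknown = [name for name in selected if name not in KNOWN_SUB_SKILLS]
    if unknown:
        raise ValueError(f"unknown sub-skills: {unknown}")
    return [ROUTER] + [f"skills/{name}/SKILL.md" for name in selected]


files = build_load_list(["api-spend-recovery", "evidence-review"])
```

Asking for all four sub-skills raises an error on purpose, which mirrors the "do not load the whole pack by default" rule.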

The Default Loop

measure the task
-> diagnose the leaking layer
-> change one thing
-> verify task success, latency, cost, and quality
-> record evidence
-> decide the next test
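
One pass of this loop might look like the sketch below. The `Measurement` fields, the `try_change` function, and the strict "any regression means do not keep" rule are all assumptions made for illustration; a real workload would likely use tolerances and its own metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical snapshot of one workload measurement."""
    success_rate: float    # fraction of tasks that succeeded
    quality_score: float   # higher is better
    p95_latency_s: float   # lower is better
    cost_per_task: float   # lower is better


def regressions(before, after):
    """Return the names of the metrics that got worse after the change."""
    worse = []
    if after.success_rate < before.success_rate:
        worse.append("success_rate")
    if after.quality_score < before.quality_score:
        worse.append("quality_score")
    if after.p95_latency_s > before.p95_latency_s:
        worse.append("p95_latency_s")
    if after.cost_per_task > before.cost_per_task:
        worse.append("cost_per_task")
    return worse


def try_change(name, before, measure_after):
    """Apply exactly one change, re-measure, and return an evidence record."""
    after = measure_after()  # caller applies the change and re-runs the workload
    worse = regressions(before, after)
    return {
        "change": name,
        "before": before,
        "after": after,
        "regressions": worse,
        "keep": not worse,  # assumption: any regression blocks the change
    }


baseline = Measurement(0.90, 0.80, 2.0, 0.10)
record = try_change(
    "trim retrieved context",
    baseline,
    lambda: Measurement(0.90, 0.80, 1.5, 0.08),
)
```

The returned record is the evidence: it keeps the before and after measurements next to the decision, so the next test can start from it.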

The Unit

Use the right business unit:

  • coding / chat / RAG: cost per successful task;
  • voice: cost per completed call or resolved workflow;
  • image / video / media: cost per accepted asset or approved creative batch.

Do not optimize for token price alone. Do not optimize for GPU utilization alone. Do not optimize for a benchmark that does not match the workload.
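
A short worked example shows why the unit matters. The function name and the dollar figures below are illustrative assumptions, not measurements from any real workload.

```python
def cost_per_successful_task(total_cost, successful_tasks):
    """Total spend divided by the number of tasks that actually succeeded."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: cost per success is undefined")
    return total_cost / successful_tasks


# Illustrative numbers only: both setups attempt the same 100 tasks.
cheap_model = cost_per_successful_task(total_cost=2.00, successful_tasks=50)
strong_model = cost_per_successful_task(total_cost=3.00, successful_tasks=100)
```

Here the cheaper setup spends less in total but costs 0.04 per successful task, while the pricier setup costs 0.03, so the lower token bill is the worse deal.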

Where To Start

If you do not know where the leak is, start with:

skills/workload-shape-audit/SKILL.md
templates/WORKLOAD_PROFILE.md

If the team uses model APIs and the bill is too high, start with:

skills/api-spend-recovery/SKILL.md
templates/OPTIMIZATION_RECIPE.md

If the team is asking whether to self-host, start with:

skills/runtime-routing/SKILL.md

If someone claims savings or speedup, start with:

skills/evidence-review/SKILL.md
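
The same routing can be written as a lookup table. The situation keys and the `where_to_start` function are hypothetical names chosen for this sketch; only the file paths come from this README.

```python
# Hypothetical situation keys mapped to the starting files listed above.
START_POINTS = {
    "unknown_leak": [
        "skills/workload-shape-audit/SKILL.md",
        "templates/WORKLOAD_PROFILE.md",
    ],
    "api_bill_too_high": [
        "skills/api-spend-recovery/SKILL.md",
        "templates/OPTIMIZATION_RECIPE.md",
    ],
    "self_host_question": ["skills/runtime-routing/SKILL.md"],
    "savings_or_speedup_claim": ["skills/evidence-review/SKILL.md"],
}


def where_to_start(situation):
    """Return the starting files; fall back to the workload audit when unsure."""
    return START_POINTS.get(situation, START_POINTS["unknown_leak"])


start = where_to_start("self_host_question")
```

Falling back to the workload shape audit for an unrecognized situation follows the first rule above: if the leak is unknown, start there.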

What This Is Not

This is not a promise of automatic savings.

This is not a prompt trick.

This is not a recommendation to self-host by default.

This is a way to make the agent start from the workload, keep context small, route to the right skill, and demand evidence before calling something optimized.

STLC

Every skill in this repo should follow the Skill Testing Lifecycle:

  1. draft: useful idea, not validated yet.
  2. template_backed: has a template and clear output shape.
  3. trace_backed: tested against real or replayed workload traces.
  4. field_validated: used in a customer or production-like workflow.
  5. release_ready: docs, examples, rollback notes, and evidence gates are complete.

Current state: template_backed.

See docs/STLC.md.
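
For tooling that needs to compare stages, the lifecycle can be modeled as an ordered enum. This is a sketch under assumptions: the `Stage` class and `next_stage` helper are hypothetical and are not defined by the pack or by docs/STLC.md.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five STLC stages, ordered from least to most validated."""
    DRAFT = 1
    TEMPLATE_BACKED = 2
    TRACE_BACKED = 3
    FIELD_VALIDATED = 4
    RELEASE_READY = 5


def next_stage(stage):
    """Return the stage that follows, or None once release_ready is reached."""
    if stage is Stage.RELEASE_READY:
        return None
    return Stage(stage + 1)


current = Stage.TEMPLATE_BACKED   # the state this README reports
upcoming = next_stage(current)
```

Because the enum is ordered, a check such as `current >= Stage.TRACE_BACKED` can gate claims that require trace-backed evidence.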

License

MIT.
