Send a prompt to multiple language models in parallel and compare their outputs in the terminal. Useful for evaluating which model handles a given task better, measuring semantic similarity between responses, or running an LLM-as-judge evaluation — without leaving the shell.
pip install assayer

Similarity scoring requires the optional score extra:

pip install "assayer[score]"

Python 3.11 or newer is required.
Contributing? See CONTRIBUTING.md for setup, code style, and PR guidelines.
- OpenAI: GPT and o-series models.
- Anthropic: Claude 4.5 and later models (Opus, Sonnet, Haiku).
- Google Gemini: Gemini 2.0 and later models (Pro, Flash, Flash-Lite).
- Ollama: Local models running on your machine.
Assayer looks for API keys in environment variables or a configuration file at ~/.assayer/config.json.
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GEMINI_API_KEY="your-key"

{
  "OPENAI_API_KEY": "sk-...",
  "ANTHROPIC_API_KEY": "sk-ant-...",
  "GEMINI_API_KEY": "..."
}

Use assayer models check to verify your configuration.
assayer run "Explain recursion in one sentence." --models gpt-4o,claude-haiku-4-5-20251001

assayer run "prompt" --models gpt-4o,claude-sonnet-4-5
assayer run --prompt-file prompt.txt --models gpt-4o,ollama/llama3
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --score
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --judge gpt-4o --judge-criteria "clarity,brevity"
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --output results.json
assayer run "prompt" --models gpt-4o,claude-sonnet-4-5 --output results.csv
assayer run "prompt with {var}" --models gpt-4o --var key=value

| Flag | Description |
|---|---|
| --models | Comma-separated model identifiers (required) |
| --prompt-file | Path to a .txt file instead of an inline prompt |
| --var | KEY=VALUE template variable, repeatable |
| --system | System prompt applied to all models |
| --temperature | Sampling temperature |
| --max-tokens | Maximum output tokens |
| --score | Show pairwise similarity matrix |
| --judge | Model to use as judge |
| --judge-criteria | Comma-separated criteria for the judge |
| --output | Save results to .json or .csv |
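
To illustrate how repeatable --var KEY=VALUE pairs could fill {var} placeholders in a prompt, here is a minimal Python sketch. The helper names parse_vars and render_prompt are hypothetical, and leaving unknown placeholders untouched is an assumption; Assayer's actual substitution rules may differ.

```python
import re


def parse_vars(pairs):
    """Turn ["key=value", ...] into a dict (hypothetical helper)."""
    variables = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected KEY=VALUE, got {pair!r}")
        variables[key] = value
    return variables


def render_prompt(template, variables):
    """Replace each {name} with its value in a single pass.

    Assumption: placeholders with no matching variable are left as-is.
    """
    return re.sub(
        r"\{(\w+)\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        template,
    )


prompt = render_prompt(
    "Translate {text} to {lang}",
    parse_vars(["text=hello", "lang=French"]),
)
print(prompt)
```

Substituting in one pass means a value that happens to contain braces is not expanded a second time.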
assayer models list # list all supported model identifiers
assayer models check # check which API keys are configured
assayer models check ollama # check if Ollama is running and list local models

assayer config set OPENAI_API_KEY sk-...
assayer config show

Keys are saved to ~/.assayer/config.json. Environment variables take precedence.
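
The precedence rule can be sketched in Python as follows. This is an illustration only, not Assayer's implementation: resolve_key is a hypothetical helper, and the sketch assumes the config file is a flat JSON object of key names to values, as in the example above.

```python
import json
import os
from pathlib import Path

DEFAULT_CONFIG = Path.home() / ".assayer" / "config.json"


def resolve_key(name, config_path=DEFAULT_CONFIG, env=None):
    """Return the value for `name`, preferring the environment over the file."""
    env = os.environ if env is None else env
    # 1. A non-empty environment variable wins.
    if env.get(name):
        return env[name]
    # 2. Otherwise fall back to the config file, if it exists and parses.
    try:
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if not isinstance(config, dict):
        return None
    return config.get(name)
```

With this ordering, a key exported in the current shell overrides a stale value saved earlier in the config file.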
export OPENAI_API_KEY=sk-...

Supported models: gpt-5.5, gpt-5.5-pro, gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano, gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o3-mini, o4-mini

export ANTHROPIC_API_KEY=sk-ant-...

Supported models: claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001, claude-opus-4-6, claude-sonnet-4-5, claude-opus-4-5

export GEMINI_API_KEY=...

Supported models: gemini-3.1-pro-preview, gemini-3.1-flash-lite, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.0-flash-lite
No API key needed. Start Ollama and use the ollama/ prefix:
ollama serve
assayer run "prompt" --models ollama/llama4-scout,ollama/llama3.2,ollama/qwen3

--score embeds all outputs using all-MiniLM-L6-v2 (runs locally, no API call) and displays a pairwise cosine similarity matrix. Values typically range from 0 (unrelated) to 1 (identical meaning).
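
For readers unfamiliar with the metric, the following Python sketch shows how a pairwise cosine similarity matrix can be computed. It uses tiny hand-written vectors in place of real all-MiniLM-L6-v2 embeddings, and the function names are hypothetical rather than part of Assayer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Map each pair of model names to the similarity of their vectors."""
    names = list(embeddings)
    return {
        a: {b: cosine_similarity(embeddings[a], embeddings[b]) for b in names}
        for a in names
    }


# Toy stand-ins for embedded outputs; real embeddings have hundreds of dimensions.
toy = {
    "gpt-4o": [1.0, 0.0, 1.0],
    "claude-sonnet-4-5": [1.0, 0.0, 0.9],
    "ollama/llama3": [0.0, 1.0, 0.0],
}
matrix = similarity_matrix(toy)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

The diagonal of the matrix is always 1, since every output is identical to itself, and the matrix is symmetric.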
--judge <model> sends all outputs to the specified model and asks it to pick a winner. Use --judge-criteria to focus the evaluation:
assayer run "Write a sorting algorithm." \
--models gpt-4o,claude-sonnet-4-5 \
--judge gpt-4o \
--judge-criteria "correctness,readability"

If the judge call fails, a warning is printed to stderr and the run continues normally.
--output results.json saves full results as JSON. --output results.csv saves as CSV. The file format is determined by the extension.
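
Choosing the format from the file extension can be sketched in Python as below. This is a rough illustration under assumptions: save_results is a hypothetical helper, the results are assumed to be a list of flat dictionaries, and the field names model and output are invented for the example. The actual file layout Assayer writes is not specified here.

```python
import csv
import json
from pathlib import Path


def save_results(results, path):
    """Write `results` (a list of flat dicts) as JSON or CSV based on the suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    elif suffix == ".csv":
        # Use the keys of the first row as the column headers.
        fieldnames = list(results[0]) if results else []
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(results)
    else:
        raise ValueError(f"Unsupported output extension: {suffix!r}")


# Hypothetical result rows, for illustration only.
example = [
    {"model": "gpt-4o", "output": "first answer"},
    {"model": "claude-sonnet-4-5", "output": "second answer"},
]
```

Calling save_results(example, "results.json") or save_results(example, "results.csv") then writes the same rows in either format.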