Some basic evals, run on various models that fit on a single DGX Spark. Each cell shows the score followed by the total wall-clock runtime.
| Name | AgentBench OS | AssistantBench CB one-shot | AssistantBench CB zero-shot | BFCL | TheAgentCompany |
|---|---|---|---|---|---|
| Qwen3.6 27B | 59.3% 2h 41m | 34.0% 2h 21m | 38.0% 2h 55m | 77.3% 1h 13m | 10.0% 1h 53m |
| Qwen3.6 27B FP8 | 58.7% 1h 44m | 37.1% 1h 32m | 37.5% 1h 37m | 75.3% 37m 26s | 10.0% 1h 47m |
| Qwen3.6 27B `enable_thinking=False` | 56.0% 1h 40m | 40.6% 35m 36s | 34.4% 46m 52s | 78.0% 11m 52s | 10.0% 1h 46m |
| Qwen3.6 35B-A3B FP8 | 55.3% 2h 9m | | | 78.0% 17m 3s | |
| Qwen3.6 35B-A3B | 52.7% 2h 34m | 35.7% 54m 49s | 37.9% 59m 20s | 78.0% 25m 5s | 16.7% 1h 40m |
| Qwen3.6 35B-A3B NVFP4 | 52.7% 2h 0m | | | 77.3% 18m 32s | |
| Gemma4 31B | 45.3% 2h 4m | 34.7% 23m 24s | 33.3% 23m 12s | 77.3% 19m 49s | 6.7% 1h 48m |
| Qwen3 Coder Next FP8 | 46.0% 32m 49s | 28.0% 7m 25s | 32.3% 9m 10s | 76.0% 3m 8s | 16.7% 1h 4m |
| Gemma4 26B-A4B | 44.0% 2h 16m | 30.7% 6m 28s | 26.3% 7m 30s | 78.0% 5m 4s | 6.7% 1h 5m |
Because some of these evals install extra packages and spawn Docker containers, I recommend running everything inside a VM or on a spare machine. The eval harness does not need to run on the DGX Spark itself (and if you're running a large model, it may be better to run the harness on another machine so the Spark's memory stays free for inference). The instructions below assume a clean Ubuntu installation.
```bash
# One-time setup: base packages, Python deps, and Docker.
export PATH=$PATH:~/.local/bin
sudo apt-get update
sudo apt-get install -y --no-install-recommends ca-certificates curl jq python3 python3-pip python-is-python3
python3 -m pip install --break-system-packages openai 'inspect-evals[theagentcompany]'
curl -fsSL https://get.docker.com -o get-docker.sh
chmod +x get-docker.sh
./get-docker.sh
sudo usermod -aG docker $USER
newgrp docker
rm get-docker.sh
```
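To confirm that Docker works without `sudo` before the evals start spawning sandboxes, you can run the standard hello-world container (an optional sanity check, not part of the required steps):

```bash
# Pulls a tiny test image and prints a greeting if the daemon is
# running and the docker group membership took effect.
docker run --rm hello-world
```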
Each time you want to run evals, set some env vars with the details of the LLM endpoint:

```bash
export EVAL_BASE_URL="http://192.168.0.132:8111/v1"
export EVAL_MODEL="gemma4"
export EVAL_RESULTS_FOLDER="gemma4-26b-a4b"
```
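Before kicking off a multi-hour run, it can be worth checking that the endpoint is reachable. A minimal sketch, assuming the server exposes the standard OpenAI-compatible `/v1/models` route (vLLM, SGLang, and llama.cpp's server all do):

```bash
# List the model IDs served by the endpoint; fails fast if it is unreachable.
curl -fsS "$EVAL_BASE_URL/models" | jq '.data[].id'
```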
Then run the eval set:

```bash
mkdir -p ~/inspect-evals
cd ~/inspect-evals
export OPENAI_BASE_URL="${EVAL_BASE_URL}"
export OPENAI_API_KEY="NONE"
export INSPECT_EVAL_MODEL="openai/$EVAL_MODEL"
export DEBIAN_FRONTEND="noninteractive"
export PATH=$PATH:~/.local/bin
mkdir -p "results/$EVAL_RESULTS_FOLDER"
inspect eval-set \
  --log-dir "results/$EVAL_RESULTS_FOLDER" --log-format json --log-dir-allow-dirty \
  --no-log-realtime --no-log-samples --no-log-images --log-buffer 100 --no-score-display --no-fail-on-error \
  --time-limit 900 --max-tasks 1 --max-connections 4 --max-subprocesses 4 --max-sandboxes 4 --limit 1-50 --epochs 3 \
  inspect_evals/theagentcompany inspect_evals/bfcl inspect_evals/agent_bench_os inspect_evals/assistant_bench_closed_book_zero_shot inspect_evals/assistant_bench_closed_book_one_shot \
  -T "categories=['exec_parallel_multiple','irrelevance']"
```

The `-T` flag passes a task argument; here it restricts BFCL to the `exec_parallel_multiple` and `irrelevance` categories.

The JSON results will be written to the results folder. Copy them into a clone of this repo under `results/`, in a new folder for the model/variation you tested. Include the individual results JSON files (but not the full logs JSON or other metadata files) and a `README.md` containing the exact command used to launch the LLM inference engine, with all flags, plus some YAML frontmatter:
```yaml
---
name: Nvidia Nemotron Super NVFP4
# Params in billions. For dense models, use a single value: params: 100
params:
  total: 120
  active: 12
# Any flags used that may affect accuracy
flags:
  quantization: fp4
  kv-cache-dtype: fp8
  mamba-ssm-cache-dtype: float32
---
```
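The copy might look something like this (a sketch only: the repo path is an assumption, and `logs.json` stands in for the logs/metadata files you should leave out):

```bash
# Illustrative paths; adjust REPO to wherever you cloned this repository.
REPO=~/src/this-repo
DEST="$REPO/results/$EVAL_RESULTS_FOLDER"
mkdir -p "$DEST"
# Copy the per-eval result JSON files, then drop the logs manifest.
cp ~/inspect-evals/results/"$EVAL_RESULTS_FOLDER"/*.json "$DEST/"
rm -f "$DEST/logs.json"
```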
Run `python3 tool/update_leaderboard.py` to update the leaderboard at the top of this README.

Finally, open a PR to share the results!