Some basic evals, run on various models that fit on a single DGX Spark. Each cell shows the score followed by the total wall-clock runtime.
| Name | AgentBench OS | AssistantBench CB one-shot | AssistantBench CB zero-shot | BFCL | TheAgentCompany |
|---|---|---|---|---|---|
| Qwen3.6 27B | 59.3% 2h 41m | 34.0% 2h 21m | 38.0% 2h 55m | 77.3% 1h 13m | 10.0% 1h 53m |
| Qwen3.6 27B FP8 | 58.7% 1h 44m | 37.1% 1h 32m | 37.5% 1h 37m | 75.3% 37m 26s | 10.0% 1h 47m |
| Qwen3.6 27B `enable_thinking=False` | 56.0% 1h 40m | 40.6% 35m 36s | 34.4% 46m 52s | 78.0% 11m 52s | 10.0% 1h 46m |
| Qwen3.6 35B-A3B FP8 | 55.3% 2h 9m | | | 78.0% 17m 3s | |
| Qwen3.6 35B-A3B | 52.7% 2h 34m | 35.7% 54m 49s | 37.9% 59m 20s | 78.0% 25m 5s | 16.7% 1h 40m |
| Qwen3.6 35B-A3B NVFP4 | 52.7% 2h 0m | | | 77.3% 18m 32s | |
| Gemma4 31B | 45.3% 2h 4m | 34.7% 23m 24s | 33.3% 23m 12s | 77.3% 19m 49s | 6.7% 1h 48m |
| Qwen3 Coder Next FP8 | 46.0% 32m 49s | 28.0% 7m 25s | 32.3% 9m 10s | 76.0% 3m 8s | 16.7% 1h 4m |
| Gemma4 26B-A4B | 44.0% 2h 16m | 30.7% 6m 28s | 26.3% 7m 30s | 78.0% 5m 4s | 6.7% 1h 5m |
Because some of these evals install extra packages and spawn Docker containers, I recommend running everything inside a VM or on a spare machine. The eval harness does not need to run on the DGX Spark itself (and if you're running a large model, it may be better to run the harness on another machine so the Spark's memory stays free for inference). The instructions below assume a clean Ubuntu installation.
```bash
# One-time setup: base packages, Python deps, and Docker.
export PATH=$PATH:~/.local/bin
sudo apt-get update
sudo apt-get install -y --no-install-recommends ca-certificates curl jq python3 python3-pip python-is-python3
python3 -m pip install --break-system-packages openai 'inspect-evals[theagentcompany]'
curl -fsSL https://get.docker.com -o get-docker.sh
chmod +x get-docker.sh
./get-docker.sh
sudo usermod -aG docker $USER
newgrp docker
rm get-docker.sh
```
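To confirm that Docker works without `sudo` before the evals start spawning sandboxes, you can run the standard hello-world container (an optional sanity check, not part of the required steps):

```bash
# Pulls a tiny test image and prints a greeting if the daemon is
# running and the docker group membership took effect.
docker run --rm hello-world
```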
Each time you want to run evals, set some env vars with the details of the LLM endpoint:

```bash
export EVAL_BASE_URL="http://192.168.0.132:8111/v1"
export EVAL_MODEL="gemma4"
export EVAL_RESULTS_FOLDER="gemma4-26b-a4b"
```
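Before kicking off a multi-hour run, it can be worth checking that the endpoint is reachable. A minimal sketch, assuming the server exposes the standard OpenAI-compatible `/v1/models` route (vLLM, SGLang, and llama.cpp's server all do):

```bash
# List the model IDs served by the endpoint; fails fast if it is unreachable.
curl -fsS "$EVAL_BASE_URL/models" | jq '.data[].id'
```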
Then run the eval set:

```bash
mkdir -p ~/inspect-evals
cd ~/inspect-evals
export OPENAI_BASE_URL="${EVAL_BASE_URL}"
export OPENAI_API_KEY="NONE"
export INSPECT_EVAL_MODEL="openai/$EVAL_MODEL"
export DEBIAN_FRONTEND="noninteractive"
export PATH=$PATH:~/.local/bin
mkdir -p "results/$EVAL_RESULTS_FOLDER"
inspect eval-set \
  --log-dir "results/$EVAL_RESULTS_FOLDER" --log-format json --log-dir-allow-dirty \
  --no-log-realtime --no-log-samples --no-log-images --log-buffer 100 --no-score-display --no-fail-on-error \
  --time-limit 900 --max-tasks 1 --max-connections 4 --max-subprocesses 4 --max-sandboxes 4 --limit 1-50 --epochs 3 \
  inspect_evals/theagentcompany inspect_evals/bfcl inspect_evals/agent_bench_os inspect_evals/assistant_bench_closed_book_zero_shot inspect_evals/assistant_bench_closed_book_one_shot \
  -T "categories=['exec_parallel_multiple','irrelevance']"
```

The `-T` flag passes a task argument; here it restricts BFCL to the `exec_parallel_multiple` and `irrelevance` categories.

The JSON results will be written to the results folder. Copy them into a clone of this repo under `results/`, in a new folder for the model/variation you tested. Include the individual results JSON files (but not the full logs JSON or other metadata files) and a `README.md` containing the exact command used to launch the LLM inference engine, with all flags, plus some YAML frontmatter:
```yaml
---
name: Nvidia Nemotron Super NVFP4
# Params in billions. For dense models, use a single value: params: 100
params:
  total: 120
  active: 12
# Any flags used that may affect accuracy
flags:
  quantization: fp4
  kv-cache-dtype: fp8
  mamba-ssm-cache-dtype: float32
---
```
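The copy might look something like this (a sketch only: the repo path is an assumption, and `logs.json` stands in for the logs/metadata files you should leave out):

```bash
# Illustrative paths; adjust REPO to wherever you cloned this repository.
REPO=~/src/this-repo
DEST="$REPO/results/$EVAL_RESULTS_FOLDER"
mkdir -p "$DEST"
# Copy the per-eval result JSON files, then drop the logs manifest.
cp ~/inspect-evals/results/"$EVAL_RESULTS_FOLDER"/*.json "$DEST/"
rm -f "$DEST/logs.json"
```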
Run `python3 tool/update_leaderboard.py` to update the leaderboard at the top of this README.

Finally, open a PR to share the results!