DGX Spark Evals

Some basic evals run on various models that fit on a single DGX Spark.

Leaderboard

| Name | AgentBench (OS) | AssistantBench (closed book, one-shot) | AssistantBench (closed book, zero-shot) | BFCL | TheAgentCompany |
| --- | --- | --- | --- | --- | --- |
| Qwen3.6 27B | 59.3% (2h 41m) | 34.0% (2h 21m) | 38.0% (2h 55m) | 77.3% (1h 13m) | 10.0% (1h 53m) |
| Qwen3.6 27B FP8 | 58.7% (1h 44m) | 37.1% (1h 32m) | 37.5% (1h 37m) | 75.3% (37m 26s) | 10.0% (1h 47m) |
| Qwen3.6 27B `enable_thinking=False` | 56.0% (1h 40m) | 40.6% (35m 36s) | 34.4% (46m 52s) | 78.0% (11m 52s) | 10.0% (1h 46m) |
| Qwen3.6 35B-A3B FP8 | 55.3% (2h 9m) | | | 78.0% (17m 3s) | |
| Qwen3.6 35B-A3B | 52.7% (2h 34m) | 35.7% (54m 49s) | 37.9% (59m 20s) | 78.0% (25m 5s) | 16.7% (1h 40m) |
| Qwen3.6 35B-A3B NVFP4 | 52.7% (2h 0m) | | | 77.3% (18m 32s) | |
| Gemma4 31B | 45.3% (2h 4m) | 34.7% (23m 24s) | 33.3% (23m 12s) | 77.3% (19m 49s) | 6.7% (1h 48m) |
| Qwen3 Coder Next FP8 | 46.0% (32m 49s) | 28.0% (7m 25s) | 32.3% (9m 10s) | 76.0% (3m 8s) | 16.7% (1h 4m) |
| Gemma4 26B-A4B | 44.0% (2h 16m) | 30.7% (6m 28s) | 26.3% (7m 30s) | 78.0% (5m 4s) | 6.7% (1h 5m) |

Each cell shows the score followed by the total run time. Blank cells are evals that have not yet been run for that model.

Running Evals

Because some of these evals require installing packages and also spawn Docker containers, I recommend running everything inside a VM or on a spare machine. This does not need to run on the DGX Spark (and if you're running a large model, it might be better to run it on another machine). The instructions below assume a clean Ubuntu installation.

Set up Dependencies

export PATH=$PATH:~/.local/bin
sudo apt-get update
sudo apt-get install -y --no-install-recommends ca-certificates curl jq python3 python3-pip python-is-python3
python3 -m pip install --break-system-packages openai inspect-evals "inspect-evals[theagentcompany]"

curl -fsSL https://get.docker.com -o get-docker.sh
chmod +x get-docker.sh
./get-docker.sh
sudo usermod -aG docker $USER
newgrp docker
rm get-docker.sh
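
Optionally, sanity-check the install before continuing. This is just a quick smoke test: hello-world pulls a tiny test image, and inspect is the Inspect AI CLI that inspect-evals brings in:

# Confirm Docker works without sudo and the inspect CLI is on PATH
docker run --rm hello-world
inspect --version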

Configure the LLM Endpoint

Each time you want to run evals, set some env vars with the details of the LLM endpoint:

export EVAL_BASE_URL="http://192.168.0.132:8111/v1"
export EVAL_MODEL="gemma4"
export EVAL_RESULTS_FOLDER="gemma4-26b-a4b"
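
Before kicking off a multi-hour run, it's worth confirming the endpoint is reachable. This assumes the inference server exposes the standard OpenAI-compatible models route (most servers do):

# List the models served at the endpoint; should include $EVAL_MODEL
curl -fsS "${EVAL_BASE_URL}/models" | jq .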

Start the Evals

mkdir -p ~/inspect-evals
cd ~/inspect-evals

export OPENAI_BASE_URL="${EVAL_BASE_URL}"
export OPENAI_API_KEY="NONE"
export INSPECT_EVAL_MODEL="openai/$EVAL_MODEL"
export DEBIAN_FRONTEND="noninteractive"
export PATH=$PATH:~/.local/bin

mkdir -p "results/$EVAL_RESULTS_FOLDER"
inspect eval-set \
  --log-dir "results/$EVAL_RESULTS_FOLDER" --log-format json --log-dir-allow-dirty \
  --no-log-realtime --no-log-samples --no-log-images --log-buffer 100 --no-score-display --no-fail-on-error \
  --time-limit 900 --max-tasks 1 --max-connections 4 --max-subprocesses 4 --max-sandboxes 4 --limit 1-50 --epochs 3 \
  inspect_evals/theagentcompany inspect_evals/bfcl inspect_evals/agent_bench_os inspect_evals/assistant_bench_closed_book_zero_shot inspect_evals/assistant_bench_closed_book_one_shot \
  -T "categories=['exec_parallel_multiple','irrelevance']"

Create a PR

The JSON results will be written to the results folder. Copy them into a clone of this repo under results/, in a new folder named for the model/variation you tested. Include the individual results JSON files (but not the full logs JSON or other metadata files), plus a README.md containing the exact command used to launch the LLM inference engine (with all flags) and some YAML frontmatter:

---
name: Nvidia Nemotron Super NVFP8
# Params in billions. For dense models, use `params: 100`
params:
  total: 120
  active: 12
# Any flags used that may affect accuracy
flags:
  quantization: fp4
  kv-cache-dtype: fp8
  mamba-ssm-cache-dtype: float32
---
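
Putting that together, a new results folder might look something like this (the folder name and JSON file names here are purely illustrative; the actual JSON file names come from the inspect run):

results/
  gemma4-26b-a4b/
    README.md        # launch command + YAML frontmatter as above
    <task>.json      # one results JSON file per eval task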

Run `python3 tool/update_leaderboard.py` to update the leaderboard at the top of this README.

Finally, open a PR to share the results!
