SPRT and Fixed‐Game Workloads

Andrew Grant edited this page Mar 29, 2024 · 1 revision

The Create Test button on the OpenBench sidebar can be used to create SPRT tests, as well as to play a fixed number of games between two engines. The test creation page has 6 subsections of fields to fill out. The vast majority of this form will be pre-filled if your instance and engine are configured optimally. Below, each subsection will be explained in detail.

When you attempt to create a test, OpenBench verifies all of the input fields and displays an error message in the web browser for each issue it finds, which should help you fix them. OpenBench may also display a warning that the Base branch _appears_ to be ahead of the Dev branch. This warning can be ignored, but it helps to flag cases where someone else has pushed to the Base branch without your knowledge.

Engines

This subsection has four fields: Dev Engine, Dev Source, Base Engine, and Base Source. Throughout the entire testing process, we will commonly refer to the two versions of the engine as Dev and Base. Dev refers to the speculative changes you want to test, and Base refers to the current known best, typically master or main.

From the drop-down menu, you may select the Engine for both Dev and Base. The list of engines that appears is determined by the configuration files in /Engines/. To read about adding a new engine, refer to the documentation here. Selecting the correct engine is critical, as the configuration associated with the selected engine contains the information needed to build, and possibly even to execute, the engine.

Additionally, you must fill out the Source field for each engine. This is the location of a repository on GitHub. It can be a private repository, if the engine you are using has been configured as such.

Note: The Engine fields will auto-populate to the default engine for any given user. That can be set in the Profile page, located on the OpenBench sidebar. When doing this, you may also specify a default repository for the engine. If no repository is provided, the form will auto-fill using the source specified in the engine's configuration file.

Engine Settings

There are two sections of fields, one for Dev and one for Base. The Branch field is used to provide the GitHub branch you would like to use. This can also be a GitHub tag or a full-length commit SHA. If OpenBench fails to locate the ref that you provided, it will notify you after attempting to create the test. Note: The Base Branch will auto-fill based on the engine configuration, typically to master or main.

Next is the Bench field. This defaults to "Autofill". If left untouched, OpenBench will attempt to parse the most recent commit message on each branch, to determine the bench. There are a number of accepted formats, but a simple one is: bench 1234567. For ease of use, it is advised to always commit bench values. You can refer to Client/bench.py's parse_stream_output() function to learn more about the regex parsing.
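As an illustration of the simple bench 1234567 format mentioned above, here is a minimal sketch of pulling a bench value out of a commit message. The pattern and the extract_bench() helper are hypothetical and written only for this example. They are not the regex OpenBench uses, and OpenBench accepts more formats than this sketch does.

```python
import re

# Hypothetical pattern for illustration only; NOT the regex used by OpenBench.
# Accepts "bench 1234567" and "Bench: 1234567", in any letter case.
BENCH_PATTERN = re.compile(r"\bbench[:\s]+(\d+)\b", re.IGNORECASE)

def extract_bench(commit_message):
    """Return the last bench value found in the message, or None."""
    matches = BENCH_PATTERN.findall(commit_message)
    return int(matches[-1]) if matches else None

print(extract_bench("Tune LMR margins\n\nbench 1234567"))
```

If a message carries no bench value, the helper returns None, which roughly corresponds to the case where Autofill cannot determine the bench and you must enter it by hand.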

Next is the Network field. This will contain a dropdown list of all of the Networks uploaded to the framework for a specific engine. You may select from any, but be aware of whether the branch you selected is able to use the network you selected. Note: The Network will default to the current "default Network" for the engine. This can be seen, and selected, in the Networks page, found on the OpenBench sidebar.

Next is the Options field. These values are generally auto-filled, using the test mode buttons, which are defined in the engine configuration. It is advised to save commonly used configurations. Note: By default, these fields will attempt to populate with the STC settings for an engine. It is required that Threads= and Hash= are present in the engine options.

Lastly, we have the Time field. A number of formats are supported. They are described below. OpenBench provides a time margin buffer through cutechess of 250ms when playing games with proper time controls. This is to avoid time losses, in unusually unstable conditions that might exist on someone else's hardware.

| Type | Format | Description |
| --- | --- | --- |
| Fischer | 10.0+0.10 | Base time in seconds, + (optional) increment in seconds |
| Cyclic | 40/10.0+1.0 | Moves per time addition, size of time addition, + (optional) increment in seconds |
| Fixed Depth | D=10 | Fixed depth searches, with infinite time |
| Fixed Nodes | N=10000 | Fixed node searches, with infinite time |
| Fixed MoveTime | MT=1000 | Search for exact amount of time, for each move. An overhead is provided |
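To make the formats concrete, the following sketch classifies a Time string into one of the five types listed. The patterns and the classify_time_control() function are assumptions made for this example, derived only from the sample formats shown here. OpenBench's own validation may be stricter or more lenient.

```python
import re

# Hypothetical patterns based on the example formats; not OpenBench's own parser.
NUMBER = r"\d+(?:\.\d+)?"
PATTERNS = [
    ("Fixed Depth",    re.compile(r"^D=(\d+)$")),
    ("Fixed Nodes",    re.compile(r"^N=(\d+)$")),
    ("Fixed MoveTime", re.compile(r"^MT=(\d+)$")),
    ("Cyclic",         re.compile(rf"^(\d+)/({NUMBER})(?:\+({NUMBER}))?$")),
    ("Fischer",        re.compile(rf"^({NUMBER})(?:\+({NUMBER}))?$")),
]

def classify_time_control(text):
    """Return (type name, captured fields) for a time control string."""
    text = text.strip()
    for name, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return name, match.groups()
    raise ValueError(f"unrecognised time control: {text!r}")

for example in ("10.0+0.10", "40/10.0+1.0", "D=10", "N=10000", "MT=1000"):
    print(example, "->", classify_time_control(example)[0])
```

An omitted increment shows up as None in the captured fields, reflecting that the increment is optional for both the Fischer and Cyclic formats.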

Test Settings

The Test Mode dropdown allows for SPRT tests and for Fixed Games tests. For SPRT tests, the Bounds and Confidence must be set. The format for both of these fields is [lower, upper]. The Bounds field refers to the Elo bounds of the null and alternative hypotheses. Take a look at the fishtest SPRT Calculator to learn more. The Confidence refers to the type I and type II error rates, referred to as alpha and beta on the fishtest page. Generally, 0.05 is a good value. For Fixed Games tests, an upper bound on the number of games, Max Games, must be set.
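For intuition about what the Confidence values control, the sketch below computes the stopping thresholds of a standard Wald SPRT from alpha and beta. This is the textbook formulation, and sprt_thresholds() is a hypothetical helper. It assumes the two Confidence entries map directly to alpha and beta, and it is not taken from the OpenBench source.

```python
import math

def sprt_thresholds(alpha, beta):
    """Log-likelihood-ratio stopping thresholds for a Wald SPRT.

    alpha: accepted type I error rate (passing a change that is not better)
    beta:  accepted type II error rate (failing a change that is better)
    """
    lower = math.log(beta / (1.0 - alpha))   # stop and accept the null hypothesis
    upper = math.log((1.0 - beta) / alpha)   # stop and accept the alternative
    return lower, upper

lower, upper = sprt_thresholds(0.05, 0.05)
print(f"{lower:.3f} {upper:.3f}")  # -2.944 2.944
```

Under this formulation the test keeps running while the log-likelihood ratio stays between the two thresholds, so smaller alpha and beta values widen the interval and generally require more games.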

Note: These fields can be auto-filled by the configuration buttons mentioned before.

General Settings

The Opening Book dropdown allows you to select from the books configured in Books.json. Both EPD and PGN formats are supported. Books that contain the phrase "FRC" or "960" are assumed to be Fischer random. As a result, you need not set Fischer Random or Double-Fischer Random explicitly.

Priority is used to create an order of test completion. A worker will always take the highest priority test that it is able to play. If there is a priority 1 test and a priority 0 test, but the worker cannot complete the priority 1 test (it can't build the engine, it is missing the token, it lacks other requirements, ...), the priority 0 test will be assigned.

Throughput is used to scale the number of workers that a test gets. This is best explained with an example. Suppose there are 4 tests with the same priority, which all workers are able to complete, and each test has a throughput of 1000. OpenBench will assign the workers such that each test has ~25% of them (plus or minus some allowed variance). However, if you set the throughput of your test to 3000, then OpenBench will give you ~50% of the workers. This is because your test (3000) has 50% of the total (6000) throughput in the queue. Throughput can go as low as 1, and has no upper bound.
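The arithmetic in this example can be checked with a short sketch. The worker_shares() function and the test names are hypothetical. It only computes the target proportions and does not model the allowed variance or how OpenBench actually assigns workers.

```python
def worker_shares(throughputs):
    """Map each test to its fraction of the total throughput in the queue."""
    total = sum(throughputs.values())
    return {name: value / total for name, value in throughputs.items()}

# Four tests of equal priority; yours has been raised from 1000 to 3000.
shares = worker_shares({"yours": 3000, "test_a": 1000, "test_b": 1000, "test_c": 1000})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these values your test holds 3000 of the 6000 total throughput, so it targets half of the workers while the other three tests split the remainder equally.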

Workload Size controls the number of game pairs played in a workload. This is somewhat tricky, as it depends on the concurrency settings of the engines, and is once again best explained with an example. Suppose you connect a 32-core machine to a test with a workload size of 4, where both the Dev and Base engines have Threads=1 in their options. The worker will launch cutechess with a concurrency of 32. Each concurrent slot will look to play 4 game pairs, or 8 games, so in total the worker will play 256 games before finishing. However, if the Dev engine had Threads=4, the worker would only be able to play 8 concurrent games, and as a result would look to play 64 games before finishing. For SPRT and Fixed Games tests, this value is only used to help make sure that workers on the framework are "jumpy", and don't get stuck on one test for too long.
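The two cases in this example can be reproduced with a small calculation. The games_per_workload() function is a hypothetical helper, and it assumes that concurrency is the core count divided by the larger of the two Threads values, which matches the numbers above but may not capture every detail of how a worker chooses its concurrency.

```python
def games_per_workload(cores, workload_size, dev_threads, base_threads):
    """Return (concurrency, total games) for one workload on one worker."""
    # Assumption: each concurrent game reserves as many cores as the
    # more demanding of the two engines needs.
    concurrency = cores // max(dev_threads, base_threads)
    games_per_slot = 2 * workload_size  # each game pair is two games
    return concurrency, concurrency * games_per_slot

print(games_per_workload(32, 4, dev_threads=1, base_threads=1))  # (32, 256)
print(games_per_workload(32, 4, dev_threads=4, base_threads=1))  # (8, 64)
```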

Syzygy WDL allows you to require a Worker to have the Syzygy WDL Tablebases, which will be provided to the engine. This can be disabled, set to enforce tablebases of a specific size, or set to "Optional" to let the Worker use whatever it has. Generally it is considered a poor idea to use Syzygy WDL, unless you are specifically testing something that depends on it. This is because the read times for the tablebases, with so many concurrent games, can produce slowdowns.

Adjudication Settings

In order to save time, it is common practice to adjudicate games. This should be done aggressively, so long as the accuracy of adjudication remains very high. That can be measured independently. OpenBench provides three possible mechanisms to adjudicate:

Syzygy Adj. uses the Syzygy WDL tablebases to adjudicate the games via Cutechess. Like the Syzygy WDL option, this can be disabled, set to a specific size, or left entirely optional. Optional is recommended to save time.

Win Adj. will adjudicate the game once the engines have exceeded a certain margin for a number of moves. The format of this option is movecount=N score=M. This can also be set to None to disable all adjudication explicitly. Leaving the field blank will produce an error.

Draw Adj. will adjudicate the game once the engines have been below a certain margin for a number of moves, without a capture or pawn push, after a certain point in the game. The format of this option is movenumber=P movecount=N score=M. This can also be set to None to disable all adjudication explicitly. Leaving the field blank will produce an error.
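The Win Adj. and Draw Adj. formats can be illustrated with a small parser. The parse_adjudication() function is a hypothetical sketch of the rules described above (space-separated key=value pairs, None to disable, blank rejected). It is not OpenBench's own validation code, and the numeric values in the example are arbitrary placeholders, not recommended settings.

```python
def parse_adjudication(text, required_keys):
    """Parse 'key=value ...' into a dict, or return None when disabled."""
    text = text.strip()
    if not text:
        raise ValueError("field may not be blank; use None to disable")
    if text == "None":
        return None
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep or not value.isdigit():
            raise ValueError(f"expected key=<integer>, got {token!r}")
        options[key] = int(value)
    if set(options) != set(required_keys):
        raise ValueError(f"expected exactly the keys {sorted(required_keys)}")
    return options

# Placeholder values chosen only to show the shape of each field.
print(parse_adjudication("movecount=3 score=400", ("movecount", "score")))
print(parse_adjudication("movenumber=32 movecount=6 score=15",
                         ("movenumber", "movecount", "score")))
print(parse_adjudication("None", ("movecount", "score")))
```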