Project: Stratego LLM Test Based Games

1. Introduction

Stratego LLM is a research framework designed to evaluate the strategic reasoning and behavioral characteristics of Large Language Models (LLMs) in an imperfect-information game setting.

Unlike static benchmarks, this project pits models (e.g., Mistral, Gemma, Llama, Qwen) against each other in the board game Stratego to analyze dynamic performance. The primary goal is to determine which model performs better by measuring:

Win Rates & Dominance: Quantitative analysis of Win/Loss/Draw ratios across 100+ match simulations.

Behavioral Profiling: Classifying models as Stable (consistent, rule-abiding) vs. Aggressive (high attack frequency, risky plays).

Efficiency: Measuring time-to-move and token consumption to determine the "cost of intelligence."

Strategic Consistency: Analyzing how often models hallucinate invalid moves versus making logically sound decisions.

The system includes an automated arena for batch matchmaking, a custom logger for dataset creation, and a prompt-optimizer that refines strategies based on match outcomes.

2. Project Initialization (Local Setup)

Follow these steps to set up the development environment on your local machine.

Prerequisites

Git: Ensure Git is installed.
Python: Python 3.8+ is recommended.

Installation Steps

Clone the Repository Clone the project to your local machine (e.g., in VS Code).

git clone https://github.com/davszi/Stratego.git
cd Stratego

Create and Activate Virtual Environment It is highly recommended to work within a virtual environment.

Windows (PowerShell)

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Windows (CMD)

python -m venv .venv
.\.venv\Scripts\activate.bat

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

Note: If successful, you will see (.venv) at the start of your terminal line. To exit later, simply type deactivate.

Install Dependencies Update pip and install the package in editable mode.

python -m pip install --upgrade pip
pip install -e .

To install additional dependencies for Hugging Face models:

pip install -e ".[hf]"

Verify Installation Run the following command to install the environment files for Stratego Duel and Custom into the textarena folder.

stratego-install-env

3. Server Configuration (TU Clausthal SSH)

These steps are specific to the TU Clausthal cloud environment to manage disk quota and cache locations.

Connecting via SSH

Connect to the server using port forwarding for Ollama (Port 11437 in this example).

ssh -L 11437:localhost:11437 {user}@cloud-247.rz.tu-clausthal.de

Managing Cache (Critical)

To avoid filling up your home directory, redirect caches to the /scratch directory. Run these commands in your Bash terminal on the server.

Create Cache Directories:

mkdir -p /scratch/{user}/vs_cache
mkdir -p /scratch/{user}/hf_cache
mkdir -p /scratch/{user}/pip_cache

Export Environment Variables: Run the following to set the paths for the current session:

export VSCODE_SERVER_CACHE=/scratch/{user}/vs_cache
export HF_HOME=/scratch/{user}/hf_cache
export HUGGINGFACE_HUB_CACHE=/scratch/{user}/hf_cache
export TRANSFORMERS_CACHE=/scratch/{user}/hf_cache
export HF_DATASETS_CACHE=/scratch/{user}/hf_cache
export PIP_CACHE_DIR=/scratch/{user}/pip_cache

Permanent Configuration (Optional): To make these changes permanent, edit your .cshrc file:

nano ~/.cshrc

Add a line such as setenv HF_HOME /scratch/{user}/hf_cache for each variable listed above. Press Ctrl+X, then Y and Enter, to save and exit.

Cleanup: If you have an existing cache in your home directory, clear it to free up space:

rm -r ~/.cache

Restart VS Code to apply these changes.

4. Setting up Ollama on SSH Server

If you wish to host your own Ollama instance on the server:

Installation

Prepare Directories:

mkdir -p /scratch/{user}/ollama_bin
mkdir -p /scratch/{user}/ollama_model
mkdir -p /scratch/{user}/ollama_tmp
cd /scratch/{user}/ollama_bin

Download and Extract:

curl -fL -o ollama-linux-amd64.tar.zst https://github.com/ollama/ollama/releases/download/v0.14.0/ollama-linux-amd64.tar.zst
tar --use-compress-program=unzstd -xvf ollama-linux-amd64.tar.zst

Configure Environment: Set the paths so Ollama knows where to store models and temporary files.

export OLLAMA_MODELS=/scratch/{user}/ollama_model
export OLLAMA_TMPDIR=/scratch/{user}/ollama_tmp
export OLLAMA_HOST=0.0.0.0:{your_host_port}
export PATH="/scratch/{user}/ollama_bin/bin:$PATH"

Running the Server

It is recommended to run the server inside a tmux session so it persists after you disconnect.

Start a new session: tmux new -s ollama_server
Start Ollama: ollama serve
Detach: Press Ctrl+B then D.

Managing Models

Open a new local terminal (connected via SSH) to interact with the running server.

Check available models:

curl -s http://127.0.0.1:{your_host_port}/api/tags | jq -r '.models[].name'

Pull (Download) a new model:

curl -X POST http://127.0.0.1:{your_host_port}/api/pull \
     -H 'Content-Type: application/json' \
     -d '{"name":"mistral:7b"}'
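The model listing can also be done from Python with only the standard library. This is a minimal sketch, assuming the server answers GET /api/tags with a JSON object that holds a models array whose entries carry a name field, as the jq filter above implies. The names extract_model_names and list_models are illustrative and not part of the project.

```python
import json
from typing import Dict, List
from urllib.request import urlopen


def extract_model_names(payload: Dict) -> List[str]:
    """Return the model names from a parsed /api/tags response."""
    return [entry["name"] for entry in payload.get("models", [])]


def list_models(base_url: str) -> List[str]:
    """Query a running Ollama server and return the names of its models."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return extract_model_names(json.load(response))


# Usage (needs a reachable server; use the port you set in OLLAMA_HOST):
#   list_models("http://127.0.0.1:11437")
```

Keeping the parsing separate from the network call makes the helper easy to check without a running server.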

5. Usage & Gameplay

You can run the game using the stratego command. Ensure your Ollama server is running if you use local LLMs.

Basic Command:

stratego --p0 ollama:mistral:7b --p1 ollama:gemma3:1b --prompt base

Arguments:

--p0: Agent for Player 0 (e.g., ollama:mistral:7b or hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0).
--p1: Agent for Player 1.
--prompt: The prompt strategy to use.
--size: Board size (NxN). Default is 10.

# Example for a smaller board
stratego --p0 ollama:mistral:7b --p1 ollama:mistral:7b --size 6
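The agent arguments follow a backend:model pattern, where the model part may itself contain colons or slashes. How the project parses these strings is not shown in this README. The sketch below is one plausible way to split them, and parse_agent_spec is a hypothetical helper.

```python
from typing import Tuple


def parse_agent_spec(spec: str) -> Tuple[str, str]:
    """Split 'backend:model' at the first colon only.

    Hypothetical helper: the model part keeps its own colons,
    for example the tag separator in 'mistral:7b'.
    """
    backend, sep, model = spec.partition(":")
    if not sep or not backend or not model:
        raise ValueError(f"expected 'backend:model', got {spec!r}")
    return backend, model


print(parse_agent_spec("ollama:mistral:7b"))
# ('ollama', 'mistral:7b')
print(parse_agent_spec("hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0"))
# ('hf', 'TinyLlama/TinyLlama-1.1B-Chat-v1.0')
```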

6. Dataset & Prompt Optimization

The system includes automated logging and prompt improvement mechanisms implemented in main.py.

Logging (GameLogger)

Every move and prompt, together with its metadata, is saved to CSV logs. The cli() function initializes a GameLogger:

Log Directory: Controlled via --log-dir (default: logs).
Initial Prompt Logging: The logger captures the exact initial prompt used by the agent to ensure reproducibility.

Automated Prompt Improvement

The system automatically attempts to improve the System Prompt based on gameplay data.

Mechanism: The runner checks the number of CSV logs in the log directory.
Trigger: Every 3 games, improve_prompt() is called.
Logic: It analyzes recent games and updates stratego/prompts/current_prompt.txt.
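The trigger described above can be sketched as a simple count of log files. This is an illustration only, assuming one CSV file per game stored directly in the log directory. The helper name should_improve_prompt is hypothetical, and the actual check in main.py may differ.

```python
from pathlib import Path


def should_improve_prompt(log_dir: str, every: int = 3) -> bool:
    """Return True when the CSV log count is a positive multiple of `every`."""
    n_logs = sum(1 for _ in Path(log_dir).glob("*.csv"))
    return n_logs > 0 and n_logs % every == 0
```

A runner could call this after each finished game and invoke improve_prompt() only when it returns True.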

Hugging Face Dataset Integration

To upload your game logs to Hugging Face:

Install Libraries:

pip install huggingface huggingface_hub datasets

Configuration:

Join your HF Organization.
Update ./datasets/uploader.py with your repository name.

Authentication:

Create a WRITE token in your Hugging Face settings.
Run hf auth login in your terminal and paste the token (do not save as git credential).

Upload: Once authenticated, use the uploader script to push logs to the dataset repository.
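For orientation, an upload of this kind could look like the sketch below. It is not the project's uploader.py. It assumes the logs are CSV files in a single folder and uses HfApi.upload_folder from huggingface_hub. The function name upload_logs and the repository id in the usage comment are placeholders.

```python
from pathlib import Path
from typing import Any, Optional


def upload_logs(log_dir: str, repo_id: str, api: Optional[Any] = None) -> None:
    """Push every CSV file in `log_dir` to a Hugging Face dataset repository."""
    if not Path(log_dir).is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    if api is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import HfApi
        api = HfApi()  # picks up the token stored by `hf auth login`
    api.upload_folder(
        folder_path=log_dir,
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=["*.csv"],
    )


# Usage (placeholder repository id):
#   upload_logs("logs", "your-org/your-dataset")
```

The optional api argument only exists so the function can be exercised with a stand-in object instead of a real network call.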

7. Benchmarking

Use the built-in benchmark tool to evaluate model performance over multiple games.

Command:

benchmark --p0 {model_A} --p1 {model_B} --size {N} --game {count}

Example:

benchmark --p0 llama3.2:1b --p1 gemma3:1b --size 6 --game 4

8. GUI

After installing the package, run the gui command to play the game in a graphical user interface.

9. Scoring

Because we played a different number of games with each model, raw win counts are not directly comparable, so we needed another method to finalize the score. We built a score equation from the following factors: the win/draw rate, the number of games that reached the turn limit, the total number of games played, the win rate when playing as Player 1, and the number of losses caused by invalid moves.

Score based on the win/draw rate across board sizes. $$S^{size}_m = \frac{\sum_{s\in\{4,5,6\}}w_s(W_{m,s}+0.5D_{m,s})}{\sum_{s\in\{4,5,6\}}w_sN_{m,s}}$$ $m$ is the model, $o$ is the opponent, $w_s$ is the weight for board size $s$, $N_{m,s}$ is the total number of games that model $m$ played on board size $s$, $W_{m,s}$ is the number of wins of model $m$ on board size $s$, and $D_{m,s}$ is the number of draws of model $m$ on board size $s$.
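As a worked illustration of the size score, the sketch below evaluates the formula for made-up game counts. The board-size weights are the ones listed with the final weights in this section. The function name size_score and the example numbers are assumptions for illustration only.

```python
from typing import Dict, Tuple

# Board-size weights w_4, w_5, w_6 as listed in this section.
SIZE_WEIGHTS = {4: 0.75, 5: 1.00, 6: 1.35}


def size_score(results: Dict[int, Tuple[int, int, int]]) -> float:
    """Weighted win/draw rate; `results` maps board size -> (wins, draws, games)."""
    numerator = sum(SIZE_WEIGHTS[s] * (w + 0.5 * d) for s, (w, d, n) in results.items())
    denominator = sum(SIZE_WEIGHTS[s] * n for s, (w, d, n) in results.items())
    return numerator / denominator if denominator else 0.0


# Made-up counts: (wins, draws, games) per board size.
example = {4: (10, 4, 20), 5: (6, 2, 20), 6: (3, 2, 10)}
print(round(size_score(example), 4))  # 21.4 / 48.5, about 0.4412
```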
Score based on the share of games that reached the turn limit. $$S^{speed}_m = 1-\frac{EndedByTurnLimit_m}{N_m}$$ $N_m$ is the total number of games played by model $m$.
Deduction based on the number of games lost by invalid moves. $$R^{inv}_m = \frac{LostByInvalid_m}{N_m}$$
Bonus for playing as Player 1:
- Compute the win rate of model $m$ as Player 1 against an opponent $o$. $$p1(m \rightarrow o)=\frac{W^{P1}_{m,o}+0.5D_{m,o}}{G_{m,o}}$$ $G_{m,o}$ is the total number of games played between model $m$ as Player 1 and opponent $o$ as Player 2.
- Now compute the opponent's Player 1 win rate for the same pair. $$p1(o \rightarrow m)=\frac{W^{P1}_{o,m}+0.5D_{o,m}}{G_{o,m}}$$ $G_{o,m}$ is the total number of games played between opponent $o$ as Player 1 and model $m$ as Player 2.
- Next, compute the win rate of model $m$ as Player 2 against the same opponent $o$, that is, in the games where $o$ plays as Player 1. $$p2(m \leftarrow o)=\frac{W^{P2}_{o,m}+0.5D_{o,m}}{G_{o,m}}$$
- Compute the difference between the Player 1 win rates of model $m$ and opponent $o$. $$edge_1(m,o) = p1(m\rightarrow o)-p1(o\rightarrow m)$$
- Next, compare the Player 1 win rate $p1$ of model $m$ with its Player 2 win rate $p2$. $$edge_2(m,o) = p1(m\rightarrow o)-p2(m\leftarrow o)$$ $edge_2(m,o)$ is usually negative, because the models tend to have a higher win rate when they play as Player 2, so the outcomes are biased toward Player 2.
- Now compute the total bonus for the pair. $$bonus_{pair}(m,o) = \max(0,edge_1(m,o))+\max(0,edge_2(m,o))$$
- Weight each pair by the smaller of its two game counts. $$w_{m,o}=\min(G_{m,o},G_{o,m})$$
- Then compute the raw bonus by repeating the steps above for all opponents and taking the weighted average. $$B^{raw}_m = \frac{\sum_o w_{m,o}bonus_{pair}(m,o)}{\sum_o w_{m,o}}$$
- Find the final bonus by dividing the raw bonus by 0.2 and clamping the result between 0 and 1, so the minimum is 0 and the maximum is 1. $$B_m=clip\left(\frac{B^{raw}_{m}}{0.2},0,1\right)$$ With this scaling, a raw bonus of 0.2 or more yields the maximum bonus.
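The bonus steps above can be sketched in a few lines. The code takes the three win rates and the two game counts per opponent as inputs. The function names and the tuple layout are assumptions made for this example, not the project's data format.

```python
from typing import Dict, Tuple


def pair_bonus(p1_m: float, p1_o: float, p2_m: float) -> float:
    """bonus_pair(m, o) = max(0, edge_1) + max(0, edge_2)."""
    edge_1 = p1_m - p1_o  # m as Player 1 compared with o as Player 1
    edge_2 = p1_m - p2_m  # m as Player 1 compared with m as Player 2
    return max(0.0, edge_1) + max(0.0, edge_2)


def player1_bonus(pairs: Dict[str, Tuple[float, float, float, int, int]]) -> float:
    """B_m from per-opponent tuples (p1_m, p1_o, p2_m, G_mo, G_om)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for p1_m, p1_o, p2_m, g_mo, g_om in pairs.values():
        weight = min(g_mo, g_om)  # w_{m,o}
        weighted_sum += weight * pair_bonus(p1_m, p1_o, p2_m)
        weight_total += weight
    raw = weighted_sum / weight_total if weight_total else 0.0
    return min(1.0, max(0.0, raw / 0.2))  # clip(raw / 0.2, 0, 1)


# Made-up win rates and game counts for two opponents.
pairs = {"opp_a": (0.6, 0.5, 0.7, 10, 8), "opp_b": (0.4, 0.5, 0.6, 6, 6)}
print(round(player1_bonus(pairs), 4))
```

In this made-up case only opp_a contributes, through a positive edge_1, while both edge_2 values are negative and are cut off at zero.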
Reliability coefficient based on the Buehlmann constant $k$. $$C_m=\sqrt{\frac{N_m}{N_m+k}}$$ $k$ is set to 300 here, so that a model with only 100 games played gets $C_m = 0.5$, meaning it is treated as half reliable.
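A short numeric check shows how the coefficient grows with the number of games. The function name credibility is illustrative, and the game counts are arbitrary sample values.

```python
import math


def credibility(n_games: int, k: float = 300.0) -> float:
    """C_m = sqrt(N_m / (N_m + k))."""
    return math.sqrt(n_games / (n_games + k))


for n in (0, 100, 300, 900):
    print(n, round(credibility(n), 3))
# 0 0.0
# 100 0.5
# 300 0.707
# 900 0.866
```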
To compute the final score, we set the weights as follows:
$a=0.60$ weight of the board-size win/draw score $S^{size}_m$
$b=0.15$ weight of the reward for ending games before the turn limit $S^{speed}_m$
$c=0.25$ weight of the deduction for losing by invalid moves $R^{inv}_m$
$d=0.20$ weight of the bonus for winning as Player 1 $B_m$
$w_4=0.75$ weight for the 4 x 4 board
$w_5=1.00$ weight for the 5 x 5 board
$w_6=1.35$ weight for the 6 x 6 board
Credibility constant $k = 300$
Combining all scores and deductions gives the equation for the final score of each model:

$$Score_m=100\cdot C_m\cdot (aS^{size}_m+bS^{speed}_m-cR^{inv}_m+dB_m)$$
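To make the combination concrete, the sketch below plugs made-up component values into the final equation, using the weights and the credibility constant given in this section. The function name final_score and the input numbers are illustrative assumptions.

```python
import math

A, B, C, D = 0.60, 0.15, 0.25, 0.20  # weights a, b, c, d from this section
K = 300                               # credibility constant k


def final_score(s_size: float, s_speed: float, r_inv: float,
                bonus: float, n_games: int) -> float:
    """Score_m = 100 * C_m * (a*S_size + b*S_speed - c*R_inv + d*B_m)."""
    credibility = math.sqrt(n_games / (n_games + K))
    combined = A * s_size + B * s_speed - C * r_inv + D * bonus
    return 100 * credibility * combined


# Made-up inputs: S_size=0.5, S_speed=0.8, R_inv=0.1, B_m=0.25, N_m=100.
print(round(final_score(0.5, 0.8, 0.1, 0.25, 100), 2))  # 22.25
```

With 100 games the credibility factor is 0.5, so the combined value of 0.445 is halved before scaling to 100.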
