Stratego LLM is a research framework designed to evaluate the strategic reasoning and behavioral characteristics of Large Language Models (LLMs) in an imperfect-information game setting.
Unlike static benchmarks, this project pits models (e.g., Mistral, Gemma, Llama, Qwen) against each other in the board game Stratego to analyze dynamic performance. The primary goal is to determine which model performs better by measuring:
- Win Rates & Dominance: Quantitative analysis of Win/Loss/Draw ratios across 100+ match simulations.
- Behavioral Profiling: Classifying models as Stable (consistent, rule-abiding) vs. Aggressive (high attack frequency, risky plays).
- Efficiency: Measuring time-to-move and token consumption to determine the "cost of intelligence."
- Strategic Consistency: Analyzing how often models hallucinate invalid moves versus making logically sound decisions.
The system includes an automated arena for batch matchmaking, a custom logger for dataset creation, and a prompt-optimizer that refines strategies based on match outcomes.
Follow these steps to set up the development environment on your local machine.
- Git: Ensure Git is installed.
- Python: Python 3.8+ is recommended.
- Clone the Repository: Clone the project to your local machine (e.g., in VS Code).
git clone https://github.com/davszi/Stratego.git
cd Stratego
- Create and Activate Virtual Environment: It is highly recommended to work within a virtual environment.
- Windows (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1
- Windows (CMD)
python -m venv .venv
.\.venv\Scripts\activate.bat
- macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
Note: If successful, you will see (.venv) at the start of your terminal line. To exit later, simply type deactivate.
- Install Dependencies: Update pip and install the package in editable mode.
python -m pip install --upgrade pip
pip install -e .
To install additional dependencies for Hugging Face models:
pip install -e ".[hf]"
- Verify Installation: Use the following command to install the environment files for Stratego Duel and Custom into the textarena folder.
stratego-install-env
These steps are specific to the TU Clausthal cloud environment to manage disk quota and cache locations.
Connect to the server using port forwarding for Ollama (Port 11437 in this example).
ssh -L 11437:localhost:11437 {user}@cloud-247.rz.tu-clausthal.de
To avoid filling up your home directory, redirect caches to the /scratch directory. Run these commands in your Bash terminal on the server.
- Create Cache Directories:
mkdir -p /scratch/{user}/vs_cache
mkdir -p /scratch/{user}/hf_cache
mkdir -p /scratch/{user}/pip_cache
- Export Environment Variables: Run the following to set the paths for the current session:
export VSCODE_SERVER_CACHE=/scratch/{user}/vs_cache
export HF_HOME=/scratch/{user}/hf_cache
export HUGGINGFACE_HUB_CACHE=/scratch/{user}/hf_cache
export TRANSFORMERS_CACHE=/scratch/{user}/hf_cache
export HF_DATASETS_CACHE=/scratch/{user}/hf_cache
export PIP_CACHE_DIR=/scratch/{user}/pip_cache
- Permanent Configuration (Optional):
To make these changes permanent, edit your .cshrc file:
nano ~/.cshrc
Add lines such as setenv HF_HOME /scratch/{user}/hf_cache for each variable listed above. Press Ctrl+X, then Y and Enter, to save and exit.
- Cleanup:
If you have existing cache in your home directory, clear it to free up space:
rm -r ~/.cache
Restart VS Code to apply these changes.
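The permanent configuration needs one setenv line per cache variable. As a minimal sketch, the Python helper below prints those lines for a given user name so they can be pasted into ~/.cshrc. The function name cshrc_lines and the scratch_root parameter are illustrative assumptions, not part of the Stratego package; the variable names and directory names are taken from the export commands above.

```python
from typing import Dict, List

# Cache variables and their sub-directories, as listed in the export commands.
CACHE_VARS: Dict[str, str] = {
    "VSCODE_SERVER_CACHE": "vs_cache",
    "HF_HOME": "hf_cache",
    "HUGGINGFACE_HUB_CACHE": "hf_cache",
    "TRANSFORMERS_CACHE": "hf_cache",
    "HF_DATASETS_CACHE": "hf_cache",
    "PIP_CACHE_DIR": "pip_cache",
}


def cshrc_lines(user: str, scratch_root: str = "/scratch") -> List[str]:
    """Return csh `setenv` lines (hypothetical helper) for every cache variable."""
    return [
        f"setenv {name} {scratch_root}/{user}/{subdir}"
        for name, subdir in CACHE_VARS.items()
    ]


if __name__ == "__main__":
    # Replace "your_user" with your own account name before pasting the output.
    for line in cshrc_lines("your_user"):
        print(line)
```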
If you wish to host your own Ollama instance on the server:
- Prepare Directories:
mkdir -p /scratch/{user}/ollama_bin
mkdir -p /scratch/{user}/ollama_model
mkdir -p /scratch/{user}/ollama_tmp
cd /scratch/{user}/ollama_bin
- Download and Extract:
curl -fL -o ollama-linux-amd64.tar.zst https://github.com/ollama/ollama/releases/download/v0.14.0/ollama-linux-amd64.tar.zst
tar --use-compress-program=unzstd -xvf ollama-linux-amd64.tar.zst
- Configure Environment: Set the paths so Ollama knows where to store models and temporary files.
export OLLAMA_MODELS=/scratch/{user}/ollama_model
export OLLAMA_TMPDIR=/scratch/{user}/ollama_tmp
export OLLAMA_HOST=0.0.0.0:{your_host_port}
export PATH="/scratch/{user}/ollama_bin/bin:$PATH"
It is recommended to run the server inside a tmux session so it persists after you disconnect.
- Start a new session:
tmux new -s ollama_server
- Start Ollama:
ollama serve
- Detach: Press Ctrl+B, then D.
Open a new local terminal (connected via SSH) to interact with the running server.
- Check available models:
curl -s http://127.0.0.1:{your_host_port}/api/tags | jq -r '.models[].name'
- Pull (Download) a new model:
curl -X POST http://127.0.0.1:{your_host_port}/api/pull \
-H 'Content-Type: application/json' \
-d '{"name":"mistral:7b"}'
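The same two requests can be made from Python with only the standard library. This is a sketch under the assumptions visible in the curl commands above: /api/tags returns JSON with a models list whose entries carry a name field, and /api/pull accepts a JSON body with a name key. The function names are hypothetical and not part of the Stratego package.

```python
import json
from typing import List
from urllib.request import Request, urlopen


def parse_model_names(tags_json: str) -> List[str]:
    """Extract model names from an /api/tags body (mirrors jq '.models[].name')."""
    payload = json.loads(tags_json)
    return [entry["name"] for entry in payload.get("models", [])]


def list_models(base_url: str, timeout: float = 10.0) -> List[str]:
    """Query a running Ollama server for its available models."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(resp.read().decode("utf-8"))


def build_pull_request(base_url: str, model: str) -> Request:
    """Build (but do not send) the POST request that asks the server to pull a model."""
    body = json.dumps({"name": model}).encode("utf-8")
    return Request(
        f"{base_url}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the pull request, pass the result of build_pull_request to urlopen; the base URL is http://127.0.0.1:{your_host_port} with the port you forwarded.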
You can run the game using the stratego command. If you use local LLMs, make sure your Ollama server is running first.
Basic Command:
stratego --p0 ollama:mistral:7b --p1 ollama:gemma3:1b --prompt base
Arguments:
- --p0: Agent for Player 0 (e.g., ollama:mistral:7b or hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0).
- --p1: Agent for Player 1.
- --prompt: The prompt strategy to use.
- --size: Board size (NxN). Default is 10.
# Example for a smaller board
stratego --p0 ollama:mistral:7b --p1 ollama:mistral:7b --size 6
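The agent arguments shown above follow a backend:model pattern, where the model part may itself contain a colon (mistral:7b) or a slash (TinyLlama/...). The sketch below shows one way such a string could be split; parse_agent_spec is a hypothetical helper for illustration, and the actual parsing inside the stratego command may differ.

```python
from typing import Tuple


def parse_agent_spec(spec: str) -> Tuple[str, str]:
    """Split 'backend:model' on the FIRST colon only (hypothetical helper).

    'ollama:mistral:7b' keeps 'mistral:7b' intact as the model name.
    """
    backend, sep, model = spec.partition(":")
    if not sep or not backend or not model:
        raise ValueError(f"expected '<backend>:<model>', got {spec!r}")
    return backend, model


if __name__ == "__main__":
    for example in ("ollama:mistral:7b", "hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        print(example, "->", parse_agent_spec(example))
```

Splitting on the first colon only is the design choice that matters here: a plain split(":") would break Ollama tags such as mistral:7b into two pieces.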
The system includes automated logging and prompt improvement mechanisms implemented in main.py.
Every move and prompt, together with its metadata, is saved to CSV logs. The cli() function initializes a GameLogger:
- Log Directory: Controlled via --log-dir (default: logs).
- Initial Prompt Logging: The logger captures the exact initial prompt used by the agent to ensure reproducibility.
The system automatically attempts to improve the System Prompt based on gameplay data.
- Mechanism: The runner checks the number of CSV logs in the log directory.
- Trigger: Every 3 games, improve_prompt() is called.
- Logic: It analyzes recent games and updates stratego/prompts/current_prompt.txt.
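The trigger described above can be sketched as a count of CSV files followed by a modulo check. This is an illustration only: it assumes one CSV file per finished game, and the helper names count_game_logs and should_improve_prompt are hypothetical. The real check lives in main.py and may be implemented differently.

```python
import tempfile
from pathlib import Path
from typing import Union


def count_game_logs(log_dir: Union[str, Path]) -> int:
    """Count CSV files in the log directory (assumes one CSV per game)."""
    return sum(1 for _ in Path(log_dir).glob("*.csv"))


def should_improve_prompt(log_dir: Union[str, Path], every: int = 3) -> bool:
    """Return True when the number of logged games is a positive multiple of `every`."""
    n_logs = count_game_logs(log_dir)
    return n_logs > 0 and n_logs % every == 0


if __name__ == "__main__":
    # Simulate six finished games in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        for game in range(1, 7):
            (Path(tmp) / f"game_{game}.csv").write_text("placeholder\n")
            print(game, should_improve_prompt(tmp))
```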
To upload your game logs to Hugging Face:
- Install Libraries:
pip install huggingface huggingface_hub datasets
- Configuration:
  - Join your HF Organization.
  - Update ./datasets/uploader.py with your repository name.
- Authentication:
  - Create a WRITE token in your Hugging Face settings.
  - Run hf auth login in your terminal and paste the token (do not save it as a git credential).
- Upload: Once authenticated, use the uploader script to push logs to the dataset repository.
Use the built-in benchmark tool to evaluate model performance over multiple games.
Command:
benchmark --p0 {model_A} --p1 {model_B} --size {N} --game {count}
Example:
benchmark --p0 llama3.2:1b --p1 gemma3:1b --size 6 --game 4
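When several models are compared, each pairing is usually run in both seat orders so that Player 1 and Player 2 results exist for every model. The sketch below only builds the command lines for such a round-robin, using the benchmark flags shown above; benchmark_commands is a hypothetical helper and not part of the package.

```python
from itertools import permutations
from typing import List, Sequence


def benchmark_commands(models: Sequence[str], size: int, games: int) -> List[List[str]]:
    """Build one `benchmark` command per ordered model pair (both seat orders)."""
    return [
        ["benchmark", "--p0", p0, "--p1", p1, "--size", str(size), "--game", str(games)]
        for p0, p1 in permutations(models, 2)
    ]


if __name__ == "__main__":
    for cmd in benchmark_commands(["llama3.2:1b", "gemma3:1b"], size=6, games=4):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the package is installed and the models are available.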
After installing the package, you can use the gui command to run the game in a graphical user interface.
Since we played a different number of games for each model, we needed another method to finalize the score. We built a score equation from the following factors: win/draw rate, number of games that reached the turn limit, total number of games played, win rate when playing as Player 1, and number of losses caused by invalid moves.
- Score based on the win/draw rate across the different board sizes:
$$S^{size}_m = \frac{\sum_{s\in\{4,5,6\}} w_s\,(W_{m,s}+0.5\,D_{m,s})}{\sum_{s\in\{4,5,6\}} w_s\,N_{m,s}}$$
$m$ is the model, $o$ is the opponent, $w_s$ is the weight for board size $s$, $N_{m,s}$ is the total number of games that model $m$ played on board size $s$, $W_{m,s}$ is the number of wins of model $m$ on board size $s$, and $D_{m,s}$ is the number of draws of model $m$ on board size $s$.
- Score based on the total number of games that reached the turn limit:
$$S^{speed}_m = 1-\frac{EndedByTurnLimit_m}{N_m}$$
$N_m$ is the total number of games played by model $m$.
- Reduction based on the number of games lost through invalid moves:
$$R^{inv}_m = \frac{LostByInvalid_m}{N_m}$$
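The three quantities above translate directly into code. The following is a minimal sketch, not the project's actual implementation: the function names and the dictionary-based inputs (keyed by board size) are assumptions made for illustration, while the board-size weights are the values given in the weight list of this section.

```python
from typing import Dict

# Board-size weights w_4, w_5, w_6 as stated in this document.
BOARD_WEIGHTS: Dict[int, float] = {4: 0.75, 5: 1.00, 6: 1.35}


def size_score(wins: Dict[int, int], draws: Dict[int, int], games: Dict[int, int],
               weights: Dict[int, float] = BOARD_WEIGHTS) -> float:
    """S^size_m: weighted (wins + 0.5 * draws) divided by weighted games played."""
    numerator = sum(w * (wins.get(s, 0) + 0.5 * draws.get(s, 0)) for s, w in weights.items())
    denominator = sum(w * games.get(s, 0) for s, w in weights.items())
    return numerator / denominator if denominator else 0.0


def speed_score(ended_by_turn_limit: int, total_games: int) -> float:
    """S^speed_m: share of games that did NOT end by reaching the turn limit."""
    if total_games <= 0:
        raise ValueError("total_games must be positive")
    return 1.0 - ended_by_turn_limit / total_games


def invalid_reduction(lost_by_invalid: int, total_games: int) -> float:
    """R^inv_m: share of games lost because of an invalid move."""
    if total_games <= 0:
        raise ValueError("total_games must be positive")
    return lost_by_invalid / total_games


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(size_score({4: 5, 5: 3, 6: 2}, {4: 2, 5: 0, 6: 4}, {4: 10, 5: 10, 6: 10}))
    print(speed_score(6, 30), invalid_reduction(3, 30))
```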
For the Player 1 bonus, first compute the win rate of model $m$ as Player 1 against an opponent $o$:
$$p1(m \rightarrow o)=\frac{W^{P1}_{m,o}+0.5\,D_{m,o}}{G_{m,o}}$$
$G_{m,o}$ is the total number of games played with model $m$ as Player 1 and opponent $o$ as Player 2.
- Now compute the opponent's Player 1 win rate from the set:
$$p1(o \rightarrow m)=\frac{W^{P1}_{o,m}+0.5\,D_{o,m}}{G_{o,m}}$$
$G_{o,m}$ is the total number of games played with opponent $o$ as Player 1 and model $m$ as Player 2.
- Next, compute the win rate of model $m$ when it plays as Player 2 against the same opponent $o$, meaning $o$ plays as Player 1:
$$p2(m \leftarrow o)=\frac{W^{P2}_{o,m}+0.5\,D_{o,m}}{G_{o,m}}$$
- Compute the difference between the Player 1 win rates of model $m$ and opponent $o$:
$$edge_1(m,o) = p1(m\rightarrow o)-p1(o\rightarrow m)$$
- Next, compare the Player 1 win rate $p1$ of model $m$ with its Player 2 win rate $p2$:
$$edge_2(m,o) = p1(m\rightarrow o)-p2(m\leftarrow o)$$
$edge_2(m,o)$ is usually negative, because the models have a higher win rate when they play as Player 2. Their play is biased.
- Now compute the total bonus for the pairing:
$$bonus_{pair}(m,o) = \max(0,edge_1(m,o))+\max(0,edge_2(m,o))$$
- Set the weight of each pairing to the smaller of its two game counts:
$$w_{m,o}=\min(G_{m,o},G_{o,m})$$
- Then compute the raw bonus by repeating the steps above for all opponents:
$$B^{raw}_m = \frac{\sum_o w_{m,o}\,bonus_{pair}(m,o)}{\sum_o w_{m,o}}$$
- Find the total bonus. Divide the raw bonus by 0.2 and clamp the result between 0 and 1, so the final bonus is at least 0 and at most 1:
$$B_m=clip\left(\frac{B^{raw}_{m}}{0.2},0,1\right)$$
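The Player 1 bonus steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, and each pairing is passed in as a (weight, pair bonus) tuple that the caller has already computed from the game counts.

```python
from typing import Iterable, Tuple


def rate(wins: int, draws: int, games: int) -> float:
    """(wins + 0.5 * draws) / games, the form shared by p1 and p2."""
    if games <= 0:
        raise ValueError("games must be positive")
    return (wins + 0.5 * draws) / games


def pair_bonus(p1_m: float, p1_o: float, p2_m: float) -> float:
    """bonus_pair(m, o) = max(0, edge_1) + max(0, edge_2)."""
    edge_1 = p1_m - p1_o  # m as Player 1 versus o as Player 1
    edge_2 = p1_m - p2_m  # m as Player 1 versus m as Player 2
    return max(0.0, edge_1) + max(0.0, edge_2)


def final_bonus(pairs: Iterable[Tuple[float, float]], scale: float = 0.2) -> float:
    """B_m: weighted mean of pair bonuses, divided by `scale`, clipped to [0, 1].

    `pairs` holds (w_{m,o}, bonus_pair(m, o)) for every opponent o,
    where w_{m,o} = min(G_{m,o}, G_{o,m}).
    """
    pairs = list(pairs)
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return 0.0
    raw = sum(w * b for w, b in pairs) / total_weight
    return min(1.0, max(0.0, raw / scale))


if __name__ == "__main__":
    # Made-up counts: m as Player 1 wins 6 and draws 2 of 10 games;
    # with o as Player 1, o wins 5, m wins 4, and 1 game is drawn.
    p1_m, p1_o, p2_m = rate(6, 2, 10), rate(5, 1, 10), rate(4, 1, 10)
    bonus = pair_bonus(p1_m, p1_o, p2_m)
    print(bonus, final_bonus([(min(10, 10), bonus)]))
```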
- Compute the reliability coefficient $C_m$ using the Buehlmann constant $k$. Here $k$ is set to 300, so that a model with only 100 games played gets $C_m=\sqrt{100/400}=0.5$, meaning it counts as half reliable.
$$C_m=\sqrt{\frac{N_m}{N_m+k}}$$
- For computing the final score, we set the weights as follows:
$a=0.60$: reward for the board-size score
$b=0.15$: reward for finishing games before the turn limit
$c=0.25$: deduction for invalid moves
$d=0.20$: reward for winning as Player 1
$w_4=0.75$: weight for the 4 x 4 board
$w_5=1.00$: weight for the 5 x 5 board
$w_6=1.35$: weight for the 6 x 6 board
Credibility $k = 300$
- Combining all scores and reductions, here is the score equation for our final score of each model: