DarshanScripts/stratego


Project: Stratego LLM Test Based Games

1. Introduction

Stratego LLM is a research framework designed to evaluate the strategic reasoning and behavioral characteristics of Large Language Models (LLMs) in an imperfect-information game setting.

Unlike static benchmarks, this project pits models (e.g., Mistral, Gemma, Llama, Qwen) against each other in the board game Stratego to analyze dynamic performance. The primary goal is to determine which model performs better by measuring:

Win Rates & Dominance: Quantitative analysis of Win/Loss/Draw ratios across 100+ match simulations.

Behavioral Profiling: Classifying models as Stable (consistent, rule-abiding) vs. Aggressive (high attack frequency, risky plays).

Efficiency: Measuring time-to-move and token consumption to determine the "cost of intelligence."

Strategic Consistency: Analyzing how often models hallucinate invalid moves versus making logically sound decisions.

The system includes an automated arena for batch matchmaking, a custom logger for dataset creation, and a prompt-optimizer that refines strategies based on match outcomes.

2. Project Initialization (Local Setup)

Follow these steps to set up the development environment on your local machine.

Prerequisites

  • Git: Ensure Git is installed.
  • Python: Python 3.8+ is recommended.

Installation Steps

  1. Clone the Repository: Clone the project to your local machine (e.g., in VS Code).
git clone https://github.com/davszi/Stratego.git
cd Stratego
  2. Create and Activate Virtual Environment: It is highly recommended to work within a virtual environment.
  • Windows (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1
  • Windows (CMD)
python -m venv .venv
.\.venv\Scripts\activate.bat
  • macOS / Linux
python3 -m venv .venv
source .venv/bin/activate

Note: If successful, you will see (.venv) at the start of your terminal line. To exit later, simply type deactivate.

  3. Install Dependencies: Update pip and install the package in editable mode.
python -m pip install --upgrade pip
pip install -e .

To install additional dependencies for Hugging Face models:

pip install -e ".[hf]"
  4. Verify Installation: Run the following command, which installs the environment files for Stratego Duel and Custom into the textarena folder.
stratego-install-env

3. Server Configuration (TU Clausthal SSH)

These steps are specific to the TU Clausthal cloud environment to manage disk quota and cache locations.

Connecting via SSH

Connect to the server using port forwarding for Ollama (Port 11437 in this example).

ssh -L 11437:localhost:11437 {user}@cloud-247.rz.tu-clausthal.de

Managing Cache (Critical)

To avoid filling up your home directory, redirect caches to the /scratch directory. Run these commands in your Bash terminal on the server.

  1. Create Cache Directories:
mkdir -p /scratch/{user}/vs_cache
mkdir -p /scratch/{user}/hf_cache
mkdir -p /scratch/{user}/pip_cache
  2. Export Environment Variables: Run the following to set the paths for the current session:
export VSCODE_SERVER_CACHE=/scratch/{user}/vs_cache
export HF_HOME=/scratch/{user}/hf_cache
export HUGGINGFACE_HUB_CACHE=/scratch/{user}/hf_cache
export TRANSFORMERS_CACHE=/scratch/{user}/hf_cache
export HF_DATASETS_CACHE=/scratch/{user}/hf_cache
export PIP_CACHE_DIR=/scratch/{user}/pip_cache
  3. Permanent Configuration (Optional): To make these changes permanent, edit your .cshrc file:
nano ~/.cshrc

Add lines such as setenv HF_HOME /scratch/{user}/hf_cache for each variable listed above. Press Ctrl+X, then Y and Enter, to save and exit.

  4. Cleanup: If you have existing caches in your home directory, clear them to free up space:

rm -r ~/.cache

Restart VS Code to apply these changes.
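
The session exports in step 2 use Bash syntax, while .cshrc expects csh setenv lines. As a convenience, the following minimal sketch prints the setenv line for each cache variable listed above. The helper name cshrc_lines and the scratch_root parameter are hypothetical and not part of this repository; the variable names and directory layout are taken from the steps above.

```python
from typing import Dict, List

# Cache variables from step 2, mapped to their sub-directory under /scratch/{user}.
CACHE_VARS: Dict[str, str] = {
    "VSCODE_SERVER_CACHE": "vs_cache",
    "HF_HOME": "hf_cache",
    "HUGGINGFACE_HUB_CACHE": "hf_cache",
    "TRANSFORMERS_CACHE": "hf_cache",
    "HF_DATASETS_CACHE": "hf_cache",
    "PIP_CACHE_DIR": "pip_cache",
}


def cshrc_lines(user: str, scratch_root: str = "/scratch") -> List[str]:
    """Return one csh `setenv` line per cache variable (hypothetical helper)."""
    return [
        f"setenv {name} {scratch_root}/{user}/{subdir}"
        for name, subdir in CACHE_VARS.items()
    ]


if __name__ == "__main__":
    import getpass

    # Print the lines for the current user; paste them into ~/.cshrc.
    print("\n".join(cshrc_lines(getpass.getuser())))
```

Review the printed lines before pasting them, since the script assumes your scratch directory is named after your login.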


4. Setting up Ollama on SSH Server

If you wish to host your own Ollama instance on the server:

Installation

  1. Prepare Directories:
mkdir -p /scratch/{user}/ollama_bin
mkdir -p /scratch/{user}/ollama_model
mkdir -p /scratch/{user}/ollama_tmp
cd /scratch/{user}/ollama_bin
  2. Download and Extract:
curl -fL -o ollama-linux-amd64.tar.zst https://github.com/ollama/ollama/releases/download/v0.14.0/ollama-linux-amd64.tar.zst
tar --use-compress-program=unzstd -xvf ollama-linux-amd64.tar.zst
  3. Configure Environment: Set the paths so Ollama knows where to store models and temporary files.
export OLLAMA_MODELS=/scratch/{user}/ollama_model
export OLLAMA_TMPDIR=/scratch/{user}/ollama_tmp
export OLLAMA_HOST=0.0.0.0:{your_host_port}
export PATH="/scratch/{user}/ollama_bin/bin:$PATH"

Running the Server

It is recommended to run the server inside a tmux session so it persists after you disconnect.

  1. Start a new session: tmux new -s ollama_server
  2. Start Ollama: ollama serve
  3. Detach: Press Ctrl+B then D.

Managing Models

Open a new terminal connected to the server via SSH to interact with the running Ollama instance.

  • Check available models:
curl -s http://127.0.0.1:{your_host_port}/api/tags | jq -r '.models[].name'
  • Pull (Download) a new model:
curl -X POST http://127.0.0.1:{your_host_port}/api/pull \
     -H 'Content-Type: application/json' \
     -d '{"name":"mistral:7b"}'
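
If jq is not available on the server, the same model listing can be done from Python's standard library. This is a minimal sketch, assuming the Ollama instance from the previous steps is reachable on the port you chose; the function names model_names and list_models are illustrative and not part of this repository.

```python
import json
import urllib.request
from typing import List


def model_names(tags_payload: dict) -> List[str]:
    """Extract the model names from a decoded /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_models(port: int, host: str = "127.0.0.1", timeout: float = 5.0) -> List[str]:
    """Query a running Ollama server and return the names of its local models."""
    url = f"http://{host}:{port}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_names(json.load(response))
```

Call list_models with your own port, for example list_models(11437) when using the forwarding shown in the SSH section. The call raises urllib.error.URLError if the server is not running.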

5. Usage & Gameplay

You can run the game using the stratego command. If you use local LLMs, make sure your Ollama server is running first.

Basic Command:

stratego --p0 ollama:mistral:7b --p1 ollama:gemma3:1b --prompt base

Arguments:

  • --p0: Agent for Player 0 (e.g., ollama:mistral:7b or hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0).
  • --p1: Agent for Player 1.
  • --prompt: The prompt strategy to use.
  • --size: Board size (NxN). Default is 10.
# Example for a smaller board
stratego --p0 ollama:mistral:7b --p1 ollama:mistral:7b --size 6
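
The agent arguments combine a backend prefix with a model name that may itself contain colons (for example ollama:mistral:7b). The sketch below shows one way such a value could be split. It assumes the prefix always ends at the first colon, and parse_agent_spec is a hypothetical helper, not the parser used by the stratego command.

```python
from typing import Tuple


def parse_agent_spec(spec: str) -> Tuple[str, str]:
    """Split '<backend>:<model>' at the first colon only (hypothetical helper).

    The model part keeps any further colons, e.g. the ':7b' tag of an Ollama model.
    """
    backend, separator, model = spec.partition(":")
    if not separator or not backend or not model:
        raise ValueError(f"expected '<backend>:<model>', got {spec!r}")
    return backend, model


if __name__ == "__main__":
    for example in ("ollama:mistral:7b", "hf:TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        print(example, "->", parse_agent_spec(example))
```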

6. Dataset & Prompt Optimization

The system includes automated logging and prompt improvement mechanisms implemented in main.py.

Logging (GameLogger)

Every move and prompt, along with its metadata, is saved to CSV logs. The cli() function initializes a GameLogger:

  • Log Directory: Controlled via --log-dir (default: logs).
  • Initial Prompt Logging: The logger captures the exact initial prompt used by the agent to ensure reproducibility.

Automated Prompt Improvement

The system automatically attempts to improve the System Prompt based on gameplay data.

  • Mechanism: The runner checks the number of CSV logs in the log directory.
  • Trigger: Every 3 games, improve_prompt() is called.
  • Logic: It analyzes recent games and updates stratego/prompts/current_prompt.txt.
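
The trigger described above can be sketched as follows. This is an illustration under stated assumptions, not the code in main.py: it assumes one CSV file per game in the log directory, and the helper names count_csv_logs and should_improve are hypothetical.

```python
from pathlib import Path


def count_csv_logs(log_dir: str = "logs") -> int:
    """Count the CSV game logs in the log directory (assumes one file per game)."""
    return sum(1 for _ in Path(log_dir).glob("*.csv"))


def should_improve(n_logs: int, every: int = 3) -> bool:
    """Return True when the number of logged games is a positive multiple of `every`."""
    return n_logs > 0 and n_logs % every == 0


if __name__ == "__main__":
    n_logs = count_csv_logs("logs")
    if should_improve(n_logs):
        # In the real runner, this is where improve_prompt() would be called.
        print(f"{n_logs} games logged: prompt improvement would run now")
```

Counting files rather than keeping a separate counter means the trigger survives restarts, at the cost of depending on the contents of the log directory.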

Hugging Face Dataset Integration

To upload your game logs to Hugging Face:

  1. Install Libraries:
pip install huggingface huggingface_hub datasets
  2. Configuration:
  • Join your HF Organization.
  • Update ./datasets/uploader.py with your repository name.
  3. Authentication:
  • Create a WRITE token in your Hugging Face settings.
  • Run hf auth login in your terminal and paste the token (do not save it as a Git credential).
  4. Upload: Once authenticated, use the uploader script to push logs to the dataset repository.

7. Benchmarking

Use the built-in benchmark tool to evaluate model performance over multiple games.

Command:

benchmark --p0 {model_A} --p1 {model_B} --size {N} --game {count}

Example:

benchmark --p0 llama3.2:1b --p1 gemma3:1b --size 6 --game 4

8. GUI

After installing the package, run the gui command to play the game in a graphical user interface.

9. Scoring

Because each model played a different number of games, raw win counts are not directly comparable, so we defined a composite score. The score equation is based on the following factors: win/draw rate, number of games that reached the turn limit, total number of games played, win rate when playing as Player 1, and number of losses caused by invalid moves.

  • Score based on the win/draw rate across board sizes. $$S^{size}_m = \frac{\sum_{s\in\{4,5,6\}} w_s (W_{m,s}+0.5D_{m,s})}{\sum_{s\in\{4,5,6\}} w_s N_{m,s}}$$ Here $m$ is the model, $o$ is an opponent, $w_s$ is the weight for board size $s$, $N_{m,s}$ is the total number of games that model $m$ played on board size $s$, $W_{m,s}$ is the number of wins of model $m$ on board size $s$, and $D_{m,s}$ is the number of draws of model $m$ on board size $s$.
  • Score based on the number of games that reached the turn limit. $$S^{speed}_m = 1-\frac{EndedByTurnLimit_m}{N_m}$$ $N_m$ is the total number of games played by model $m$.
  • Deduction based on the number of games lost by invalid moves. $$R^{inv}_m = \frac{LostByInvalid_m}{N_m}$$
  • Bonus for playing as Player 1. First, compute the model's win rate as Player 1 against an opponent. $$p1(m \rightarrow o)=\frac{W^{P1}_{m,o}+0.5D_{m,o}}{G_{m,o}}$$ $G_{m,o}$ is the total number of games played between model $m$ as Player 1 and opponent $o$ as Player 2.
    • Now, compute the opponent's Player 1 win rate from the same set. $$p1(o \rightarrow m)=\frac{W^{P1}_{o,m}+0.5D_{o,m}}{G_{o,m}}$$ $G_{o,m}$ is the total number of games played between opponent $o$ as Player 1 and model $m$ as Player 2.
    • Next, compute the win rate of model $m$ when it plays as Player 2 against the same opponent $o$, that is, when $o$ plays as Player 1. $$p2(m \leftarrow o)=\frac{W^{P2}_{o,m}+0.5D_{o,m}}{G_{o,m}}$$
    • Compute the difference between the Player 1 win rates of model $m$ and opponent $o$. $$edge_1(m,o) = p1(m\rightarrow o)-p1(o\rightarrow m)$$
    • Next, compare the model's win rate as Player 1 with its win rate as Player 2. $$edge_2(m,o) = p1(m\rightarrow o)-p2(m\leftarrow o)$$ $edge_2(m,o)$ is usually negative, since the models have a higher win rate when they play as Player 2. In other words, results are biased toward Player 2.
    • Now compute the total bonus for the pairing. $$bonus_{pair}(m,o) = \max(0,edge_1(m,o))+\max(0,edge_2(m,o))$$
    • Weight each pairing by the smaller of its two game counts. $$w_{m,o}=\min(G_{m,o},G_{o,m})$$
    • Then compute the raw bonus by repeating the steps above for all opponents and taking the weighted average. $$B^{raw}_m = \frac{\sum_o w_{m,o}\,bonus_{pair}(m,o)}{\sum_o w_{m,o}}$$
    • Compute the final bonus by dividing the raw bonus by 0.2 and clamping the result to the range from 0 to 1, so a raw bonus of 0.2 or more earns the maximum of 1. $$B_m=clip(\frac{B^{raw}_{m}}{0.2},0,1)$$
  • Reliability coefficient using the Bühlmann credibility constant $k$. We set $k$ to 300 so that a model with only 100 games played gets $C_m = 0.5$, meaning it is treated as half reliable. $$C_m=\sqrt{\frac{N_m}{N_m+k}}$$
  • To compute the final score, we set the weights as follows:
    $a=0.60$ reward for the win/draw rate across board sizes
    $b=0.15$ reward for finishing games before the turn limit
    $c=0.25$ deduction for losses by invalid moves
    $d=0.20$ reward for winning as Player 1
    $w_4=0.75$ weight for the 4 x 4 board
    $w_5=1.00$ weight for the 5 x 5 board
    $w_6=1.35$ weight for the 6 x 6 board
    Credibility $k = 300$
  • Combining all scores and reductions, here is the score equation for our final score of each model:

$$Score_m=100\cdot C_m\cdot (aS^{size}_m+bS^{speed}_m-cR^{inv}_m+dB_m)$$
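
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the project's scoring script: the function names and the shape of the per_size dictionary are assumptions, while the weights and $k$ are the values listed above.

```python
import math
from typing import Dict

# Weights and constants as listed in the Scoring section.
WEIGHTS = {"a": 0.60, "b": 0.15, "c": 0.25, "d": 0.20}
BOARD_WEIGHTS = {4: 0.75, 5: 1.00, 6: 1.35}
K = 300


def size_score(per_size: Dict[int, Dict[str, int]]) -> float:
    """S^size_m. per_size maps board size -> {"wins", "draws", "games"} (assumed layout)."""
    numerator = sum(
        BOARD_WEIGHTS[s] * (r["wins"] + 0.5 * r["draws"]) for s, r in per_size.items()
    )
    denominator = sum(BOARD_WEIGHTS[s] * r["games"] for s, r in per_size.items())
    return numerator / denominator if denominator else 0.0


def pair_bonus(p1_m_vs_o: float, p1_o_vs_m: float, p2_m_vs_o: float) -> float:
    """bonus_pair(m, o): only positive edges contribute."""
    edge_1 = p1_m_vs_o - p1_o_vs_m
    edge_2 = p1_m_vs_o - p2_m_vs_o
    return max(0.0, edge_1) + max(0.0, edge_2)


def clipped_bonus(raw_bonus: float, scale: float = 0.2) -> float:
    """B_m: raw bonus divided by 0.2, clamped to [0, 1]."""
    return min(1.0, max(0.0, raw_bonus / scale))


def credibility(n_games: int, k: float = K) -> float:
    """C_m = sqrt(N_m / (N_m + k))."""
    return math.sqrt(n_games / (n_games + k))


def final_score(s_size: float, s_speed: float, r_inv: float, bonus: float, n_games: int) -> float:
    """Score_m = 100 * C_m * (a*S^size + b*S^speed - c*R^inv + d*B)."""
    w = WEIGHTS
    combined = w["a"] * s_size + w["b"] * s_speed - w["c"] * r_inv + w["d"] * bonus
    return 100 * credibility(n_games) * combined
```

As a worked check with made-up inputs: a model with 100 games has $C_m = \sqrt{100/400} = 0.5$, so with $S^{size}_m = 0.5$, $S^{speed}_m = 0.8$, $R^{inv}_m = 0.1$ and $B_m = 0.5$ the score is $100 \cdot 0.5 \cdot (0.30 + 0.12 - 0.025 + 0.10) = 24.75$.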
