## gpt-3.5-turbo-instruct vs State-of-the-Art Reasoning Models

This notebook plays chess games between GPT 3.5 Turbo Instruct (released in 2023) and three state-of-the-art reasoning models released in 2025 (Grok 4, OpenAI o3, and Gemini 2.5 Pro). Each reasoning model plays two games against GPT 3.5: one with GPT 3.5 as white and one with GPT 3.5 as black.

Structured outputs were not enforced; instead, the reasoning models were instructed to return only the selected move as the final response. An illegal move or malformed response counts as an instant loss.
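
The "malformed response" half of that rule can be sketched with a regular expression that accepts only a single SAN-shaped token. This is a minimal illustration, not the check used by `llm_chess`: the names `SAN_PATTERN` and `is_well_formed_move` are hypothetical, and tolerating surrounding whitespace is an assumption. It checks syntax only, so a move can pass here and still be illegal on the board.

```python
import re

# Assumed SAN shape (hypothetical pattern, not taken from llm_chess):
#   - castling (O-O or O-O-O), or
#   - a piece move: piece letter, optional file/rank disambiguation,
#     optional capture marker, target square, or
#   - a pawn move: optional "<file>x" capture prefix, target square,
#     optional promotion,
# followed by an optional check (+) or mate (#) suffix.
SAN_PATTERN = re.compile(
    r"(?:O-O(?:-O)?"
    r"|[KQRBN][a-h]?[1-8]?x?[a-h][1-8]"
    r"|(?:[a-h]x)?[a-h][1-8](?:=[QRBN])?)"
    r"[+#]?"
)


def is_well_formed_move(response: str) -> bool:
    """Return True if the raw response is exactly one SAN-shaped token.

    Only the syntax is checked; legality depends on the board position.
    """
    return SAN_PATTERN.fullmatch(response.strip()) is not None


print(is_well_formed_move("Nbd7"))                   # syntactically valid
print(is_well_formed_move("The best move is e4."))   # extra text, rejected
```

Note that `Re4` and `Qh7`, the two moves that ended games in this notebook, are well-formed SAN; they lost because they were illegal in the position, which a syntax check alone cannot detect.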

### Results

GPT 3.5 Turbo Instruct wins all six games:
```text
1. GPT 3.5 vs Grok 4         : GPT 3.5 wins by checkmate
2. Grok 4 vs GPT 3.5         : Grok 4 loses by illegal move 'Re4' (GPT 3.5 has a 1240 centipawn advantage)
3. GPT 3.5 vs OpenAI o3      : OpenAI o3 loses by illegal move 'Qh7' (GPT 3.5 has mate in one)
4. OpenAI o3 vs GPT 3.5      : GPT 3.5 wins by checkmate
5. GPT 3.5 vs Gemini 2.5 Pro : GPT 3.5 wins by checkmate
6. Gemini 2.5 Pro vs GPT 3.5 : GPT 3.5 wins by checkmate
```

### Other observations
- Runtimes:
  - Grok 4 is extremely slow, with game durations of 4h40m and 5h28m. I'm not sure whether this is caused by rate limiting, slow token generation, excessive yapping, or something else.
  - OpenAI o3's game durations were 1h39m and 2h44m.
  - Gemini 2.5 Pro is extremely fast, with game durations of 28s and 26s. This suggests it did little to no reasoning, possibly because it misinterpreted the prompt instruction: "RESPOND DIRECTLY WITH ONLY THE MOVE, no other text."
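
To put the runtimes listed above on a common scale, the reported durations can be parsed and summed per model. This is a rough sketch using only the standard library; `parse_duration` is a hypothetical helper, and the inputs are the durations quoted in the list, not values read from the game logs.

```python
import re
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '4h40m' or '28s' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not text:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Game durations as reported in the observations above.
durations = {
    "Grok 4": ["4h40m", "5h28m"],
    "OpenAI o3": ["1h39m", "2h44m"],
    "Gemini 2.5 Pro": ["28s", "26s"],
}

# Total wall-clock time across both games for each model.
totals = {
    name: sum((parse_duration(d) for d in games), timedelta())
    for name, games in durations.items()
}

for name, total in totals.items():
    print(f"{name:<15} total: {total}")
# Grok 4          total: 10:08:00
# OpenAI o3       total: 4:23:00
# Gemini 2.5 Pro  total: 0:00:54
```

These totals measure whole games, so they also depend on how many moves each game lasted; they are not a per-move latency comparison.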

In [None]:
%load_ext autoreload
%autoreload 2

from datetime import datetime
from pathlib import Path

import chess
from tqdm.notebook import tqdm

from llm_chess.core.enums import APIResponseFormat
from llm_chess.core.game_manager import GameManager
from llm_chess.players.llm.openai import OpenAIPlayer
from llm_chess.players.llm.openai_instruct import GPT3p5TurboInstructPlayer
from llm_chess.players.llm.gemini import GeminiPlayer
from llm_chess.players.llm.grok import GrokPlayer
from llm_chess.prompts.pgn import PGNPromptConfig
from llm_chess.utils.write import write_board_to_pgn_file

In [None]:
# The project root is the parent of the current working directory.
PROJECT_ROOT = Path.cwd().parent
LOG_DIR = PROJECT_ROOT / "game_logs" / "gpt_3p5_turbo_instruct_vs_sota_reasoning_models"

pgn_prompt_config = PGNPromptConfig(
    api_response_format=APIResponseFormat.TEXT,
    move_response_has_leading_space=False,
    prompt_prefix="Using SAN notation, respond with the next best chess move for the PGN game below. RESPOND DIRECTLY WITH ONLY THE MOVE, no other text.",
)

gpt = GPT3p5TurboInstructPlayer()

grok = GrokPlayer(
    name="Grok 4",
    model="grok-4-0709",
    prompt_config=pgn_prompt_config,
)

o3 = OpenAIPlayer(
    name="OpenAI o3",
    model="o3-2025-04-16",
    prompt_config=pgn_prompt_config,
    temperature=1.0,  # o3 only supports temperature=1
)

gemini = GeminiPlayer(
    name="Gemini Pro",
    model="gemini-2.5-pro",
    prompt_config=pgn_prompt_config,
)

In [None]:
games = [
    (gpt, grok),
    (grok, gpt),
    (gpt, o3),
    (o3, gpt),
    (gpt, gemini),
    (gemini, gpt),
]

In [None]:
for i, (white, black) in tqdm(enumerate(games, start=1), total=len(games), desc="Playing games..."):
    print(f"\nGame {i}/{len(games)}")

    # Play the game
    manager = GameManager()
    board = chess.Board()
    board, result = manager.play_game(
        white,
        black,
        board=board,
        displayer=None,
        print_move=True,
        sleep_time=0.2,
    )

    # Log the game
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    file_name = f"{timestamp}_game_{i:05d}.pgn"
    write_board_to_pgn_file(
        board=board,
        write_dir=LOG_DIR,
        file_name=file_name,
        white_name=white.name,
        black_name=black.name,
        result=result,
    )

    if result == "1-0":
        print(f"  {white.name} won as white.")
    elif result == "0-1":
        print(f"  {black.name} won as black.")
    elif result == "1/2-1/2":
        print(f"  The game was a draw ({white.name} as white).")
    else:
        print(f"  Game result: {result} ({white.name} as white).")

Game 1/6
    GPT-3.5 Turbo Instruct plays: e4
    Grok 4                 plays: c5
    GPT-3.5 Turbo Instruct plays: Nf3
    Grok 4                 plays: d6
    GPT-3.5 Turbo Instruct plays: d4
    Grok 4                 plays: cxd4
    GPT-3.5 Turbo Instruct plays: Nxd4
    Grok 4                 plays: Nf6
    GPT-3.5 Turbo Instruct plays: Nc3
    Grok 4                 plays: a6
    GPT-3.5 Turbo Instruct plays: Bg5
    Grok 4                 plays: e6
    GPT-3.5 Turbo Instruct plays: f4
    Grok 4                 plays: Be7
    GPT-3.5 Turbo Instruct plays: Qf3
    Grok 4                 plays: Qc7
    GPT-3.5 Turbo Instruct plays: O-O-O
    Grok 4                 plays: Nbd7
    GPT-3.5 Turbo Instruct plays: g4
    Grok 4                 plays: b5
    GPT-3.5 Turbo Instruct plays: Bxf6
    Grok 4                 plays: Nxf6
    GPT-3.5 Turbo Instruct plays: g5
    Grok 4                 plays: Nd7
    GPT-3.5 Turbo Instruct plays: f5
    Grok 4                 plays: Nc5
    GPT