# A Knucklebones AI

Knucklebones is a highly random dice game developed by the studio [Massive Monster](https://massivemonster.co) and available to play online [here](https://knucklebones.io). There you can also find a summary of the rules; a more detailed explanation is given on the [Fandom Wiki](https://cult-of-the-lamb.fandom.com/wiki/Knucklebones). The goal of this project was to train an AI for Knucklebones through self-play using the reinforcement learning library [Stable-Baselines3](https://stable-baselines3.readthedocs.io).

In [9]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from typing import Optional, Tuple

We start by modelling the game according to the Gymnasium API, the maintained successor of OpenAI's Gym standard for reinforcement learning. The RL model outputs a ranking of the three columns, encoded as one of the 6 permutations, and the game places the die in the highest-ranked column that is still valid. This prevents the AI from taking invalid actions, such as placing a die in a column that is already full.

In [10]:
class KnucklebonesEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.board = np.zeros((2, 3, 3), dtype=int)
        self.current_player = 0
        self.current_dice = None

        # action indicates preferences between three columns encoded as permutation
        self.action_space = spaces.Discrete(6)
        self.observation_space = spaces.Box(low=0, high=6, shape=(19,), dtype=np.int32)

        # Define the permutations
        self.permutations = [
            [0, 1, 2],
            [0, 2, 1],
            [1, 0, 2],
            [1, 2, 0],
            [2, 0, 1],
            [2, 1, 0]
        ]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.board = np.zeros((2, 3, 3), dtype=int)
        self.current_player = 0
        self.current_dice = self.roll_dice()
        return self.get_observation(), {}

    def step(self, action):
        # Convert the action to a permutation
        column_order = self.permutations[action]

        for column in column_order:
            if self.is_valid_action(column):
                self.place_dice(self.current_player, column, self.current_dice)
                self.remove_opponent_dice(1 - self.current_player, column, self.current_dice)
                break

        terminated = self.is_game_over()
        reward = self.calculate_score(self.current_player)
        
        self.current_player = 1 - self.current_player
        self.current_dice = self.roll_dice()

        return self.get_observation(), float(reward), bool(terminated), False, {}

    def roll_dice(self):
        return self.np_random.integers(1, 7)

    def is_valid_action(self, action):
        return 0 <= action < 3 and np.any(self.board[self.current_player, action] == 0)

    def place_dice(self, player, column, value):
        empty_spots = np.where(self.board[player, column] == 0)[0]
        if empty_spots.size > 0:
            self.board[player, column, empty_spots[0]] = value

    def remove_opponent_dice(self, opponent, column, value):
        self.board[opponent, column] = np.where(self.board[opponent, column] == value, 0, self.board[opponent, column])
        self.board[opponent, column] = np.sort(self.board[opponent, column])[::-1]

    def is_game_over(self):
        return np.all(self.board[0] != 0) or np.all(self.board[1] != 0)

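    def calculate_score(self, player):
        score = 0
        for column in self.board[player]:
            unique, counts = np.unique(column[column != 0], return_counts=True)
            for value, count in zip(unique, counts):
                # Matching dice in a column multiply: each of the `count` dice
                # is worth value * count, so the group scores value * count**2
                score += value * count * count
        return int(score)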

    def get_observation(self):
        return np.concatenate([self.board.flatten(), [self.current_dice]]).astype(np.int32)

    def render(self):
        for player in range(2):
            print(f"Player {player + 1}:")
            for row in range(3):
                print(" ".join(f"{self.board[player, col, row]:2d}" for col in range(3)))
            print()
        print(f"Current player: {self.current_player + 1}, Current dice: {self.current_dice}")

    def get_current_player(self):
        return self.current_player
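
The fallback behaviour of the permutation encoding can be illustrated in isolation. The following is a minimal sketch, not part of the environment: `resolve_column` and the `column_is_full` list are hypothetical names introduced here, and the permutation table is copied from the class above.

```python
# Hypothetical standalone helper mirroring the loop in KnucklebonesEnv.step.
PERMUTATIONS = [
    [0, 1, 2],
    [0, 2, 1],
    [1, 0, 2],
    [1, 2, 0],
    [2, 0, 1],
    [2, 1, 0],
]

def resolve_column(action, column_is_full):
    """Return the first non-full column in the preference order, or None."""
    for column in PERMUTATIONS[action]:
        if not column_is_full[column]:
            return column
    return None  # only reachable when every column is full

# Action 5 ranks the columns as 2, then 1, then 0.
print(resolve_column(5, [False, False, False]))  # column 2 is free
print(resolve_column(5, [False, False, True]))   # column 2 is full, falls back to 1
print(resolve_column(5, [False, True, True]))    # only column 0 is left
```

Because every action maps to some valid column as long as the board is not full, the policy never needs to learn which moves are illegal.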

The model will be trained using self-play, that is, by playing against itself across thousands of games. Since Stable-Baselines3 has no native support for multi-agent environments, we wrap the game environment in a second environment that holds a separate opponent model and plays the opponent's turn inside each step.

In [11]:
class KnucklebonesMultiEnv(KnucklebonesEnv):
    def __init__(self):
        super().__init__()
        self.player_1_obs = None
        self.player_2_obs = None

    def reset(self, seed=None, options=None):
        obs, info = super().reset(seed=seed, options=options)
        self.player_1_obs = obs
        self.player_2_obs = self._flip_observation(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)
        # Each player always sees its own board first, regardless of whose turn it is
        self.player_1_obs = obs
        self.player_2_obs = self._flip_observation(obs)
        return obs, reward, terminated, truncated, info

    def _flip_observation(self, obs):
        flipped_obs = obs.copy()
        flipped_obs[:9], flipped_obs[9:18] = obs[9:18], obs[:9]
        return flipped_obs

    def get_player_obs(self, player):
        return self.player_1_obs if player == 0 else self.player_2_obs

class SelfPlayEnv(gym.Env):
    def __init__(self, env):
        self.env = env
        self.opponent = None
        self.action_space = env.action_space
        self.observation_space = env.observation_space

    def reset(self, seed=None, options=None):
        return self.env.reset(seed=seed, options=options)

    def step(self, action):
        obs, reward, done, truncated, info = self.env.step(action)
        if not done:
            opponent_obs = self.env.get_player_obs(self.env.get_current_player())
            if self.opponent is not None:
                opponent_action, _ = self.opponent.predict(opponent_obs, deterministic=True)
            else:
                # No opponent model set yet: play a uniformly random opponent move
                opponent_action = self.action_space.sample()
            obs, reward_op, done, truncated, info = self.env.step(opponent_action)
            # reward as difference to opponent score
            reward -= reward_op
        return obs, reward, done, truncated, info

    def set_opponent(self, opponent):
        self.opponent = opponent

    def render(self):
        return self.env.render()

env = KnucklebonesMultiEnv()
# check that environment is compatible with the OpenAI gym standard
check_env(env)
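
To make the perspective swap concrete, here is a small sketch of what flipping the 19-entry observation does, written with plain Python lists instead of NumPy arrays. The function name `flip_observation` and the example boards are illustrative assumptions; the layout (9 entries for the first board, 9 for the second, then the current die) follows `get_observation` above.

```python
def flip_observation(obs):
    """Swap the two 9-entry boards and keep the die (last entry) in place."""
    assert len(obs) == 19
    return obs[9:18] + obs[:9] + obs[18:]

# Example boards, flattened column by column (three slots per column).
board_a = [6, 0, 0,  4, 0, 0,  0, 0, 0]
board_b = [5, 4, 0,  0, 0, 0,  0, 0, 0]
current_die = 3

obs = board_a + board_b + [current_die]
flipped = flip_observation(obs)

print(flipped[:9])    # board_b now comes first
print(flipped[9:18])  # board_a comes second
print(flipped[18])    # the die is unchanged
```

With this convention a single policy can play either side, since it always finds its own board in the first nine entries.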

As the reward we use the difference between the model player's score after its move and the opponent's score after the opponent's reply. The aim is to encourage the model to maximize its own score while also keeping the opponent's score low. Now we need a function to evaluate two models playing against each other:

In [12]:
def play_example_game(env, model_1=None, model_2=None, render: bool = True):
    obs, _ = env.reset()
    terminated = False
    max_steps = 100
    player_scores = [0, 0]

    for _ in range(max_steps):
        if render:
            env.render()

        current_player = env.get_current_player()
        current_obs = env.get_player_obs(current_player)

        if current_player == 0:
            action = model_1.predict(current_obs, deterministic=True)[0] if model_1 else env.action_space.sample()
        else:
            action = model_2.predict(current_obs, deterministic=True)[0] if model_2 else env.action_space.sample()

        obs, reward, terminated, _, _ = env.step(action)
        # Recompute both scores: a move can also remove dice from the opponent's board
        player_scores = [env.calculate_score(0), env.calculate_score(1)]
        
        if render:
            print(f"Player {current_player + 1} Action: {action}")
            print(f"Action permutation: {env.permutations[action]}")
            print(f"Player 1 Score: {player_scores[0]}, Player 2 Score: {player_scores[1]}")
            print("-" * 40)
        
        if terminated:
            break

    if render:
        env.render()
        print(f"Game Over. Final Scores - Player 1: {player_scores[0]}, Player 2: {player_scores[1]}")
    
    return player_scores[0] - player_scores[1]  # Return the score difference

def evaluate_models(env, model_1, model_2, n_games: int = 1000):
    model_1_wins = 0
    model_2_wins = 0
    draws = 0

    for _ in range(n_games):
        reward = play_example_game(env, model_1, model_2, render=False)
        if reward > 0:
            model_1_wins += 1
        elif reward < 0:
            model_2_wins += 1
        else:
            draws += 1

    print(f"Model 1 wins: {model_1_wins}")
    print(f"Model 2 wins: {model_2_wins}")
    print(f"Draws: {draws}")
    print("")
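
The reward described earlier, own score minus opponent score, can be checked by hand on a small position. The sketch below assumes the multiplier rule of the original game (each die counts its face value times the number of equal dice in its column); `column_score`, `board_score` and `reward` are hypothetical helper names and not part of the environment.

```python
from collections import Counter

def column_score(column):
    """Score one column: each die counts face value times the number of equal dice."""
    counts = Counter(die for die in column if die != 0)
    return sum(value * count * count for value, count in counts.items())

def board_score(board):
    """Sum the scores of the three columns of one player's board."""
    return sum(column_score(column) for column in board)

def reward(own_board, opponent_board):
    """Score difference from the point of view of the player owning own_board."""
    return board_score(own_board) - board_score(opponent_board)

own = [[4, 4, 0], [6, 0, 0], [0, 0, 0]]       # 4*2*2 + 6 = 22
opponent = [[5, 0, 0], [3, 3, 3], [1, 0, 0]]  # 5 + 3*3*3 + 1 = 33

print(board_score(own), board_score(opponent), reward(own, opponent))
```

In this position the opponent's triple of 3s outweighs the pair of 4s, so the reward is negative even though both players have placed high dice.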

To check that our game implementation is working correctly, we can have a look at an example game between two players taking uniformly random actions:

In [13]:
play_example_game(env)

Player 1:
 0  0  0
 0  0  0
 0  0  0

Player 2:
 0  0  0
 0  0  0
 0  0  0

Current player: 1, Current dice: 6
Player 1 Action: 5
Action permutation: [2, 1, 0]
Player 1 Score: 6, Player 2 Score: 0
----------------------------------------
Player 1:
 0  0  6
 0  0  0
 0  0  0

Player 2:
 0  0  0
 0  0  0
 0  0  0

Current player: 2, Current dice: 5
Player 2 Action: 1
Action permutation: [0, 2, 1]
Player 1 Score: 6, Player 2 Score: 5
----------------------------------------
Player 1:
 0  0  6
 0  0  0
 0  0  0

Player 2:
 5  0  0
 0  0  0
 0  0  0

Current player: 1, Current dice: 4
Player 1 Action: 2
Action permutation: [1, 0, 2]
Player 1 Score: 10, Player 2 Score: 5
----------------------------------------
Player 1:
 0  4  6
 0  0  0
 0  0  0

Player 2:
 5  0  0
 0  0  0
 0  0  0

Current player: 2, Current dice: 4
Player 2 Action: 0
Action permutation: [0, 1, 2]
Player 1 Score: 10, Player 2 Score: 9
----------------------------------------
Player 1:
 0  4  6
 0  0  0
 0  0  0

Player 2

1

The model will be trained with proximal policy optimization (PPO); a detailed explanation of the mathematical background is given in this [paper](https://arxiv.org/pdf/1707.06347).

In [14]:
def create_model(env, model_path: Optional[str] = None) -> PPO:
    if model_path:
        return PPO.load(model_path, env=env)
    return PPO("MlpPolicy", env, verbose=0)
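
For orientation, the central quantity of PPO is the clipped surrogate objective from the paper linked above. It is reproduced here as a brief reminder rather than a full derivation; the notation follows the paper, with $\hat{A}_t$ an advantage estimate and $\epsilon$ the clipping parameter.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right]
```

The clipping removes the incentive to move the probability ratio $r_t(\theta)$ far from 1 in a single update. In Stable-Baselines3 the parameter $\epsilon$ corresponds to the `clip_range` argument of `PPO`, which is left at its default in `create_model`.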

Now the self-play needs to be implemented: the model is trained over multiple iterations, each time playing against the model from the previous iteration.

In [15]:
def self_play_training(env, total_timesteps: int = 10**5) -> PPO:
    self_play_env = SelfPlayEnv(env)
    model = create_model(DummyVecEnv([lambda: self_play_env]))
    opponent = None

    for i in range(6):  # 6 iterations of self-play
        print(f"Self-play iteration {i+1}")
        
        # Set opponent (random for first iteration, previous model for subsequent iterations)
        self_play_env.set_opponent(opponent)
        
        # Train model against opponent
        model.learn(total_timesteps=total_timesteps)

        # Evaluate model against initial opponent
        print("\nEvaluating trained model against initial player:")
        evaluate_models(env, model, None)

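        # Freeze a copy of the current policy as the next opponent; reusing the
        # live `model` object would let the opponent change during training.
        # The snapshot file name is an arbitrary choice.
        model.save("knucklebones_opponent")
        opponent = PPO.load("knucklebones_opponent")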

    return model

Finally, the actual training process can begin:

In [16]:
model = self_play_training(env)
model.save("knucklebones_model")

Self-play iteration 1

Evaluating trained model against initial player:
Model 1 wins: 508
Model 2 wins: 462
Draws: 30

Self-play iteration 2

Evaluating trained model against initial player:
Model 1 wins: 620
Model 2 wins: 340
Draws: 40

Self-play iteration 3

Evaluating trained model against initial player:
Model 1 wins: 592
Model 2 wins: 379
Draws: 29

Self-play iteration 4

Evaluating trained model against initial player:
Model 1 wins: 623
Model 2 wins: 343
Draws: 34

Self-play iteration 5

Evaluating trained model against initial player:
Model 1 wins: 600
Model 2 wins: 371
Draws: 29

Self-play iteration 6

Evaluating trained model against initial player:
Model 1 wins: 595
Model 2 wins: 370
Draws: 35
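
To judge how much of the variation between iterations is noise, the win counts above can be converted into win rates with a rough 95% interval. This is a minimal sketch using the normal approximation for a binomial proportion; the helper `win_rate_interval` is a name introduced here, draws are counted as non-wins, and the counts are copied from the output above.

```python
import math

# Wins of the trained model out of 1000 games per iteration (from the output above).
wins = [508, 620, 592, 623, 600, 595]
n_games = 1000

def win_rate_interval(win_count, n, z=1.96):
    """Return the win rate and the half-width of a normal-approximation interval."""
    p = win_count / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

for iteration, win_count in enumerate(wins, start=1):
    p, half_width = win_rate_interval(win_count, n_games)
    print(f"iteration {iteration}: win rate {p:.3f} +/- {half_width:.3f}")
```

With 1000 games the half-width is roughly three percentage points, so the jump from iteration 1 to iteration 2 stands out clearly, while the differences among iterations 2 to 6 are of the same size as the sampling noise.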



We observe that the performance increases in the beginning but quickly levels off. This is not unexpected for a game that depends heavily on dice rolls and therefore allows only a small advantage from strategy. Let's watch the trained model play an example game:

In [17]:
play_example_game(env, model)

Player 1:
 0  0  0
 0  0  0
 0  0  0

Player 2:
 0  0  0
 0  0  0
 0  0  0

Current player: 1, Current dice: 6
Player 1 Action: 4
Action permutation: [2, 0, 1]
Player 1 Score: 6, Player 2 Score: 0
----------------------------------------
Player 1:
 0  0  6
 0  0  0
 0  0  0

Player 2:
 0  0  0
 0  0  0
 0  0  0

Current player: 2, Current dice: 6
Player 2 Action: 3
Action permutation: [1, 2, 0]
Player 1 Score: 6, Player 2 Score: 6
----------------------------------------
Player 1:
 0  0  6
 0  0  0
 0  0  0

Player 2:
 0  6  0
 0  0  0
 0  0  0

Current player: 1, Current dice: 2
Player 1 Action: 2
Action permutation: [1, 0, 2]
Player 1 Score: 8, Player 2 Score: 6
----------------------------------------
Player 1:
 0  2  6
 0  0  0
 0  0  0

Player 2:
 0  6  0
 0  0  0
 0  0  0

Current player: 2, Current dice: 3
Player 2 Action: 2
Action permutation: [1, 0, 2]
Player 1 Score: 8, Player 2 Score: 9
----------------------------------------
Player 1:
 0  2  6
 0  0  0
 0  0  0

Player 2:


2

(c) Mia Müßig