In [None]:
import numpy as np
from nim_env import NimEnv, OptimalPlayer

# Nim environment

Our 2nd game is the famous game of Nim. You can read about the game and its rules here: https://en.wikipedia.org/wiki/Nim

**Important note:** We consider the normal (i.e. not misere) game: the player taking the last object *wins*.

We implemented the game as an environment in the style of games in the [Python GYM library](https://gym.openai.com/). The commented source code is available in the file "nim_env.py". Here, we give a brief introduction to the environment and how it can be used.

### Initialization and attributes

Given a random seed, you can initialize the environment / game in a random state (3 heaps with between 1 and 7 sticks in each heap) as follows:

In [None]:
env = NimEnv(seed = 3)

The environment then has the following attributes, shown here with their initial values:

In [None]:
env.__dict__

The game is played by two players: player 0 and player 1. The attribute 'current_player' shows whose turn it is. We assume that player 0 always plays first.

The attribute 'heaps' is a numpy array of size 3. It represents the board in the real game and the state $s_t$ in reinforcement learning terms. Each entry gives the number of sticks available in the corresponding heap. The attribute 'heap_avail' shows which heaps can be used.
        
The attribute 'end' shows if the game is over or not, and the attribute 'winner' shows the winner of the game.

You can use the function 'render' to visualize the current state of the board:

In [None]:
env.render()

### Taking actions

The game environment receives actions from the two players in turn and updates the heaps. At each time step, the current player takes an action $a_t$, given as a vector of 2 integers: action[0] $\in \{ 1,2,3 \}$ is the number of the heap and action[1] > 0 is the number of sticks to be taken from that heap.

The function 'step' receives the action of the player and updates the heaps:

In [None]:
env.step([1,2])

In [None]:
env.render()

In [None]:
env.__dict__

In [None]:
env.step([2,3])

In [None]:
env.render()

In [None]:
env.__dict__

But not all actions are available at each time: one cannot take more sticks from a heap than it currently holds. There is an error if an unavailable action is taken:

In [None]:
env.step([2,3])

Taking zero sticks is also not an available action:

In [None]:
env.step([2,0])
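
The availability rules above can be sketched as a small check. This is only an illustration: `is_valid_action` is a hypothetical helper, not a function of "nim_env.py", and the heap values are made up for the example rather than produced by `NimEnv`.

```python
def is_valid_action(heaps, action):
    """Return True if action = [heap_number, n_sticks] is available.

    Hypothetical helper for illustration; not part of nim_env.py.
    """
    heap_number, n_sticks = action
    # Heap numbers are 1-based, as in the text above.
    if not 1 <= heap_number <= len(heaps):
        return False
    # At least one stick must be taken, and no more than the heap holds.
    return bool(1 <= n_sticks <= heaps[heap_number - 1])


heaps = [0, 2, 6]  # illustrative state, assumed for this example
print(is_valid_action(heaps, [2, 3]))  # False: heap 2 holds only 2 sticks
print(is_valid_action(heaps, [2, 0]))  # False: zero sticks taken
print(is_valid_action(heaps, [3, 6]))  # True
```

A check like this can be useful when an agent proposes moves, so that unavailable actions are filtered out before they reach 'step'.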

### Reward

The reward is always 0 until the end of the game. When the game is over, the reward is 1 if you win the game and -1 if you lose. The function 'observe' can be used after each step to receive the new state $s_t$, whether the game is over, and the winner. The function 'reward' returns the reward value $r_t$:

In [None]:
env.observe()

In [None]:
env.reward(player=0)

In [None]:
env.reward(player=1)

An example of finishing the game:

In [None]:
print("Player = " + str(env.current_player))
env.step([2,2])
env.render()
print("Player = " + str(env.current_player))
env.step([3,6])
env.render()

In [None]:
env.observe()

In [None]:
env.reward(player=0)

In [None]:
env.reward(player=1)
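
The reward convention described in this section can be written down in a few lines. The function below is a hypothetical stand-alone sketch, assuming `end` and `winner` have the meaning of the values returned by 'observe'; the actual implementation of 'reward' in "nim_env.py" may differ.

```python
def reward_for(player, end, winner):
    """Reward for `player` under the convention described above.

    Hypothetical helper for illustration; not part of nim_env.py.
    """
    # No reward is given while the game is still running.
    if not end or winner is None:
        return 0
    # +1 for the winner, -1 for the loser.
    return 1 if winner == player else -1


print(reward_for(player=0, end=False, winner=None))  # 0
print(reward_for(player=0, end=True, winner=1))      # -1
print(reward_for(player=1, end=True, winner=1))      # 1
```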

# Optimal policy for the Nim environment

Fortunately, we know the exact optimal policy for Nim. We have implemented an $\epsilon$-greedy version of the optimal policy, which you can use for the project.
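
For intuition, the standard optimal strategy for normal-play Nim is based on the nim-sum, the bitwise XOR of all heap sizes: the player to move can force a win exactly when the nim-sum is nonzero, by moving to a position whose nim-sum is zero. The sketch below illustrates this rule with hypothetical helpers `nim_sum` and `optimal_move`; it is an assumption that `OptimalPlayer` follows the same idea, so check "nim_env.py" for the actual implementation.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """Bitwise XOR of all heap sizes."""
    return reduce(xor, heaps, 0)


def optimal_move(heaps):
    """Return [heap_number, n_sticks] (1-based heap number) that leaves
    a nim-sum of zero, or None if the nim-sum is already zero.

    Hypothetical helper for illustration; not part of nim_env.py.
    """
    total = nim_sum(heaps)
    if total == 0:
        return None  # every move leaves a nonzero nim-sum
    for i, h in enumerate(heaps):
        target = h ^ total  # heap size that would make the nim-sum zero
        if target < h:      # reachable only by removing sticks
            return [i + 1, h - target]


print(nim_sum([3, 4, 5]))       # 2
print(optimal_move([3, 4, 5]))  # [1, 2], leaving heaps [1, 4, 5]
print(optimal_move([1, 2, 3]))  # None
```

When the nim-sum is already zero, no winning move exists against a perfect opponent, and any available move can be played.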

In [None]:
env.reset(seed=6);

In [None]:
env.render()

In [None]:
env.__dict__

In [None]:
opt_player = OptimalPlayer(epsilon = 0., player = 0)

In [None]:
opt_player.act(env.heaps)

In [None]:
opt_player.player

### An example of optimal player playing against random player

In [None]:
Turns = np.array([0,1])
for i in range(5):
    env.reset()
    heaps, _, __ = env.observe()
    # Randomly assign who plays first (player 0) in this game
    Turns = Turns[np.random.permutation(2)]
    player_opt = OptimalPlayer(epsilon=0., player=Turns[0])  # always optimal
    player_rnd = OptimalPlayer(epsilon=1., player=Turns[1])  # always random
    while not env.end:
        # Ask the player whose turn it is for a move
        if env.current_player == player_opt.player:
            move = player_opt.act(heaps)
        else:
            move = player_rnd.act(heaps)

        heaps, end, winner = env.step(move)

        if end:
            print('-------------------------------------------')
            print('Game end, winner is player ' + str(winner))
            print('Optimal player = ' +  str(Turns[0]))
            print('Random player = ' +  str(Turns[1]))
            env.reset()
            break


### An example of optimal player playing against optimal player

In [None]:
Turns = np.array([0,1])
for i in range(5):
    env.reset()
    heaps, _, __ = env.observe()
    # Randomly assign who plays first (player 0) in this game
    Turns = Turns[np.random.permutation(2)]
    player_opt_1 = OptimalPlayer(epsilon=0., player=Turns[0])
    player_opt_2 = OptimalPlayer(epsilon=0., player=Turns[1])
    while not env.end:
        # Ask the player whose turn it is for a move
        if env.current_player == player_opt_1.player:
            move = player_opt_1.act(heaps)
        else:
            move = player_opt_2.act(heaps)

        heaps, end, winner = env.step(move)

        if end:
            print('-------------------------------------------')
            print('Game end, winner is player ' + str(winner))
            print('Optimal player 1 = ' +  str(Turns[0]))
            print('Optimal player 2 = ' +  str(Turns[1]))
            env.reset()
            break
