
Now that we have implemented the basic functionality of the game in Python, the next step is to wrap it as a `gym.Env` so that it can be used to train reinforcement learning models. As a starting point, we will follow the docs: https://www.gymlibrary.dev/content/environment_creation/.

They remind us to add the `metadata` attribute to specify the render-mode (`human`, `rgb_array` or `ansi`) and the framerate. Every environment should support the render-mode `None`, and you don't need to add it explicitly.

As we have already defined almost all of the environment, we don't need to add much to this class (we can inherit from the one we defined before), but we do have to explicitly define the attributes `self.observation_space` and `self.action_space`.

- `self.action_space`: Our agents can only choose the column in which they want to place the dice, so the action space is restricted to an integer between 0 and 2 (assuming the board has 3 columns; the upper bound could be derived from the board size directly).

- `self.observation_space`: What does an agent see? It makes sense to provide all the information available: its current board, the opponent's board and the dice it has to place. We can implement this easily with a `spaces.Dict`. The boards can be encoded as `spaces.Box` with `dtype=np.uint8`, so that they hold discrete values while keeping an array-like shape. A `spaces.MultiDiscrete` space, for example, should work very similarly.
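
To make the two spaces concrete, here is a dependency-free sketch of the checks they imply. It uses plain Python lists instead of `gym.spaces`, and the names `is_valid_action`, `is_valid_board` and `is_valid_observation` are hypothetical helpers, not part of the package. The 3x3 board and the six-sided dice are assumptions taken from the examples on this page.

```python
N_ROWS, N_COLS = 3, 3   # assumed board size
DICE_SIDES = 6          # assumed six-sided dice; 0 marks an empty cell


def is_valid_action(action):
    """An action is the index of the column where the dice is placed."""
    return isinstance(action, int) and 0 <= action < N_COLS


def is_valid_board(board):
    """A board is N_ROWS x N_COLS with integer cells in [0, DICE_SIDES]."""
    return (
        len(board) == N_ROWS
        and all(len(row) == N_COLS for row in board)
        and all(isinstance(c, int) and 0 <= c <= DICE_SIDES
                for row in board for c in row)
    )


def is_valid_observation(obs):
    """An observation holds both boards and the dice to be placed."""
    return (
        set(obs) == {"agent", "opponent", "dice"}
        and is_valid_board(obs["agent"])
        and is_valid_board(obs["opponent"])
        and isinstance(obs["dice"], int)
        and 1 <= obs["dice"] <= DICE_SIDES
    )


empty = [[0] * N_COLS for _ in range(N_ROWS)]
obs = {"agent": empty, "opponent": empty, "dice": 4}
print(is_valid_observation(obs), is_valid_action(2), is_valid_action(3))
```

In the real class these checks would come for free from the `contains` method of the spaces; the sketch only spells out what "a number between 0 and 2" and "two boards plus a dice" mean in practice.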

In [1]:
#|output: asis
#| echo: false
show_doc(MatatenaEnv)

---

[source](https://github.com/Jorgvt/matatena/blob/main/matatena/gym.py#L18){target="_blank" style="float:right; font-size:smaller"}

### MatatenaEnv

>      MatatenaEnv (*args, **kwds)

`gym`-ready implementation of `Game`.

In [None]:
matatena = MatatenaEnv()
matatena

Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    

In [None]:
matatena.observation_space.sample()

OrderedDict([('agent',
              array([[6, 1, 3],
                     [6, 3, 6],
                     [3, 6, 5]], dtype=uint8)),
             ('dice', 2),
             ('opponent',
              array([[2, 1, 5],
                     [1, 5, 3],
                     [6, 2, 2]], dtype=uint8))])

In [None]:
matatena.action_space.sample()

0

# Reset

The `reset` method is called to start a new episode. It should also be called whenever the environment issues a `done` signal, in order to reset it. It must accept a `seed` parameter.

It is recommended to use the random generator that comes with inheriting from `gym.Env` (`self.np_random`), but we need to remember to call `super().reset(seed=seed)` to make sure that the environment is seeded correctly.

Finally, it must return a tuple of the initial observation and some auxiliary information (which will be `None` in our case).
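
The contract described above can be sketched without `gym` as follows. `ToyEnv` is a hypothetical stand-in, not the actual `MatatenaEnv`, and `random.Random` plays the role that `self.np_random` plays in a real `gym.Env`; the 3x3 boards and the six-sided dice are assumptions.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an environment; only `reset` is sketched."""

    def __init__(self):
        self.rng = random.Random()  # stand-in for gym's `self.np_random`

    def reset(self, seed=None, options=None):
        # Re-seed only when a seed is given, so unseeded resets stay random.
        if seed is not None:
            self.rng = random.Random(seed)
        # Two empty 3x3 boards, one per player.
        self.boards = [[[0] * 3 for _ in range(3)] for _ in range(2)]
        self.dice = self.rng.randint(1, 6)  # roll the first dice
        obs = {
            "agent": self.boards[0],
            "opponent": self.boards[1],
            "dice": self.dice,
        }
        info = None  # no auxiliary information, as in the text above
        return obs, info


env = ToyEnv()
first, _ = env.reset(seed=42)
second, _ = env.reset(seed=42)
print(first["dice"] == second["dice"])  # same seed, same first roll
```

Resetting twice with the same seed yields the same first roll, which is the property that seeding through `super().reset(seed=seed)` is meant to guarantee.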

In [2]:
#|output: asis
#| echo: false
show_doc(MatatenaEnv.reset)

---

[source](https://github.com/Jorgvt/matatena/blob/main/matatena/gym.py#L38){target="_blank" style="float:right; font-size:smaller"}

### MatatenaEnv.reset

>      MatatenaEnv.reset (seed:int=None, options=None)

Reinitializes the environment and returns the initial state.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| seed | int | None | Seed to control the RNG. |
| options | NoneType | None | Additional options. |

In [None]:
matatena = MatatenaEnv()
matatena

Player 1 (0.0) * | Player 2 (0.0)
[[0. 0. 0.]      | [[0. 0. 0.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  

In [None]:
matatena.reset()

({'agent': array([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]),
  'opponent': array([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]),
  'dice': 4},
 None)

In [None]:
matatena

Player 1 (0.0) * | Player 2 (0.0)
[[0. 0. 0.]      | [[0. 0. 0.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  

# Step

The `.step()` method contains the logic of the environment. It must accept an `action`, compute the state of the environment after applying that `action`, and return a 4-tuple: `(observation, reward, done, info)`.

> In our case, the `action` should be the column in which the agent wants to place the rolled dice.
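
As a rough illustration of that contract, the sketch below applies one action to a pair of plain-list boards and returns the 4-tuple. The function `step` and its arguments are hypothetical, and the rules it encodes (first free row in the column, matching dice removed from the opponent's column, reward equal to the placed value, game over when the acting board is full) are assumptions made for the example rather than the package's actual implementation.

```python
import random


def step(boards, current, dice, action, rng):
    """Place `dice` in column `action` of player `current` (0 or 1).

    Returns (observation, reward, done, info) from the acting player's view.
    """
    own, other = boards[current], boards[1 - current]
    free_rows = [r for r in range(3) if own[r][action] == 0]
    if not free_rows:
        # Full column: leave the state untouched and flag the move.
        obs = {"agent": own, "opponent": other, "dice": dice}
        return obs, 0.0, False, {"invalid": True}

    own[free_rows[0]][action] = dice
    # Assumed rule: matching dice in the opponent's column are removed.
    for r in range(3):
        if other[r][action] == dice:
            other[r][action] = 0

    # Assumed end condition: the acting player's board is full.
    done = all(cell != 0 for row in own for cell in row)
    next_dice = rng.randint(1, 6)  # roll for the next turn
    obs = {"agent": own, "opponent": other, "dice": next_dice}
    return obs, float(dice), done, {"invalid": False}


boards = [[[0] * 3 for _ in range(3)] for _ in range(2)]
boards[1][0][2] = 4  # the opponent already has a 4 in column 2
obs, reward, done, info = step(boards, current=0, dice=4, action=2,
                               rng=random.Random(0))
print(obs["agent"][0], obs["opponent"][0], reward, done)
```

Keeping the transition in a single function that mutates the boards and then builds the observation mirrors the order in which `.step()` is described: apply the action, compute the new state, return the tuple.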

In [3]:
#|output: asis
#| echo: false
show_doc(MatatenaEnv.step)

---

[source](https://github.com/Jorgvt/matatena/blob/main/matatena/gym.py#L63){target="_blank" style="float:right; font-size:smaller"}

### MatatenaEnv.step

>      MatatenaEnv.step (action)

Run one timestep of the environment's dynamics.

When end of episode is reached, you are responsible for calling :meth:`reset` to reset this environment's state.
Accepts an action and returns either a tuple `(observation, reward, terminated, truncated, info)`, or a tuple
(observation, reward, done, info). The latter is deprecated and will be removed in future versions.

Args:
    action (ActType): an action provided by the agent

Returns:
    observation (object): this will be an element of the environment's :attr:`observation_space`.
        This may, for instance, be a numpy array containing the positions and velocities of certain objects.
    reward (float): The amount of reward returned as a result of taking the action.
    terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached.
        In this case further step() calls could return undefined results.
    truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied.
        Typically a timelimit, but could also be used to indicate agent physically going out of bounds.
        Can be used to end the episode prematurely before a `terminal state` is reached.
    info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging).
        This might, for instance, contain: metrics that describe the agent's performance state, variables that are
        hidden from observations, or individual reward terms that are combined to produce the total reward.
        It also can contain information that distinguishes truncation and termination, however this is deprecated in favour
        of returning two booleans, and will be removed in a future version.

    (deprecated)
    done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results.
        A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully,
        a certain timelimit was exceeded, or the physics simulation has entered an invalid state.

|    | **Details** |
| -- | ----------- |
| action | Action to be executed on the environment. Should be the column in which the agent wants to place the dice. |
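
The inherited docstring above mentions two return formats: a 5-tuple with `terminated` and `truncated`, and the deprecated 4-tuple with `done` that this page uses. If code has to cope with both, one option is a small adapter such as the hypothetical `unpack_step` below, which collapses either format into `(observation, reward, done, info)`.

```python
def unpack_step(result):
    """Normalise a step result to (observation, reward, done, info)."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # The episode is over if it terminated or was truncated.
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


print(unpack_step(("obs", 1.0, False, True, {})))
print(unpack_step(("obs", 1.0, False, {})))
```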

# Render

> Lastly, all that is left is rendering the environment.

As we have already built a fairly decent `__repr__` method, we are going to rely on it alone. It would be nice to get something fancier running with *PyGame*, though.
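
A `__repr__`-based `render` could look roughly like the sketch below. `BoardView` is a hypothetical class, not the actual `MatatenaEnv`; the choice of supported modes and the `render_fps` value are assumptions, while the `metadata` keys follow the convention described in the gym docs linked at the top of this page.

```python
class BoardView:
    """Hypothetical class showing a `__repr__`-based `render`."""

    # Supported render modes and frame rate (values assumed for the example).
    metadata = {"render_modes": ["human", "ansi"], "render_fps": 4}

    def __init__(self, render_mode=None):
        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode
        self.board = [[0] * 3 for _ in range(3)]

    def __repr__(self):
        # One text line per board row.
        return "\n".join(" ".join(str(c) for c in row) for row in self.board)

    def render(self):
        if self.render_mode is None:
            return None        # nothing is rendered
        if self.render_mode == "ansi":
            return repr(self)  # hand the text back to the caller
        print(self)            # "human": show it and return None
        return None


text = BoardView(render_mode="ansi").render()
print(text)
```

The `None` mode needs no entry in `metadata`; it is handled by simply returning early.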

In [4]:
#|output: asis
#| echo: false
show_doc(MatatenaEnv.render)

---

[source](https://github.com/Jorgvt/matatena/blob/main/matatena/gym.py#L98){target="_blank" style="float:right; font-size:smaller"}

### MatatenaEnv.render

>      MatatenaEnv.render ()

Compute the render frames as specified by render_mode attribute during initialization of the environment.

The set of supported modes varies per environment. (And some
third-party environments may not support rendering at all.)
By convention, if render_mode is:

- None (default): no render is computed.
- human: render return None.
  The environment is continuously rendered in the current display or terminal. Usually for human consumption.
- single_rgb_array: return a single frame representing the current state of the environment.
  A frame is a numpy.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
- rgb_array: return a list of frames representing the states of the environment since the last reset.
  Each frame is a numpy.ndarray with shape (x, y, 3), as with single_rgb_array.
- ansi: Return a list of strings (str) or StringIO.StringIO containing a
  terminal-style text representation for each time step.
  The text can include newlines and ANSI escape sequences (e.g. for colors).

Note:
    Rendering computations is performed internally even if you don't call render().
    To avoid this, you can set render_mode = None and, if the environment supports it,
    call render() specifying the argument 'mode'.

Note:
    Make sure that your class's metadata 'render_modes' key includes
    the list of supported modes. It's recommended to call super()
    in implementations to use the functionality of this method.

# Usage

> Simple usage examples.

In [None]:
env = MatatenaEnv()
obs, info = env.reset()
env.render()
print(f"Rolled dice is: {obs['dice']}")

Player 1 (0.0) | Player 2 (0.0) *
[[0. 0. 0.]    | [[0. 0. 0.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Rolled dice is: 6


In [None]:
action = env.action_space.sample()
print(f"Placing the dice in column: {action}")
obs, reward, done, info = env.step(action)
env.render()

Placing the dice in column: 2
Player 1 (0.0) * | Player 2 (6.0)
[[0. 0. 0.]      | [[0. 0. 6.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  


We can even perform a full game:

In [None]:
env = MatatenaEnv()
obs, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()

Player 1 (0.0) * | Player 2 (4.0)
[[0. 0. 0.]      | [[0. 0. 4.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (2.0) | Player 2 (4.0) *
[[2. 0. 0.]    | [[0. 0. 4.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Player 1 (2.0) * | Player 2 (5.0)
[[2. 0. 0.]      | [[1. 0. 4.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (3.0) | Player 2 (5.0) *
[[2. 0. 1.]    | [[1. 0. 4.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Player 1 (3.0) * | Player 2 (7.0)
[[2. 0. 1.]      | [[1. 2. 4.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (5.0) | Player 2 (5.0) *
[[2. 2. 1.]    | [[1. 0. 4.]     
 [0. 0. 0.]    |  [0. 0. 0.]     
 [0. 0. 0.]]   |  [0. 0. 0.]]    
Player 1 (3.0) * | Player 2 (7.0)
[[2. 0. 1.]      | [[1. 2. 4.]   
 [0. 0. 0.]      |  [0. 0. 0.]   
 [0. 0. 0.]]     |  [0. 0. 0.]]  
Player 1 (8.0) | Player 2 (7.0) *
[[2. 0. 1.]   