# Maze runner
In this notebook we will cover the basics of a reinforcement learning (RL) environment.

Specifically, we will cover the observation, action, and state spaces, using a maze as the running example.

In [None]:
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

from qgym.spaces import Discrete
from qgym.templates import Environment, Rewarder, State

### Maze layout

In this notebook we will design our own maze, which the agent will learn to navigate through.
This maze will have 4 different field types.

- `F`: a free field
- `W`: a wall
- `G`: the goal
- `S`: possible start position(s); also a free field

### To Do
Design a fun 4-by-4 maze by making a list of four 4-letter strings. For example, `["FSFF", "SWFW", "FFFW", "WFFG"]` would be a fine maze.
Note that a maze may have multiple starting points.
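
Before using a hand-written layout, it can help to sanity-check it. The snippet below is a minimal sketch and not part of `qgym`: the helper name `maze_problems` is hypothetical, and it assumes that a usable maze needs at least one `S` and at least one `G`.

```python
ALLOWED_FIELDS = set("FWGS")  # free field, wall, goal, start


def maze_problems(maze_map, size=4):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    if len(maze_map) != size or any(len(row) != size for row in maze_map):
        problems.append(f"expected {size} strings of length {size}")

    fields = "".join(maze_map)
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        problems.append(f"unknown field types: {sorted(unknown)}")
    if "S" not in fields:
        problems.append("no start position")
    if "G" not in fields:
        problems.append("no goal")
    return problems


print(maze_problems(["FSFF", "SWFW", "FFFW", "WFFG"]))
```

An empty list for the example maze means none of these checks failed; it says nothing about whether the goal is actually reachable.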

In [None]:
maze_map_4x4 = []  # Make your own maze
print("\n".join(maze_map_4x4))

### Environment spaces

A reinforcement learning environment consists of several spaces that describe its workings:
- State space
- Action space
- Observation space

#### State space
The state space contains the complete information of the current state of the environment.
The state space is for internal use and is not used by the agent.

In our maze environment the state consists of five items:
- `position`: the current position of the agent in the maze
- `maze_map`: a map of the maze
- `start_position_distribution`: the probability distribution over the positions at which an episode can start
- `nrows`: the number of rows in the maze grid
- `ncols`: the number of columns in the maze grid

#### Action space
The action space describes all potential actions that an agent can take in this environment.

In our maze environment there are only 4 such actions:
- `0`: UP
- `1`: RIGHT
- `2`: DOWN
- `3`: LEFT
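
To keep these integer codes readable in later code, one option is to give them names. This is only an illustrative sketch: the `Action` enum is a hypothetical helper and is not required by `qgym` or `stable_baselines3`.

```python
from enum import IntEnum


class Action(IntEnum):
    """Named versions of the four integer actions listed above."""

    UP = 0
    RIGHT = 1
    DOWN = 2
    LEFT = 3


# An IntEnum member compares equal to its integer value,
# so it can be used wherever the plain integer is expected.
print(Action(2).name)
print(Action.LEFT == 3)
```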

#### Observation space
The observation space describes all potential observations that an agent can obtain from the environment. It is generally a subset (or a transformation thereof) of the state space.

In our case the observation space contains all possible `position`s the agent can be at.
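
As an illustration of what "all possible positions" can mean, the sketch below assumes that a position is a single integer obtained by numbering the fields row by row (row-major order). The function names are hypothetical and chosen so they do not clash with the methods you will write later.

```python
NROWS, NCOLS = 4, 4


def rowcol_to_index(row, col, ncols=NCOLS):
    """Number the fields row by row: (0, 0) -> 0, (0, 1) -> 1, ..., (3, 3) -> 15."""
    return row * ncols + col


def index_to_rowcol(index, ncols=NCOLS):
    """Inverse of rowcol_to_index."""
    return divmod(index, ncols)


# Under this encoding a 4x4 maze has 16 possible observations.
all_positions = list(range(NROWS * NCOLS))
print(len(all_positions), rowcol_to_index(1, 0), index_to_rowcol(15))
```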

#### Constraints on the spaces
Reinforcement learning agents often accept only a limited number of data types.
The most commonly supported data types are `int`, `float`, `char` and arrays of these types.
Therefore, we will transform the fun maze above into such an array.
Furthermore, when setting up the environment, we will need a starting position.
Since we allow for multiple starting positions, we should determine a probability distribution to say where we start in an episode.

In [None]:
# we can transform our map to a 4x4 array of chars as follows:
transformed_maze = np.asarray(maze_map_4x4, dtype="c")
print(f"Maze:\n{transformed_maze}\n")

# we can compute a probability distribution over the start positions as follows
distribution = (transformed_maze == b"S").ravel().astype("float64")
distribution /= distribution.sum()
print(f"Distribution:\n{distribution}")

### To Do
Set up the state space by setting the attributes `position`, `maze_map`, `start_position_distribution`, `nrows` and `ncols`.

In [None]:
class MazeState(State):
    def __init__(self, maze_map):
        maze_map = np.asarray(maze_map, dtype="c")

        self.start_position_distribution = (maze_map == b"S").ravel().astype(float)
        self.start_position_distribution /= self.start_position_distribution.sum()

        # Set the other internal attributes
        # self.nrows =
        # self.ncols =
        # self.position =
        # self.maze_map =

### Initial position
At the beginning of each episode, the environment should provide the agent with a fresh start, without any leftovers from the previous episode.

In this environment, the fresh start consists of a randomly selected initial position from all possible _start positions_.

### To Do
First implement two utility methods `rowcol_to_pos` and `pos_to_rowcol`.
Subsequently, implement the `reset` method (for which you will need the `pos_to_rowcol` method).

_Hint: Each environment has a random number generator `self.rng` with a [`choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html) method._
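
The following is a small standalone sketch of how `choice` can draw a start position from a probability distribution. It uses its own generator from `np.random.default_rng` instead of the environment's `self.rng`, builds the character grid with a list comprehension rather than `dtype="c"`, and assumes positions are numbered row by row; all of these are assumptions made for illustration.

```python
import numpy as np

maze = ["FSFF", "SWFW", "FFFW", "WFFG"]
grid = np.array([list(row) for row in maze])  # 4x4 array of single characters

# Uniform distribution over all fields marked "S", flattened row by row.
distribution = (grid == "S").ravel().astype(float)
distribution /= distribution.sum()

rng = np.random.default_rng(seed=0)  # stand-in for the environment's rng
position = int(rng.choice(distribution.size, p=distribution))

# Convert the flat index back to a (row, col) pair to inspect the field.
row, col = divmod(position, grid.shape[1])
print(position, (row, col), grid[row, col])
```

Whatever index is drawn, the field at that index is a start field, because every other field has probability zero.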

In [None]:
class MazeState(MazeState):
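    def rowcol_to_pos(self, row, col):
        # todo: transform (row, col) pair to an integral position
        raise NotImplementedError  # placeholder, remove once implemented

    def pos_to_rowcol(self, pos):
        # todo: inverse transformation
        raise NotImplementedError  # placeholder, remove once implemented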

    def reset(self, *, seed=None, return_info=False):
        # todo: reset the position to a new random one

        return self

### First steps
To get out of the maze, the agent needs to be able to move to an adjacent field. Recall that we defined our action space as:

- `0`: UP
- `1`: RIGHT
- `2`: DOWN
- `3`: LEFT

### To Do
Write the `update_state` method, which should update the state according to the given action.

_Note: How should we deal with illegal actions (i.e. falling off the grid, bumping into a wall)?_
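
One possible convention for illegal actions is to leave the agent where it is. The sketch below illustrates that choice as a plain function on (row, col) pairs; the name `move` is hypothetical, row 0 is assumed to be the top row, and other conventions (for example ending the episode) are equally valid.

```python
MAZE = ["FSFF", "SWFW", "FFFW", "WFFG"]

# (row change, column change) for UP, RIGHT, DOWN, LEFT, indexed by action.
DELTAS = ((-1, 0), (0, 1), (1, 0), (0, -1))


def move(maze, row, col, action):
    """Return the new (row, col); illegal moves leave the position unchanged."""
    d_row, d_col = DELTAS[action]
    new_row, new_col = row + d_row, col + d_col

    off_grid = not (0 <= new_row < len(maze) and 0 <= new_col < len(maze[0]))
    if off_grid or maze[new_row][new_col] == "W":
        return row, col
    return new_row, new_col


print(move(MAZE, 0, 1, 0))  # UP from the top row: off the grid
print(move(MAZE, 0, 1, 2))  # DOWN into a wall
print(move(MAZE, 0, 1, 1))  # RIGHT onto a free field
```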

In [None]:
class MazeState(MazeState):
    def update_state(self, action):
        # todo
        raise NotImplementedError  # placeholder, remove once implemented

### Observational awareness
Our environment is nearly done: with the `update_state` method, the agent can move through the maze.
However, remember that the agent can't see the state space!
We still need to provide our agent with a set of 'eyes', i.e., observations.

Specifically we need to inform the agent of 2 more things:

- The current position (our observation).
- Whether we have reached the exit (are we done?).
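
The "are we done?" check can be sketched independently of any environment class. The helper below is hypothetical and assumes that the position is a row-by-row integer index into the maze.

```python
MAZE = ["FSFF", "SWFW", "FFFW", "WFFG"]


def is_goal(maze, position):
    """Return True if the field at the given flat position is the goal."""
    row, col = divmod(position, len(maze[0]))
    return maze[row][col] == "G"


print(is_goal(MAZE, 15), is_goal(MAZE, 1))
```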

### To Do
Implement the `create_observation_space`, `_obtain_observation` and `_is_done` methods.

_Hint: `qgym` provides a ready-to-use `Discrete` action/observation space._

In [None]:
class Maze(Maze):
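    def create_observation_space(self):
        # todo
        raise NotImplementedError  # placeholder, remove once implemented

    def _obtain_observation(self):
        # todo
        raise NotImplementedError  # placeholder, remove once implemented

    def _is_done(self):
        # todo
        raise NotImplementedError  # placeholder, remove once implemented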

    ### This method is given 'glue code' ###
    def _obtain_info(self):
        return {}

### Carrot and Stick
The final step towards completing our `Maze` environment is about giving feedback. We can give 3 types of feedback:
- positive feedback (carrot)
- negative feedback (stick)
- neutral feedback

In reinforcement learning, feedback is given by means of rewards. The choice of reward function strongly influences how well the agent can learn.

Only providing rewards might lead to slow training and too much exploration. However, big penalties could make the agent skip exploration and stick to a safe, potentially non-optimal path.

Below is room for the implementation of two rewarders, `CarrotOnly` (1) and `CarrotAndSticks` (2):

1. Provides a positive reward only when the goal is reached. Does nothing otherwise.
2. Provides a positive reward when the goal is reached. Gives a negative reward (penalty) when the agent bumps into a wall. Otherwise, nothing.
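
To make the difference between the two schemes concrete, here is a sketch of both as plain functions. The reward values `1.0` and `-0.1` are hypothetical, the function names are illustrative, and the boolean arguments stand in for whatever your rewarder derives from the old and new state.

```python
GOAL_REWARD = 1.0  # hypothetical value
WALL_PENALTY = -0.1  # hypothetical value


def carrot_only(reached_goal):
    """Scheme 1: reward only for reaching the goal."""
    return GOAL_REWARD if reached_goal else 0.0


def carrot_and_sticks(reached_goal, bumped_into_wall):
    """Scheme 2: reward for the goal, penalty for bumping into a wall."""
    if reached_goal:
        return GOAL_REWARD
    return WALL_PENALTY if bumped_into_wall else 0.0


print(carrot_only(False), carrot_and_sticks(False, True))
```

With these values, the smallest and largest rewards are `0.0` and `1.0` for the first scheme and `-0.1` and `1.0` for the second.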

We have also included code blocks for training an agent on the `Maze` environment with either rewarder.

### To Do
#### Carrot only
In the first block below we can implement the first rewarder:

1. Provides a positive reward only when the goal is reached. Does nothing otherwise.

After defining the rewarder we can train an agent by running the second code block.

In [None]:
class CarrotOnly(Rewarder):
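    def __init__(self):
        # todo
        # self._reward_range = (0, 0)
        raise NotImplementedError  # placeholder, remove once implemented

    def compute_reward(self, old_state, action, new_state):
        # todo: return reward
        raise NotImplementedError  # placeholder, remove once implemented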

In [None]:
# define environment and rewarder
env = Maze(maze_map_4x4)
env.rewarder = CarrotOnly()

# ensure that we have implemented our environment correctly
check_env(env, warn=True)

# define and train our model
# todo: feel free to pick another model or policy, for more information see https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
model = PPO("MlpPolicy", env, verbose=1)
model.learn(int(1e5))

### To Do
#### Carrot and sticks
In the first block below we can implement the second rewarder:

2. Provides a positive reward when the goal is reached. Gives a negative reward (penalty) when the agent bumps into a wall. Otherwise, nothing.

After defining the rewarder we can train an agent by running the second code block.

In [None]:
class CarrotAndSticks(Rewarder):
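    def __init__(self):
        # todo
        # self._reward_range = (0, 0)
        raise NotImplementedError  # placeholder, remove once implemented

    def compute_reward(self, old_state, action, new_state):
        # todo: return reward
        raise NotImplementedError  # placeholder, remove once implemented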

In [None]:
# todo: test your environment like above