# Maze runner
In this notebook we will cover the basics of a reinforcement learning (RL) environment.

Specifically, we will cover the observation, action, and state spaces, using a maze as an example.

In [None]:
import numpy as np
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

from qgym.environment import Environment
from qgym.rewarder import Rewarder

### Maze layout

Our maze will have 4 different field types.

- `F`: a free field
- `W`: a wall
- `G`: the goal
- `S`: possible start position(s); also a free field

In [None]:
maze_map_4x4 = ["FSFF", "SWFW", "FFFW", "WFFG"]
print("\n".join(maze_map_4x4))
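
To make the four field types concrete, the following minimal sketch lists the coordinates of each type in the 4x4 map. The helper `find_cells` is a hypothetical name introduced only for this illustration; it is not part of `qgym` or Gym.

```python
maze_map = ["FSFF", "SWFW", "FFFW", "WFFG"]


def find_cells(maze_map, field_type):
    """Return the (row, col) coordinates of all fields of the given type."""
    return [
        (row, col)
        for row, line in enumerate(maze_map)
        for col, cell in enumerate(line)
        if cell == field_type
    ]


starts = find_cells(maze_map, "S")  # possible start positions
goals = find_cells(maze_map, "G")  # the goal
walls = find_cells(maze_map, "W")  # fields the agent cannot enter

print("starts:", starts)
print("goals:", goals)
print("walls:", walls)
```

Rows are counted from the top and columns from the left, which matches how the map is printed.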

### Environment spaces

A reinforcement learning environment is described by several spaces:
- State space
- Action space
- Observation space

#### State space
The current position of the agent together with the map of the maze.

#### Action space

- `0`: UP
- `1`: RIGHT
- `2`: DOWN
- `3`: LEFT

#### Observation space
The current position of the agent, encoded as a single integer.

_Hint: OpenAI Gym provides a ready-to-use [`Discrete`](https://www.gymlibrary.ml/content/spaces/#discrete) action/observation space._

In [None]:
class Maze(Environment):
    def __init__(self, maze_map):
        # convert the list of strings into a 2D array of single-byte characters
        maze_map = np.asarray(maze_map, dtype="c")

        self.nrows = maze_map.shape[0]
        self.ncols = maze_map.shape[1]

        # uniform distribution over all start fields (flattened row by row)
        self.start_position_distribution = (maze_map == b"S").ravel().astype("float64")
        self.start_position_distribution /= self.start_position_distribution.sum()

        self.action_space = gym.spaces.Discrete(4)  # {0,1,2,3}
        self.observation_space = gym.spaces.Discrete(self.nrows * self.ncols)
        self._state = {"position": None, "maze_map": maze_map}
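
The start position distribution assigns equal probability to every `S` field and zero to all other fields. The sketch below reproduces that idea with the standard library only. Note the assumption: `random.Random` stands in for the environment's `self.rng`, so this is an illustration of the normalisation step rather than the environment's own sampling code.

```python
import random

maze_map = ["FSFF", "SWFW", "FFFW", "WFFG"]

# Flatten the map row by row; start fields get weight 1, all others 0.
cells = "".join(maze_map)
weights = [1.0 if cell == "S" else 0.0 for cell in cells]

# Normalise so the weights sum to 1 (uniform over the start fields).
total = sum(weights)
distribution = [weight / total for weight in weights]

# Sample one flattened start index (stand-in for self.rng.choice).
rng = random.Random(0)
start_index = rng.choices(range(len(cells)), weights=distribution, k=1)[0]

print(distribution)
print(start_index)
```

With two start fields in the 4x4 map, each one is selected with probability 0.5.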

### Initial position
At the beginning of each episode, the environment should provide the agent with a fresh start, without any leftovers from the previous episode.

In this environment, the fresh start consists of a randomly selected initial position from all possible _start positions_.

_Hint: Each environment has a random number generator `self.rng` with a [`choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html) method._

In [None]:
class Maze(Maze):
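    def rowcol_to_pos(self, row, col):
        # row-major index: every row holds self.ncols fields
        return row * self.ncols + col

    def pos_to_rowcol(self, pos):
        # inverse of rowcol_to_pos; int() also converts NumPy integers
        return divmod(int(pos), self.ncols)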

    def reset(self, *, seed=None, return_info=False):
        start_position = self.rng.choice(
            self.nrows * self.ncols, p=self.start_position_distribution
        )
        self._state["position"] = self.pos_to_rowcol(start_position)

        return super().reset(seed=seed, return_info=return_info)
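
In a row-major layout, the flattened position of a field depends on the number of columns, which only becomes visible when the maze is not square. The following standalone sketch checks the round trip on a 2x5 grid. The free functions take `ncols` as an explicit argument and are hypothetical simplifications of the class methods.

```python
def rowcol_to_pos(row, col, ncols):
    """Map (row, col) to a flattened, row-major index."""
    return row * ncols + col


def pos_to_rowcol(pos, ncols):
    """Map a flattened, row-major index back to (row, col)."""
    return divmod(pos, ncols)


nrows, ncols = 2, 5  # deliberately non-square

positions = []
for row in range(nrows):
    for col in range(ncols):
        pos = rowcol_to_pos(row, col, ncols)
        # converting back must return the original coordinates
        assert pos_to_rowcol(pos, ncols) == (row, col)
        positions.append(pos)

print(positions)
```

Every field receives a unique index between `0` and `nrows * ncols - 1`, which is exactly the range a `Discrete(nrows * ncols)` space covers.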

### First steps
To get out of the maze, the agent needs to be able to move to an adjacent field. Recall that we defined our action space as:

- `0`: UP
- `1`: RIGHT
- `2`: DOWN
- `3`: LEFT

_Note: How should we deal with bumping into the wall?_

In [None]:
class Maze(Maze):
    def _update_state(self, action):
        row, col = self._state["position"]

        # compute new position
        if action == 0:  # up
            row = max(row - 1, 0)
        elif action == 1:  # right
            col = min(col + 1, self.ncols - 1)
        elif action == 2:  # down
            row = min(row + 1, self.nrows - 1)
        elif action == 3:  # left
            col = max(col - 1, 0)
        else:
            raise ValueError("Invalid action supplied.")

        # go to new position if it is not a wall
        if self._state["maze_map"][row][col] != b"W":
            self._state["position"] = (row, col)
        # else we stay where we are
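
The same movement rule can be tried outside the environment. The sketch below is a simplified, standalone version: `step` and `MOVES` are hypothetical names, and the maze is a plain list of strings rather than the byte array used in the class. Moves are clamped to the grid, and a move into a wall leaves the position unchanged.

```python
maze_map = ["FSFF", "SWFW", "FFFW", "WFFG"]

# action -> (row offset, column offset)
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def step(maze_map, position, action):
    """Return the position after taking `action` from `position`."""
    if action not in MOVES:
        raise ValueError("Invalid action supplied.")

    nrows, ncols = len(maze_map), len(maze_map[0])
    d_row, d_col = MOVES[action]

    # clamp the target to the grid so the agent cannot leave the maze
    row = min(max(position[0] + d_row, 0), nrows - 1)
    col = min(max(position[1] + d_col, 0), ncols - 1)

    # walls block movement: stay where we are
    if maze_map[row][col] == "W":
        return position
    return (row, col)


print(step(maze_map, (0, 1), 2))  # down into a wall
print(step(maze_map, (0, 1), 0))  # up against the border
print(step(maze_map, (0, 1), 1))  # right onto a free field
```

In both blocked cases the returned position equals the old one, which is what a penalty for bumping can later detect.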

### Observational awareness
Our environment is nearly done, but we still need to provide our agent with a set of 'eyes'.

Specifically, we need to inform the agent of 2 more things:

- The current position (our observation).
- Whether we have reached the exit (are we done?).

In [None]:
class Maze(Maze):
    def _obtain_observation(self):
        return self.rowcol_to_pos(*self._state["position"])

    def _is_done(self):
        row, col = self._state["position"]
        return self._state["maze_map"][row][col] == b"G"

    def _obtain_info(self):
        return {}

    def _compute_reward(self, old_state, action):
        return super()._compute_reward(
            old_state=old_state, action=action, new_state=self._state
        )

### Carrot and Stick
The final step towards completing our `Maze` environment is about giving feedback. We can give 3 types of feedback:
- positive feedback (carrot)
- negative feedback (stick)
- neutral feedback

In reinforcement learning, feedback is given by means of rewards. The choice of rewarder has a large influence on how well the agent learns.

Providing only positive rewards might lead to slow training and too much exploration. However, big penalties could make the agent skip exploration and stick to a safe, potentially non-optimal path.

Below is room for two rewarders `CarrotOnly` (1) and `CarrotAndSticks` (2):

1. Provides a positive reward only when the goal is reached. Does nothing otherwise.
2. Provides a positive reward when the goal is reached. Gives a negative reward (penalty) when the agent bumps into a wall. Otherwise, nothing.

In [None]:
class CarrotOnly(Rewarder):
    def __init__(self):
        self._reward_range = (0, 1)

    def compute_reward(self, old_state, action, new_state):
        row, col = new_state["position"]
        if new_state["maze_map"][row][col] == b"G":
            return 1
        else:
            return 0

In [None]:
class CarrotAndSticks(Rewarder):
    def __init__(self):
        self._reward_range = (-1, 10)

    def compute_reward(self, old_state, action, new_state):
        row, col = new_state["position"]

        if new_state["position"] == old_state["position"]:
            return -1
        elif new_state["maze_map"][row][col] == b"G":
            return 10
        else:
            return 0
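
To see how the two reward schemes differ, the sketch below sums the rewards along one hand-written trajectory through the 4x4 maze. The functions are simplified stand-ins for the rewarders: they take positions instead of state dictionaries, and the trajectory is an assumption chosen for illustration (one bump into a wall, then a direct path to the goal).

```python
maze_map = ["FSFF", "SWFW", "FFFW", "WFFG"]


def carrot_only(old_pos, new_pos, maze_map):
    """Reward 1 for reaching the goal, 0 otherwise."""
    row, col = new_pos
    return 1 if maze_map[row][col] == "G" else 0


def carrot_and_sticks(old_pos, new_pos, maze_map):
    """Penalty -1 for not moving, reward 10 for the goal, 0 otherwise."""
    row, col = new_pos
    if new_pos == old_pos:
        return -1
    if maze_map[row][col] == "G":
        return 10
    return 0


def total_reward(reward_fn, trajectory, maze_map):
    """Sum the (undiscounted) rewards over consecutive position pairs."""
    return sum(
        reward_fn(old, new, maze_map)
        for old, new in zip(trajectory, trajectory[1:])
    )


# start at (0, 1), bump into the wall below, then walk to the goal
trajectory = [(0, 1), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 3)]

print(total_reward(carrot_only, trajectory, maze_map))
print(total_reward(carrot_and_sticks, trajectory, maze_map))
```

Under the carrot-only scheme the bump costs nothing, whereas the carrot-and-sticks scheme lowers the total by one for each step in which the agent does not move.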

### Training an agent

Below is room to train an agent on our environment using either of the two rewarders.

In [None]:
# define environment and rewarder
env = Maze(maze_map_4x4)
env.rewarder = CarrotOnly()

# ensure that we have implemented our environment correctly
check_env(env, warn=True)

# define and train our model
model = PPO("MlpPolicy", env, verbose=1)
model.learn(int(1e5))

In [None]:
# define environment and rewarder
env = Maze(maze_map_4x4)
env.rewarder = CarrotAndSticks()

# ensure that we have implemented our environment correctly
check_env(env, warn=True)

# define and train our model
model = PPO("MlpPolicy", env, verbose=1)
model.learn(int(1e5))