A policy *π<sub>t</sub>(s,a)* is implemented in this Jupyter Notebook to demonstrate how a reinforcement learning agent interacts with the maze model. It is a simple policy that samples a random action from the environment's action space at each step and updates the agent's location accordingly. More sophisticated policies can store the states visited and the actions taken so that they learn from past behaviour. Such policies are covered in detail in later tasks; this task demonstrates how to implement a policy in an environment using Python and the OpenAI Gym framework.
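
To make the idea of a uniform random policy concrete before the maze environment is installed, the following is a minimal sketch that does not depend on Gym. It assumes a hypothetical 5x5 open grid with the goal at `(4, 4)` and a reward of -1 per step. The names `GRID_SIZE`, `GOAL`, `ACTIONS`, `random_policy`, `step` and `run_episode` are illustrative only and are not part of rl-gym-maze.

```python
import random

# Hypothetical stand-in for the maze: a 5x5 open grid with no inner walls.
# This is NOT the rl-gym-maze environment; all names and sizes are assumptions.
GRID_SIZE = 5
GOAL = (4, 4)
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def random_policy(state, rng):
    """Uniform random policy: pi(s, a) = 1 / |A| for every state s."""
    return rng.choice(list(ACTIONS))


def step(state, action):
    """Apply an action; a move that would leave the grid keeps the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), GRID_SIZE - 1)
    col = min(max(state[1] + d_col, 0), GRID_SIZE - 1)
    next_state = (row, col)
    # Assumed reward scheme: -1 for every step taken.
    return next_state, -1, next_state == GOAL


def run_episode(limit=1000, seed=0):
    """Random-walk from (0, 0); return the number of steps taken, or None on timeout."""
    rng = random.Random(seed)
    state = (0, 0)
    for t in range(limit):
        state, reward, done = step(state, random_policy(state, rng))
        if done:
            return t + 1
    return None
```

Calling `run_episode()` returns the number of steps the random walk needed to reach the goal, or `None` if the step limit ran out first. Because the policy ignores `state`, it never improves with experience, which is the limitation the later tasks address.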

In [None]:
# Install rl-gym-maze environment from GitHub.
!pip install gym
!git clone https://github.com/AngusMaiden/rl-gym-maze
!pip install -e ./rl-gym-maze

In [None]:
# Run this code if using this notebook in Google Colab. Restarts the runtime after installing rl-gym-maze.
import os

def restart_runtime():
    # Send signal 9 (SIGKILL) to the current process so that Colab restarts the kernel.
    os.kill(os.getpid(), 9)

restart_runtime()

In [2]:
# Import the necessary libraries.
import gym
import numpy as np
import random
from random import seed
import time

# Invoke the model maze environment.
env = gym.make('rl_gym_maze:rl-gym-maze-v0')
state = env.reset()

In [7]:
# Define a simple policy which can solve the maze.
# This is a random policy. It moves to another room randomly and stops when it
# reaches the goal location, displaying how many steps it took to get there.

limit = 1000
state = env.reset()
for t in range(limit):
    state, reward, done, info = env.step(env.action_space.sample())
    print(f'Step: {t+1:4} | Location: {state} | Reward received: {reward}')

    if done:
        print(f'\nReached the goal in {t+1} steps by random walking.')
        break
else:
    print('\nTime limit exceeded. Please try again.')

Step:    1 | Location: (0, 0) | Reward received: -1
Step:    2 | Location: (0, 0) | Reward received: -1
Step:    3 | Location: (0, 1) | Reward received: -1
Step:    4 | Location: (0, 0) | Reward received: -1
Step:    5 | Location: (0, 0) | Reward received: -1
Step:    6 | Location: (0, 0) | Reward received: -1
Step:    7 | Location: (0, 0) | Reward received: -1
Step:    8 | Location: (0, 1) | Reward received: -1
Step:    9 | Location: (0, 1) | Reward received: -1
Step:   10 | Location: (0, 2) | Reward received: -1
Step:   11 | Location: (0, 2) | Reward received: -1
Step:   12 | Location: (0, 3) | Reward received: -1
Step:   13 | Location: (0, 4) | Reward received: -1
Step:   14 | Location: (0, 3) | Reward received: -1
Step:   15 | Location: (0, 3) | Reward received: -1
Step:   16 | Location: (0, 2) | Reward received: -1
Step:   17 | Location: (0, 2) | Reward received: -1
Step:   18 | Location: (0, 2) | Reward received: -1
Step:   19 | Location: (0, 3) | Reward received: -1
Step:   20 |