## This is a quick demonstration of some of the functionalities of the gridworld package

There are 4 main classes within this package:
- a GridWorld class that defines a grid
- a Transitions_Probs class that defines the transition probabilities of the grid
- a Reward class that defines the rewards associated with the grid
- an Agent class that defines an agent

### 1) Grid

An instance of a grid can be defined by specifying the height and width.

In [2]:
from gridworld.Grid import GridWorld

height = 4 
width = 4

grid = GridWorld(height, width)
grid.print_grid() # simple printing function to visualize the grid

|1	2	3	4	|
|5	6	7	8	|
|9	10	11	12	|
|13	14	15	16	|


Now we proceed to environment dynamics: 

### 2) Transitions

Transition probabilities are defined over a set of actions on a grid. An instance can be created as follows:

In [3]:
from gridworld.Transitions import Transitions_Probs

# define the actions 
actions = ["up","down","left","right"]

tp = Transitions_Probs(grid,actions) # transitions are defined over a grid given a set of actions

The transitions are basically a 3-D matrix indexed as [states][actions][states], which captures the usual notion of transitions in an MDP: P(s' | s, a).

You are therefore free to define any set of actions over this gridworld, with any desired outcomes, as long as you can construct the corresponding 3-D matrix.

For example, suppose I want an action "jump" that moves me 2 spaces in the grid; the transition matrix has to represent this. Say I'm in state 1, I use the action "jump right", and the desired outcome always occurs (i.e. no stochasticity involved). Then in the 3-D matrix I would set [1]["jump right"][3] = 1, which gives a probability of 1 of ending up in state 3 given that I jumped right from state 1. This would then have to be done for all (state, action, state) elements in the matrix.
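The "jump right" example can be sketched in plain NumPy as follows. This only illustrates the [states][actions][states] layout: the 0-based storage (state k kept at index k - 1), the `jump_right` helper, and the choice to stay in place when a jump would leave the row are assumptions made for this sketch, not the package's internal representation.

```python
import numpy as np

height, width = 4, 4
n_states = height * width
actions = ["jump right"]

# P[s, a, s2] = P(s2 | s, a); state k is stored at index k - 1 (assumption)
P = np.zeros((n_states, len(actions), n_states))

def jump_right(state, width=width):
    """Hypothetical helper: move 2 cells right, or stay put if that would leave the row."""
    col = (state - 1) % width
    return state + 2 if col + 2 < width else state

a = actions.index("jump right")
for s in range(1, n_states + 1):
    P[s - 1, a, jump_right(s) - 1] = 1.0  # deterministic outcome

print(P[0, a, 2])  # probability of reaching state 3 from state 1
```

Every row P[s, a, :] sums to 1, which is a quick sanity check for any hand-built transition matrix.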

In this way, arbitrary actions and transitions can be specified. To simplify this for the user, there are functions that directly create common transitions, assuming the usual "up", "down", "left", "right" actions.

In [4]:
tp.create_common_transition("Deterministic") # no stochasticity: the agent always moves where it wants to go

In [5]:
p = 0.7
tp.create_common_transition( ("Bernoulli", p) ) # the agent moves in the desired direction with probability p;
                                               # with probability (1-p) it stays where it is
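To make the "Bernoulli" description concrete, here is a minimal NumPy sketch of such a transition matrix. It is not the package's implementation: the 0-based indexing, the helper names `intended_next` and `bernoulli_transitions`, and the rule that bumping into a wall keeps the agent in place are assumptions of this sketch.

```python
import numpy as np

height, width = 4, 4
n_states = height * width
actions = ["up", "down", "left", "right"]
# (row, column) offset for each action
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def intended_next(index, action):
    """0-based index reached if the move succeeds; a wall keeps the agent in place (assumption)."""
    row, col = divmod(index, width)
    d_row, d_col = moves[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < height and 0 <= new_col < width:
        return new_row * width + new_col
    return index

def bernoulli_transitions(p):
    """Build P[s, a, s2] where the move succeeds with probability p."""
    P = np.zeros((n_states, len(actions), n_states))
    for s in range(n_states):
        for a, action in enumerate(actions):
            P[s, a, intended_next(s, action)] += p  # move succeeds
            P[s, a, s] += 1.0 - p                   # agent stays where it is
    return P

P = bernoulli_transitions(0.7)
```

Using `+=` rather than `=` matters at the walls, where the intended state and the current state coincide and the two probabilities must add up.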


In [6]:
tp.create_common_transition( ("Random", p)) # Similar to "Bernoulli" except you move in a random other direction
                                            # with probability (1-p)

In [7]:
terminal_states = [16] 
tp.add_terminal_states(terminal_states)

### 3) Rewards

Similar to transitions, rewards are also defined as a 3-D matrix. This allows the user to build rather complex reward functions if they wish. They are defined on a grid for a given set of actions.

Also similar to transitions, there is a function for commonly used rewards called "common_reward". Here, we assume the rewards to be fixed constants for a given state (not drawn from a distribution, although one could create a 3-D matrix that does this). They can be created as follows:

In [8]:
from gridworld.Rewards import Reward

reward_env = Reward(grid, actions) # create the reward_env for this grid environment given the set of actions

# If the rewards are constant for a state, this can be specified in a dictionary 
#{ state1:reward1 , state2:reward2 , ...} 

defined_reward = {1:1 , 4:10} # Here, at state 1 I have a reward of 1 and at state 4 I have a reward of 10.
# now create the environment with the given rewards
reward_env.common_reward(defined_reward)

Note that the rewards must be passed as a dictionary for the function to work.
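As a rough illustration of how a state-to-reward dictionary could be expanded into a 3-D reward matrix, consider the sketch below. It assumes 0-based storage and that the reward is collected on arrival in the rewarded state, regardless of the previous state and action; the package may build its matrix differently.

```python
import numpy as np

n_states = 16
actions = ["up", "down", "left", "right"]
defined_reward = {1: 1, 4: 10}

# R[s, a, s2] = reward for the transition s --a--> s2 (0-based indices, assumption)
R = np.zeros((n_states, len(actions), n_states))
for state, reward in defined_reward.items():
    R[:, :, state - 1] = reward  # same reward whatever the previous state and action

print(R[4, actions.index("up"), 0])  # reward for moving from state 5 up into state 1
```

Rewards that depend on the action or on the starting state can be expressed in the same array by assigning to individual [s, a, s2] entries instead of whole slices.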

### 4) Agent

An agent must be defined on a grid with a given set of actions and a policy. 

The policy is just a 2-D matrix (states, actions): $\pi$(a|s) = P(agent chooses action a | it is in state s). We can create it simply as follows using a numpy matrix.

In [9]:
import numpy as np
# create the uniform policy 
policy = np.ones( (len(grid.states), len(actions)) ) * 0.25 
policy # each entry (row s , column a ) = P( agent chooses action a| it is in state s)

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]])

In [10]:
# now create the agent.
from gridworld.Agent import Agent

start_state = 1 # define where the agent starts
agent = Agent(grid, actions, policy, start_state = start_state)

The agent has some functions:

In [11]:
agent.agent_copy() # create a new reference with the same attributes

<gridworld.Agent.Agent at 0x7f341585b4e0>

In [12]:
agent.get_state() #returns current state of the agent

1

In [13]:
agent.next_action() # returns the next action the agent will take given the policy
                    # if the policy is a distribution (non-deterministic) then it returns a sample

array(['left'], dtype='<U5')
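Sampling an action from a stochastic policy can be sketched with NumPy as below. The `sample_action` helper and the seeded generator are hypothetical and only show the idea of drawing from one row of the policy matrix; this is not necessarily how `next_action` is implemented.

```python
import numpy as np

actions = ["up", "down", "left", "right"]
policy = np.full((16, len(actions)), 0.25)  # uniform policy over 16 states

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

def sample_action(policy, state, rng):
    """Draw one action for a 1-based state from its row of the policy matrix."""
    return rng.choice(actions, p=policy[state - 1])

action = sample_action(policy, 1, rng)
print(action)
```

For a deterministic policy the row contains a single 1, and the same call always returns that action.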

In [14]:
agent.outcome() # returns one step of the agent in the environment as a tuple (s1, a1, r2, s2)
                # note that the agent's current state changes since it performed that move. Don't forget to reset
                # the agent when necessary

(1, 'up', 1.0, 1)

sample_episode runs episodes of the agent in the environment. An episode can end based on different user specifications:
- the flag "terminal_state = state" tells the episode to end when the agent reaches that state
- the flag "steps_per_episode = N" specifies after how many steps the episode ends automatically
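The two stopping rules can be illustrated with a small standalone loop. This is a hypothetical stand-in, not the package's `sample_episode`: it assumes deterministic moves, a uniform random policy, walls that keep the agent in place, and rewards collected on arrival.

```python
import random

height, width = 4, 4
actions = ["up", "down", "left", "right"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
rewards = {1: 1.0, 4: 10.0}  # reward collected on arrival (assumption)

def step(state, action):
    """Deterministic move on 1-based states; hitting a wall leaves the state unchanged."""
    row, col = divmod(state - 1, width)
    d_row, d_col = moves[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < height and 0 <= new_col < width:
        return new_row * width + new_col + 1
    return state

def run_episode(start_state, terminal_state, steps_per_episode, rng):
    """Hypothetical stand-in for one episode under a uniform random policy."""
    episode, state = [], start_state
    for _ in range(steps_per_episode):   # stop after N steps at the latest
        if state == terminal_state:      # stop early once the terminal state is reached
            break
        action = rng.choice(actions)
        next_state = step(state, action)
        episode.append((state, action, rewards.get(next_state, 0.0), next_state))
        state = next_state
    return episode

episode = run_episode(1, 16, 20, random.Random(0))
```

Each element of `episode` has the same (s1, a1, r2, s2) shape as the tuples returned by the agent, so consecutive steps chain together through their states.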

In [15]:
number_of_episodes = 10

agent.sample_episode(number_of_episodes, terminal_state = 16, steps_per_episode = 20) # returns a list of episodes, each a list of steps
                                                                      # each episode starts at the "start_state" specified before

[[(1, 'down', 0.0, 5),
  (5, 'up', 1.0, 1),
  (1, 'down', 0.0, 5),
  (5, 'right', 0.0, 6),
  (6, 'left', 0.0, 5),
  (5, 'right', 0.0, 6),
  (6, 'left', 0.0, 5),
  (5, 'up', 1.0, 1),
  (1, 'down', 0.0, 2),
  (2, 'left', 0.0, 6),
  (6, 'left', 0.0, 5),
  (5, 'up', 1.0, 1),
  (1, 'up', 1.0, 1),
  (1, 'up', 1.0, 1),
  (1, 'right', 0.0, 2),
  (2, 'down', 0.0, 6),
  (6, 'right', 0.0, 7),
  (7, 'down', 0.0, 11),
  (11, 'left', 0.0, 10),
  (10, 'left', 0.0, 9)],
 [(1, 'up', 1.0, 1),
  (1, 'left', 1.0, 1),
  (1, 'down', 0.0, 5),
  (5, 'left', 0.0, 5),
  (5, 'up', 1.0, 1),
  (1, 'up', 1.0, 1),
  (1, 'up', 1.0, 1),
  (1, 'down', 0.0, 5),
  (5, 'left', 0.0, 5),
  (5, 'right', 0.0, 6),
  (6, 'up', 0.0, 7),
  (7, 'right', 0.0, 6),
  (6, 'right', 0.0, 7),
  (7, 'down', 0.0, 6),
  (6, 'right', 0.0, 7),
  (7, 'down', 0.0, 11),
  (11, 'up', 0.0, 15),
  (15, 'up', 0.0, 14),
  (14, 'down', 0.0, 14),
  (14, 'right', 0.0, 15)],
 [(1, 'up', 1.0, 1),
  (1, 'down', 0.0, 5),
  (5, 'right', 0.0, 6),
  (6, 'left'