# Gridworld
## Introduction
In this notebook, we will implement the Gridworld described in the book *Reinforcement Learning: An Introduction* [Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press Ltd.].
### Gridworld
A gridworld is a 2D rectangular grid of size (Ny, Nx) with an agent starting off at one grid square and trying to move to another grid square located elsewhere. Each cell in the grid is considered a state and a grid can contain starting states, terminal states, and non-terminal states.
### Our Gridworld
In this notebook, we will reference the gridworld from **Example 4.1** of the book.

![image-2.png](attachment:image-2.png)

In this example, there are 2 terminal states (shaded), 14 non-terminal states labeled 1-14, and no designated starting states. In each state, the agent can take one of 4 actions (up, down, left, or right) and receives a reward of -1 on every transition. In other words,

$$
\begin{aligned}
\text{State space: } & S = \{1, 2, \dots, 14\} \\
\text{Action space: } & A = \{\text{up}, \text{down}, \text{left}, \text{right}\} \\
\text{Reward: } & R_t = -1 \text{ on all transitions}
\end{aligned}
$$
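
As a rough sketch, the spaces above can be written down directly in Python. The names `N`, `TERMINALS`, `STATES`, `ACTIONS`, and `REWARD`, as well as the (row, column) displacement encoding of the actions, are assumptions made for illustration and are not fixed by the book.

```python
# Illustrative encoding of Example 4.1 (all names here are hypothetical).
N = 4  # the grid is N x N

# Cells are numbered row by row; cells 0 and 15 are the two terminal corners.
TERMINALS = {0, N * N - 1}

# The remaining cells are the non-terminal states 1..14, matching S in the text.
STATES = [s for s in range(N * N) if s not in TERMINALS]

# Each action is a (row, column) displacement; row 0 is the top of the grid.
ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

REWARD = -1  # received on every transition

print(STATES)        # the 14 non-terminal states, 1 through 14
print(len(ACTIONS))  # 4
```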

Lastly, it is also given that the agent follows an equiprobable random policy, meaning that all actions are equally likely to be taken in each state.

$$
\pi(a \mid s) = \frac{1}{4} = 0.25 \quad \text{for all } s \in S,\ a \in A
$$
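
With four actions, an equiprobable policy assigns each one probability 1/4. The sketch below shows one possible way to represent and sample from such a policy; the function names `equiprobable_policy` and `sample_action` are hypothetical and only serve this illustration.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def equiprobable_policy(state):
    """Return pi(a | state) as a dict; the distribution is the same in every state."""
    p = 1.0 / len(ACTIONS)
    return {a: p for a in ACTIONS}

def sample_action(state, rng=random):
    """Draw one action according to the policy's probabilities."""
    probs = equiprobable_policy(state)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

probs = equiprobable_policy(5)
print(probs["up"])          # 0.25
print(sum(probs.values()))  # 1.0
print(sample_action(5))     # one of the four actions
```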

## Agent-Environment Interface
Recall that in reinforcement learning, the problem to solve is formulated as a Markov Decision Process (MDP). The learner and decision maker is called the *agent*, while the thing it interacts with, comprising everything outside the agent, is called the *environment*. The two interact continually: the agent selects actions, and the environment responds to those actions and presents new situations to the agent. The environment also gives rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

![image.png](attachment:image.png)
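
The interaction loop described above can be sketched in a few lines. `CorridorEnv` is a hypothetical toy environment invented for this illustration (it is not the gridworld), and `run_episode` is an assumed helper name.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to cell `length - 1`."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        return self.position, -1, done


def run_episode(env, policy, max_steps=1000):
    """Agent-environment loop: observe the state, act, receive reward and next state."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1

def random_policy(state):
    return random.choice([-1, 1])

print(run_episode(CorridorEnv(), always_right))   # -4
print(run_episode(CorridorEnv(), random_policy))  # varies, never better than -4
```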

## Environment
As mentioned, the environment is what the agent interacts with: after the agent selects an action, the environment presents a new situation and gives a reward. Let's begin developing our environment.

*Note: From here on, we use [this Medium article](https://medium.com/analytics-vidhya/a-simple-reinforcement-learning-environment-from-scratch-72c37bb44843) and the [OpenAI Gym API reference](https://www.gymlibrary.dev/api/core/#gym-env) as references.*

### RL Environment Fundamentals
The following are necessities for an RL Environment:
1. State/Observation set of the environment
2. Reward/Penalty from environment to agent
3. Action set for agent, within boundaries of the environment

### Environment Class
To cover the fundamentals above, our environment needs the following methods:
#### step()
Run one timestep of the environment's dynamics.

<u>Inputs</u><br>
*action* - An action provided by the agent.

<u>Outputs</u><br>
*observation* - An element of the environment's *observation_space*.<br>
*reward* - Amount of reward returned as a result of taking the action.<br>
*terminated* - Whether a *terminal state* is reached. Further *step()* calls could return undefined results.<br>
*truncated* - Whether a truncation condition outside the scope of the MDP is satisfied. Usually a time limit, but it could also mean the agent going out of bounds.<br>
*info* - Contains auxiliary diagnostic information, e.g. metrics describing the agent's performance, variables that are hidden from observations, etc.<br>
#### reset()
Resets the environment to an initial state and returns the initial observation. Typically called after an entire episode.

<u>Outputs</u><br>
*observation* - Observation of the initial state. An element of the *observation_space*.<br>
*info* - Contains information complementing *observation*. Same as *info* in *step()*.
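
To make the return values of *step()* and *reset()* concrete, here is a minimal sketch of the interface. `CounterEnv` is a hypothetical environment invented for this illustration; it is not part of Gym and is not the gridworld.

```python
class CounterEnv:
    """Hypothetical environment: the observation is a counter the agent may increment."""

    def __init__(self, goal=3, max_steps=5):
        self.goal = goal
        self.max_steps = max_steps
        self.count = 0
        self.steps = 0

    def reset(self):
        # Return the initial observation together with an info dict
        self.count = 0
        self.steps = 0
        return self.count, {"steps": self.steps}

    def step(self, action):
        # action is 0 (wait) or 1 (increment)
        self.count += action
        self.steps += 1
        terminated = self.count >= self.goal  # terminal state of the MDP
        truncated = (not terminated) and self.steps >= self.max_steps  # time limit
        reward = -1
        return self.count, reward, terminated, truncated, {"steps": self.steps}


env = CounterEnv()
observation, info = env.reset()
done = False
while not done:
    observation, reward, terminated, truncated, info = env.step(0)  # always wait
    done = terminated or truncated
print(terminated, truncated, info["steps"])  # False True 5
```

Because the agent here never increments the counter, the episode ends through the time limit (*truncated*) rather than by reaching the goal (*terminated*).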

*Note: In OpenAI Gym's implementation, a `Space` class is used to define observation_space and action_space. Spaces define the format of valid actions and observations and serve several purposes:*
- They clearly define how to interact with environments, i.e. they specify what actions need to look like and what observations will look like
- They allow us to work with highly structured data (e.g. in the form of elements of Dict spaces) and painlessly transform them into flat arrays that can be used in learning code
- They provide a method to sample random elements. This is especially useful for exploration and debugging.

*In this notebook, we will NOT be implementing this class and will treat all actions and observations as discrete values.*
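
One possible way to treat observations and actions as plain discrete values, without any space class, is sketched below. The helpers `to_index`, `to_cell`, and `sample_action`, and the ordering of the action indices, are assumptions made for this illustration only.

```python
import random

N = 4          # grid side length, as in Example 4.1
N_ACTIONS = 4  # 0=up, 1=down, 2=left, 3=right (an arbitrary ordering)

def to_index(row, col, n=N):
    """Flatten a (row, col) cell into a single integer, numbering cells row by row."""
    return row * n + col

def to_cell(index, n=N):
    """Inverse of to_index: recover (row, col) from the integer."""
    return divmod(index, n)

def sample_action(rng=random):
    """Pick a random action index, a stand-in for sampling from an action space."""
    return rng.randrange(N_ACTIONS)

print(to_index(1, 2))   # 6
print(to_cell(14))      # (3, 2)
print(sample_action())  # an integer from 0 to 3
```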

## Environment - Implementation
Let's implement our environment.

In [27]:
import numpy as np

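class gridworld_env:
    def __init__(self, size):
        self.size = size      # the grid is size x size
        self.state = [0, 0]   # [row, column] of the agent
        
    def step(self, action):
        # Terminal states (top-left and bottom-right corners): no move, zero reward
        if self.state == [0, 0] or self.state == [self.size-1, self.size-1]:
            return self.state, 0, True
        
        # Non-terminal state: work on a copy so previously returned states are not mutated
        new_state = self.state.copy()
        
        # Check boundaries: a move that would leave the grid keeps that coordinate unchanged
        if 0 <= (new_state[0] + action[0]) <= (self.size-1):
            new_state[0] += action[0]
        if 0 <= (new_state[1] + action[1]) <= (self.size-1):
            new_state[1] += action[1]

        self.state = new_state
        return self.state, -1, False
    
    def reset(self):
        # Note: [0, 0] is itself one of the terminal states
        self.state = [0, 0]
        return self.state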

In [28]:
env = gridworld_env(4)

[0, 0]
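
As a usage sketch, the snippet below rolls out one episode under the equiprobable random policy. So that it runs on its own, it restates the boundary rule of `step()` in a standalone helper instead of using the class. The names `MOVES`, `is_terminal`, `move`, and `rollout`, the start cell `[1, 1]`, and the step cap are all assumptions made for illustration.

```python
import random

SIZE = 4
# up, down, left, right as [row, column] offsets
MOVES = [[-1, 0], [1, 0], [0, -1], [0, 1]]

def is_terminal(state, size=SIZE):
    return state == [0, 0] or state == [size - 1, size - 1]

def move(state, action, size=SIZE):
    """Same boundary rule as step(): an off-grid move leaves that coordinate unchanged."""
    new_state = state.copy()
    for axis in (0, 1):
        if 0 <= new_state[axis] + action[axis] <= size - 1:
            new_state[axis] += action[axis]
    return new_state

def rollout(start, rng, max_steps=10_000):
    """Follow the equiprobable random policy until a terminal state or the step cap."""
    state, total_reward = start.copy(), 0
    for _ in range(max_steps):
        if is_terminal(state):
            break
        state = move(state, rng.choice(MOVES))
        total_reward += -1  # reward of -1 on every transition
    return state, total_reward

rng = random.Random(0)
final_state, episode_return = rollout([1, 1], rng)
print(final_state, episode_return)
```

The return of an episode is minus the number of transitions taken, so starting from `[1, 1]` it can never be higher than -2.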
