<a href="https://colab.research.google.com/github/LomaxOnTheRun/policy-gradient-methods/blob/master/policy_gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Policy Gradient

This is an overview of a basic policy gradient method, with a view to expanding to actor-critic and PPO methods in later notebooks.

## Environment

To test our algorithm I'm going to use a basic gridworld with a fixed layout, a fixed starting state (`s0`), a positive terminal state (`+1`), and a negative terminal state (`-1`). The outer border cannot be crossed and the filled position (`######`) cannot be entered; an attempt to do either simply leaves the agent where it was.

A visual representation of this gridworld is given below:

```
+------+------+------+------+
|      |      |      |  +1  |
+------+------+------+------+
|      |######|      |  -1  |
+------+------+------+------+
|  s0  |      |      |      |
+------+------+------+------+
```

In [0]:
class State:
    """A single grid cell, identified by its (x, y) position and its reward."""

    def __init__(self, x, y, reward):
        self.x = x
        self.y = y
        self.reward = reward

    def is_terminal(self):
        # Only the +1 and -1 cells carry a non-zero reward
        return self.reward != 0

def create_gridworld():
    """Return a dict of (x, y) coords to State object."""
    gridworld = {}
    for x in range(4):
        for y in range(3):
            # Simply ignore the block
            if x == 1 and y == 1:
                continue
            # y = 0 is the top row, so (3, 0) is `+1` and (3, 1) is `-1`
            reward = 0
            if x == 3 and y == 0: reward = 1
            if x == 3 and y == 1: reward = -1
            gridworld[(x, y)] = State(x, y, reward)
    return gridworld

def get_new_state(gridworld, state, direction):
    """Return the new state, after trying to move in a direction, (dx, dy)."""
    new_x = state.x + direction[0]
    new_y = state.y + direction[1]
    # Missing keys are the block or out of bounds, so the agent stays put
    new_state = gridworld.get((new_x, new_y), state)
    return new_state
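
To get a feel for the movement rule, here is a rough sketch of a uniformly random walk from `s0`. It restates the grid compactly as a dict of rewards so that it runs on its own. The names `ACTIONS`, `REWARDS`, `step` and `random_rollout` are made up for this illustration, as is the `max_steps` cap; none of them come from the cells above.

```python
import random

# Hypothetical action set: (dx, dy) offsets, with y increasing downward
# to match the diagram (y = 0 is the top row).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

# Compact restatement of the grid: (x, y) -> reward, block at (1, 1) omitted.
REWARDS = {(x, y): 0 for x in range(4) for y in range(3) if (x, y) != (1, 1)}
REWARDS[(3, 0)] = 1
REWARDS[(3, 1)] = -1

def step(pos, direction):
    """Move by (dx, dy); stay put if the target is the block or off the grid."""
    target = (pos[0] + direction[0], pos[1] + direction[1])
    return target if target in REWARDS else pos

def random_rollout(start=(0, 2), max_steps=1000, seed=0):
    """Walk uniformly at random until a terminal state or max_steps is hit."""
    rng = random.Random(seed)
    pos, path = start, [start]
    for _ in range(max_steps):
        if REWARDS[pos] != 0:  # terminal states are the non-zero rewards
            break
        pos = step(pos, ACTIONS[rng.choice(list(ACTIONS))])
        path.append(pos)
    return path, REWARDS[pos]

path, reward = random_rollout()
print(len(path) - 1, "steps, final reward", reward)
```

A random walk like this is the baseline a learned policy should beat: it reaches `+1` only by chance and often wanders into `-1` first.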

## Policy gradient

Policy gradient methods attempt to learn a policy directly, without needing to learn either an underlying model of the environment or the values of states, either of which could otherwise be used to construct a policy (e.g. by moving toward the next reachable state with the highest value).

A policy depends on the current state, $s$, and is described by a set of parameters, $\theta$. In a simple case each parameter can correspond to a single state, but in more complex cases the parameters could be the weights of a neural network.
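
As a minimal sketch of what a parameterised policy can look like in the simple tabular case, the code below stores one preference per state-action pair in $\theta$ and turns the preferences for the current state $s$ into action probabilities $\pi(a \mid s, \theta)$ with a softmax. The softmax choice, the one-preference-per-action layout, and the names `ACTIONS`, `make_theta`, `policy` and `sample_action` are all assumptions made for illustration, not something fixed by the text above.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action labels

def softmax(prefs):
    """Convert a list of preferences into probabilities that sum to 1."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def make_theta(states):
    """One preference per (state, action) pair, all initialised to zero."""
    return {s: [0.0] * len(ACTIONS) for s in states}

def policy(theta, state):
    """Return pi(a | s, theta) for every action, as a list of probabilities."""
    return softmax(theta[state])

def sample_action(theta, state, rng=random):
    """Draw an action index according to the policy's probabilities."""
    probs = policy(theta, state)
    return rng.choices(range(len(ACTIONS)), weights=probs)[0]

theta = make_theta([(0, 2)])
print(policy(theta, (0, 2)))  # zero preferences give a uniform policy
```

With all preferences at zero the policy is uniform; raising one entry of `theta[state]` shifts probability toward that action, which is the knob a policy gradient update would turn.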

From [Sutton and Barto](http://incompleteideas.net/book/RLbook2020.pdf) we get the 

In [2]:
import gym

env = gym.make('FrozenLake8x8-v0')

print(env.nS, env.nA)

64 4
