In [88]:
from my_gridworld import my_gridworld
test = my_gridworld()
test.qlearn(gamma = 0.5)

q-learning process complete


In [89]:
test.color_grid()

In [90]:
starting_location = [0,2]
test.animate_movement(starting_location)

In [84]:
test.show_qmat()

          up      down     left     right    still
00    -1.000    -1.000    7.812     1.953    3.906
01    -1.000     3.906    3.906     3.906    1.953
02    -1.000     1.953   -1.000     7.812    3.906
03    -1.000     3.906   15.625    15.625    7.812
04    -1.000     7.812   31.250    31.250   15.625
05    -1.000    15.625   62.500    -1.000   31.250
10     3.906    -1.000   15.625     3.906    7.812
11     1.953     7.812   -1.000    -1.000    3.906
13     7.812    -1.000   -1.000    31.250   15.625
14    15.625    15.625   62.500    62.500   31.250
15    31.250    31.250  125.000    -1.000   62.500
20     7.812    -1.000   31.250    -1.000   15.625
22    -1.000    -1.000  125.000    -1.000   62.500
24    31.250    -1.000   -1.000   125.000   62.500
25    62.500    62.500  250.000    -1.000  125.000
30    15.625    -1.000   62.500    62.500   31.250
31    -1.000    31.250  125.000   125.000   62.500
32    62.500    62.500  250.000   250.000  125.000
33    -1.000   125.000  500.000

Before looking in detail at a classic control problem - cart pole - let's look at a simpler problem typically called 'grid world'.  

This is a simplified version of the shortest path problem, which comes up in video games (this is how enemy AI units find you in the game), mapping services (finding the shortest route from A to B), and robot path planning (e.g., for a cleaning robot).  The task is to get a user from its starting location to a desired destination in as few steps as possible.    

There are many algorithms specifically designed to solve just this task - the most popular being [Dijkstra’s and A* algorithms](http://www.redblobgames.com/pathfinding/a-star/introduction.html).  However, here we will use the flexible RL framework: it also gives good results, and the (relative) simplicity of the problem lets us illustrate the more general RL learning process clearly.

In this toy problem we have a grid like the one illustrated below.  Each square tile is a location in the world.  Here the black square denotes the user, the green square the desired destination, and the blue squares impenetrable obstacles (squares the user cannot move to).  

The actions available to our user are to move one unit left, right, up, or down, or stay still, and it can move to any free square (here colored magenta) or the goal (colored green).  If the user ever moves to the goal square the game is over.

## Reinforcement learning: components

How can we train our agent to move the user square efficiently to the goal, regardless of the user's initial location?  In other words, how can we teach the agent the right action to take at each state (here state = location on the grid)?  Let's go over this at a high level, in 3 steps.

Remember the sort of information the agent deals with, and the control (actions) it has over the user.  


- The agent is aware of the current **state** of the environment - what the agent 'sees', the information it receives, at each step.  In *grid world* this is just the location of the user square.  Generally speaking, we decide what type of information constitutes the state, and in practice this depends on what information we think is reasonably available to give to our agent.  


- The agent can then take an **action** - the set of actions is typically determined by the problem environment.  In *grid world*, for example, the agent can move the user square **one unit** up/down/left/right or keep it still.  As another example, in an autopilot control problem the range of actions is defined entirely by the available range of motions of the machine being controlled.


- Once this action is taken the agent receives a **reward** - based on where the action took the user square (the new state of the user).  We decide what the rewards look like (not the agent itself), which is how we communicate our goal to the agent.  We want the reward to be **larger** for actions that get us closer to accomplishing our goal, and smaller for actions that do not.  In *grid world* we assign a negative value like -1 to all actions (one unit movements) that lead to a non-goal state, and a large positive number like 1000 to actions leading to the goal state itself.  
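
The three components above can be sketched in a few lines of Python. This is only an illustration: the grid size, the obstacle and goal locations, and the names `ACTIONS` and `step` are assumptions made for this sketch, not the interface of `my_gridworld`. It also assumes that an illegal move (off the grid or into an obstacle) simply leaves the user where it is. The rewards are the -1 and 1000 values mentioned above.

```python
# Hypothetical 4x4 grid world; a state is the (row, col) location of the user.
HEIGHT, WIDTH = 4, 4
OBSTACLES = {(1, 2), (2, 1)}   # assumed obstacle tiles
GOAL = (3, 3)                  # assumed goal tile

# Each action is a (row change, col change) pair.
ACTIONS = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "still": (0, 0),
}

def step(state, action):
    """Apply an action at a state and return (new_state, reward)."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    # Assumption: moves off the grid or into an obstacle leave the user in place.
    if not (0 <= r < HEIGHT and 0 <= c < WIDTH) or (r, c) in OBSTACLES:
        r, c = state
    new_state = (r, c)
    # Large reward for reaching the goal, small penalty for every other move.
    reward = 1000 if new_state == GOAL else -1
    return new_state, reward
```

Calling `step((3, 2), "right")` moves the user onto the goal and returns the large reward, while any other legal move costs one unit.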

## How the components fit together

Note that the agent's interaction with the environment consists of a sequence of steps -


**step 1:** start at state 0 $s_0$ --> take action 1 $a_1$ --> move to state 1 $s_1$ + receive reward 1 $r_1$

**step 2:** start at state 1 $s_1$ --> take action 2 $a_2$ --> move to state 2 $s_2$ + receive reward 2 $r_2$

**step 3:** start at state 2 $s_2$ --> take action 3 $a_3$ --> move to state 3 $s_3$ + receive reward 3 $r_3$

**step 4:** ...

or in short the first three steps look like

($s_0$, $a_1$, $r_1$), ($s_1$, $a_2$, $r_2$), ($s_2$, $a_3$, $r_3$), ($s_3$,...

and once we tune the agent correctly we want it to choose the best action at each step of this process, to *maximize* its reward.

In other words we want to teach our agent the best *function* from states --> actions, whose chosen actions maximize the rewards the agent receives.

## Training the agent by Q learning

How do we do this - train the agent to take the proper actions at each state to **maximize its total reward**?  Calling $r_k = r(s_{k-1},a_k,s_k)$ the reward received at the $k^{th}$ step for taking action $a_k$ and moving from state $s_{k-1}$ to $s_k$, mathematically we want to maximize the sum of rewards

$Q(s_0,a_1) = r_1 + r_2 + r_3 + ... $


This is a cost function that - in the case of grid world - we would like to be maximal for every possible initial state $s_0$.  Because the number of states and actions is finite, $Q$ is just a matrix:

$Q$ is a (number of states) x (number of actions) dimensional matrix
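
As a small illustration of this layout, the sketch below stores $Q$ as a NumPy array with one row per state and one column per action. The 2 x 2 grid and the string state labels are assumptions made for the example; the action order mirrors the five actions described above.

```python
import numpy as np

# Hypothetical 2x2 grid with no obstacles; states are labelled "rowcol".
states = ["00", "01", "10", "11"]
actions = ["up", "down", "left", "right", "still"]

# One row per state, one column per action, all entries starting at zero.
Q = np.zeros((len(states), len(actions)))

# Look up (or set) the entry for a given state/action pair by index.
s = states.index("10")
a = actions.index("right")
Q[s, a] = 1000.0   # e.g. moving right from "10" reaches an assumed goal at "11"

print(Q.shape)
```

Here `Q.shape` is (number of states, number of actions), and `Q[s, a]` plays the role of $Q(s,a)$.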

We cannot expect to directly apply the conventional tools of nonlinear optimization (e.g., stochastic gradient descent) to maximize $Q(s_0,a_1)$ because, for example, both inputs are discrete.

### Another common Q-learning cost function

Another common cost function used to maximize these rewards dampens later rewards using a parameter $\gamma \in [0,1]$ (usually called the discount factor) to lessen the effect of future state rewards. 

$Q(s_0,a_1) = r_1 + \gamma^1r_2 + \gamma^2r_3 + ... $

By scaling $\gamma$ up and down we can adjust the contribution of future rewards.  For example

- When we set $\gamma = 0$ only the first reward $r_1$ remains.   Maximizing $Q$ then means we maximize the reward given by the first step of the process, for all states.  Our agent learns to take a 'greedy' approach to accomplishing our goal, at each state taking the action that maximizes the next step reward only.


- When we set $\gamma = 1$ all the $\gamma$'s disappear and we have our original cost function.  
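
To make the effect of $\gamma$ concrete, here is a minimal sketch that evaluates the discounted sum for one fixed reward sequence. The function name `discounted_return` and the three-step reward sequence (two -1 moves followed by reaching the goal) are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a list of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Hypothetical episode: two ordinary moves, then a move onto the goal.
rewards = [-1, -1, 1000]

for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With $\gamma = 0$ only the first reward survives, with $\gamma = 1$ the result is the plain sum, and intermediate values shrink the contribution of the goal reward the further away it is.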

## Approximate maximization via recursion 

How can we maximize

$Q(s_0,a_1) = r_1 + r_2 + r_3 + ... $

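in practice?  Call this maximum (that is, the largest sum of rewards obtainable, starting with the reward $r_1$ received for making decision $a_1$ at $s_0$) $Q^*(s_0,a_1)$.  The trick here is that this definition is *recursive*: after the first step, the best we can do is take the best available action from the new state $s_1$, that is

$Q^*(s_0,a_1) = r_1 + \underset{a}{\operatorname{max}}\, Q^*(s_1,a)$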

This recursive version is sometimes called *Bellman's equation*. 
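
To see where the recursion comes from, group every reward after $r_1$.  Once the first step is taken, the largest value the remaining sum $r_2 + r_3 + ...$ can reach is the best $Q^*$ value available at the new state $s_1$, so

$Q^*(s_0,a_1) = r_1 + \underset{a_2, a_3, ...}{\operatorname{max}} \left(r_2 + r_3 + ...\right) = r_1 + \underset{a}{\operatorname{max}}\, Q^*(s_1,a)$

The same grouping applied to the discounted cost function pulls out a single factor of $\gamma$

$Q^*(s_0,a_1) = r_1 + \gamma \, \underset{a_2, a_3, ...}{\operatorname{max}} \left(r_2 + \gamma r_3 + ...\right) = r_1 + \gamma \, \underset{a}{\operatorname{max}}\, Q^*(s_1,a)$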

While this recursive definition doesn't help us to optimize $Q^*$ *directly*, we can use it to apply a heuristic in order to *approximately* maximize this quantity.  In short: we run through the gamut of possible initial states and update $Q^*$ by trial-and-error interactions with the environment.  We run a (large number of) simulations - called *episodes* in RL jargon - which en masse help carve out the right set of actions for our agent to take.  

Here's how we do it - we first initialize $Q^*$ (e.g., to all zeros), then run over a large sequence of episodes where we allow our agent to run the user around towards the goal from various initial states, updating $Q^*$ as we go along.  For each episode we:

1.  Select a random initial state $s_0$
2.  While the goal state has not been reached we 
    -  select a random action $a_{k+1}$ at our current state $s_k$, and observe the state $s_{k+1}$ it leads to
    -  update our approximation 
    
    $Q^*(s_k,a_{k+1}) = r(s_k,a_{k+1},s_{k+1}) + \underset{a}{\operatorname{max}}\, Q^*(s_{k+1},a)$
    
    -  update state $s_k$ --> $s_{k+1}$ given action $a_{k+1}$
    
Eventually, as the episodes cycle through the entire set of initial states, this will converge to the true value of $Q^*$.
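
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind `my_gridworld.qlearn`: the 3 x 3 grid, the obstacle and goal locations, the helper names, and the choice to leave the user in place on an illegal move are all made up for the example. The update uses the discounted form of the recursion with a `gamma` argument, and $Q$ is initialized once, before the first episode.

```python
import numpy as np

# Hypothetical 3x3 grid world with one obstacle and the goal in a corner.
HEIGHT, WIDTH = 3, 3
OBSTACLES = {(1, 1)}
GOAL = (2, 2)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, still

states = [(i, j) for i in range(HEIGHT) for j in range(WIDTH)
          if (i, j) not in OBSTACLES]
index = {s: n for n, s in enumerate(states)}        # state -> row of Q

def step(state, a):
    """Return (new_state, reward) for taking action number a at state."""
    r, c = state[0] + MOVES[a][0], state[1] + MOVES[a][1]
    if (r, c) not in index:          # off-grid or obstacle: stay put (assumption)
        r, c = state
    return (r, c), (1000.0 if (r, c) == GOAL else -1.0)

def qlearn(gamma=0.5, episodes=1000, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((len(states), len(MOVES)))          # initialize once
    for _ in range(episodes):
        state = states[rng.integers(len(states))]    # random initial state
        for _ in range(max_steps):
            if state == GOAL:                        # episode ends at the goal
                break
            a = rng.integers(len(MOVES))             # random action
            new_state, reward = step(state, a)
            # Recursive update: reward now plus the best value at the new state.
            Q[index[state], a] = reward + gamma * Q[index[new_state]].max()
            state = new_state
    return Q

Q = qlearn()
print(np.round(Q, 3))
```

Because the goal row of `Q` is never updated it stays at zero, so a move onto the goal is worth exactly the goal reward, and the values shrink by roughly a factor of `gamma` with each additional step away from it.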

### How does the agent move intelligently given a learned Q?

Given a properly learned $Q^*$, the agent moves intelligently at a given state $s_k$ by taking the action that maximizes $Q^*$ there, i.e., 

$a_{k+1} = \underset{a}{\operatorname{argmax}}\, Q^*(s_k,a) $
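
In code, this rule is a row-wise argmax over the learned matrix. The sketch below uses a small hand-written $Q$ for a hypothetical three-state corridor (goal at the right end); the numbers and the helper name `greedy_action` are assumptions made for illustration, not output of the notebook.

```python
import numpy as np

actions = ["up", "down", "left", "right", "still"]

# Hypothetical learned Q for a 1x3 corridor: states 0, 1, 2 with the goal at 2.
Q = np.array([
    [248.5, 248.5, 248.5,  499.0, 248.5],   # state 0
    [499.0, 499.0, 248.5, 1000.0, 499.0],   # state 1
    [  0.0,   0.0,   0.0,    0.0,   0.0],   # state 2 (goal, never updated)
])

def greedy_action(Q, s):
    """Return the name of the action with the largest Q value at state s."""
    return actions[int(np.argmax(Q[s]))]

print(greedy_action(Q, 0), greedy_action(Q, 1))
```

Note that `np.argmax` returns the first maximizer when several actions tie, so ties are broken by column order.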



In [None]:
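import gym

env = gym.make('CartPole-v0')
env.reset()                              # put the cart and pole in a starting configuration
for _ in range(1000):
    env.render()                         # draw the current frame
    env.step(env.action_space.sample())  # take a random action
env.close()                              # close the render window when finished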

In [116]:
from ipythonblocks import BlockGrid
grid = BlockGrid(5, 5, fill=(234, 123, 234))  # 5x5 grid, every block filled magenta


In [162]:
obstacles = [[1,2],[3,4],[2,3]]  # impenetrable obstacle locations

states = []
for i in range(grid.height):
    for j in range(grid.width):
        block = [i,j]
        if block not in obstacles:
            states.append(str(i) + str(j))
        
# find state-index
state_ind = states.index((str(3) + str(2)))

In [179]:
ind = int(np.flatnonzero(np.asarray(states) == (str(3) + str(2)))[0])  # NumPy equivalent of states.index('32')
ind

15

In [183]:
len(states)

22