TD learning combines ideas from dynamic programming (DP) and Monte Carlo (MC) methods.  DP requires a full model of the environment and never learns from experience.  MC needs no model and learns from experience, but it can only update after an episode has finished.  DP, in contrast, bootstraps: it updates estimates from other estimates.  TD learns from experience like MC and bootstraps like DP, so it is fully online and can update during an episode, which matters for continuing tasks that have no episodes.

We first look at prediction and then at control.  There are two control methods:  SARSA and Q-learning.  We will apply TD(0), a special case of $TD(\lambda)$.
The general method for estimating an average is to use a moving average with exponential decay: $$
V_{k+1}(S_t)=V_k(S_t)+\alpha \left(G(t)-V_k(S_t) \right)
$$  Here $G(t)$ is the sampled return from time $t$ and $\alpha$ is the step size.  Further, 
$$
V(s):=E \left[G(t) | S_t =s \right]=E \left[R(t+1)+\gamma V(S_{t+1}) | S_t =s \right]
$$
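The moving-average update for $V_{k+1}(S_t)$ can be sketched in a few lines of Python. This is a minimal illustration, not part of the notebook's code: the names `running_average` and `decay_weights` are hypothetical helpers, and the sample values are made up. It shows that a constant step size $\alpha$ weights older samples by powers of $(1-\alpha)$.

```python
def running_average(samples, alpha, initial=0.0):
    """Constant-step-size average: v <- v + alpha * (x - v)."""
    v = initial
    for x in samples:
        v = v + alpha * (x - v)
    return v


def decay_weights(n, alpha):
    """Weight each of n samples (oldest first) receives in the final estimate."""
    return [alpha * (1 - alpha) ** (n - 1 - k) for k in range(n)]


samples = [1.0, 2.0, 3.0]  # illustrative sampled returns
estimate = running_average(samples, alpha=0.5)
weights = decay_weights(len(samples), alpha=0.5)
# With initial=0.0, the estimate equals the weighted sum of the samples.
weighted_sum = sum(w * x for w, x in zip(weights, samples))
print(estimate, weights, weighted_sum)
```

The most recent sample gets weight $\alpha$, the one before it $\alpha(1-\alpha)$, and so on, which is why the estimate can track a target that changes over time.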
If we replace the return $G(t)$ in the moving average with the one-step target $r+\gamma V_k(s^\prime)$ from the recursive definition, we obtain:
$$
V_{k+1}(s) = V_k(s)+ \alpha \left[r+\gamma V_k(s^\prime) -V_k(s) \right]
$$
This is online because we can update as soon as we observe the reward $r$ and the next state $s^\prime$, without waiting for the full return $G(t)$.
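A single TD(0) update can be written as a small function. This is a sketch under assumptions: `td0_update` is a hypothetical helper, the state labels `"A"` and `"B"` are invented for the example, and the defaults for `alpha` and `gamma` simply mirror the `ALPHA` and `GAMMA` constants used in the code cells.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the value table V in place.

    Returns the TD error r + gamma * V(s') - V(s), computed before the update.
    States missing from V are treated as having value 0.
    """
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


# Illustrative two-state example: move from "A" to "B" and receive reward 0.5.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
print(delta, V["A"])
```

Here the target is $0.5 + 0.9 \cdot 1.0 = 1.4$, so the TD error is $1.4$ and $V(A)$ moves a tenth of the way towards the target, to $0.14$. Only the single transition $(s, r, s^\prime)$ is needed.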

In MC, randomness arises because an episode can play out in different ways (stochastic policy or stochastic transitions).  In TD(0) we additionally estimate from other estimates: $r+\gamma V(s^\prime)$ is itself only an estimate of $G(t)$ (bootstrapping).
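To make the difference concrete, the sketch below computes both kinds of targets for one episode. It assumes the episode is stored as a list of `(state, reward)` pairs in which each reward is the one received on arriving at that state, with a placeholder reward of 0 for the start state. The helper names `mc_returns` and `td_targets` and the sample episode are hypothetical.

```python
def mc_returns(state_reward, gamma=0.9):
    """Return [(state, G)] for every non-terminal step, computed backwards."""
    G = 0.0
    out = []
    for t in range(len(state_reward) - 2, -1, -1):
        s, _ = state_reward[t]
        _, r_next = state_reward[t + 1]  # reward received on leaving s
        G = r_next + gamma * G
        out.append((s, G))
    out.reverse()
    return out


def td_targets(state_reward, V, gamma=0.9):
    """Return [(state, r + gamma * V(s'))] for every non-terminal step."""
    out = []
    for t in range(len(state_reward) - 1):
        s, _ = state_reward[t]
        s_next, r_next = state_reward[t + 1]
        out.append((s, r_next + gamma * V.get(s_next, 0.0)))
    return out


# Illustrative episode: walk up the left column, then right along the top row.
episode = [((2, 0), 0), ((1, 0), 0), ((0, 0), 0),
           ((0, 1), 0), ((0, 2), 0), ((0, 3), 1)]
mc = mc_returns(episode)
td = td_targets(episode, V={})  # all value estimates start at 0
print(mc)
print(td)
```

The MC targets need the whole episode before any of them can be computed, while each TD target uses only one transition plus the current estimate of the next state. With all estimates at zero, only the last non-terminal state gets a non-zero TD target, and information then propagates backwards over later episodes.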

In [1]:
import numpy as np


class Grid: # Environment
    def __init__(self, width, height, start):
        self.width = width
        self.height = height
        self.i = start[0]
        self.j = start[1]

    def set(self, rewards, actions):
        # rewards should be a dict of: (i, j): r (row, col): reward
        # actions should be a dict of: (i, j): A (row, col): list of possible actions
        self.rewards = rewards
        self.actions = actions

    def set_state(self, s):
        self.i = s[0]
        self.j = s[1]

    def current_state(self):
        return (self.i, self.j)

    def is_terminal(self, s):
        return s not in self.actions

    def move(self, action):
        # check if legal move first
        if action in self.actions[(self.i, self.j)]:
            if action == 'U':
                self.i -= 1
            elif action == 'D':
                self.i += 1
            elif action == 'R':
                self.j += 1
            elif action == 'L':
                self.j -= 1
        # return a reward (if any)
        return self.rewards.get((self.i, self.j), 0)

    def undo_move(self, action):
        # these are the opposite of what U/D/L/R should normally do
        if action == 'U':
            self.i += 1
        elif action == 'D':
            self.i -= 1
        elif action == 'R':
            self.j -= 1
        elif action == 'L':
            self.j += 1
        # raise an exception if we arrive somewhere we shouldn't be
        # should never happen
        assert(self.current_state() in self.all_states())

    def game_over(self):
        # returns true if game is over, else false
        # true if we are in a state where no actions are possible
        return (self.i, self.j) not in self.actions

    def all_states(self):
        # possibly buggy but simple way to get all states
        # either a position that has possible next actions
        # or a position that yields a reward
        return set(list(self.actions.keys()) + list(self.rewards.keys()))


def standard_grid():
    # define a grid that describes the reward for arriving at each state
    # and possible actions at each state
    # the grid looks like this
    # x means you can't go there
    # s means start position
    # number means reward at that state
    # .  .  .  1
    # .  x  . -1
    # s  .  .  .
    g = Grid(3, 4, (2, 0))
    rewards = {(0, 3): 1, (1, 3): -1}
    actions = {
        (0, 0): ('D', 'R'),
        (0, 1): ('L', 'R'),
        (0, 2): ('L', 'D', 'R'),
        (1, 0): ('U', 'D'),
        (1, 2): ('U', 'D', 'R'),
        (2, 0): ('U', 'R'),
        (2, 1): ('L', 'R'),
        (2, 2): ('L', 'R', 'U'),
        (2, 3): ('L', 'U'),
      }
    g.set(rewards, actions)
    return g


def negative_grid(step_cost=-0.1):
    # in this game we want to try to minimize the number of moves
    # so we will penalize every move
    g = standard_grid()
    g.rewards.update({
        (0, 0): step_cost,
        (0, 1): step_cost,
        (0, 2): step_cost,
        (1, 0): step_cost,
        (1, 2): step_cost,
        (2, 0): step_cost,
        (2, 1): step_cost,
        (2, 2): step_cost,
        (2, 3): step_cost,
    })
    return g



In [2]:
SMALL_ENOUGH=10**-4
def print_values(V,g):
    for i in range(g.width):
        print("-------------------------")
        for j in range(g.height):
            v=V.get((i,j),0)
            if v>=0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")
        
def print_policy(P,g):
    for i in range(g.width):
        print("")
        print("----------------")
        for j in range(g.height):
            p=P.get((i,j),' ')
            print(" %s |" % p,end="")

In [None]:
ALL_POSSIBLE_ACTIONS=('U','D','L','R')
GAMMA=0.9
ALPHA=0.1
SMALL_ENOUGH = 1e-4
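def random_action(a,eps=.1):
    # epsilon-soft action selection:
    # with probability 1-eps keep the action a,
    # otherwise pick uniformly from all actions (a included)
    p=np.random.random()
    if p<(1-eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)

# play one episode from the start state (2,0) and return a list of
# (state, reward) pairs; each reward is the one received on arriving at that state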
def play_game(grid,policy):
    s=(2,0)
    grid.set_state(s)
    state_reward=[(s,0)]
    while not grid.game_over():
        a=policy[s]
        a=random_action(a)
        r=grid.move(a)
        s=grid.current_state()
        state_reward.append((s,r))
    return state_reward

    

In [3]:
grid=standard_grid()
# print rewards
print("rewards:")
print_values(grid.rewards, grid)

# state -> action
policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'U',
    (2, 1): 'L',
    (2, 2): 'U',
    (2, 3): 'L',
  }

# initialize V(s)=0 for all states; terminal states are never updated and stay at 0
V = {}
states = grid.all_states()
for s in states:
    V[s] = 0
    
# now that everything is initialized, repeatedly play episodes and apply TD(0)
for it in range(5000):
    # generate an episode as a list of (state, reward) pairs
    state_reward=play_game(grid,policy)
    # the first pair holds the start state with a placeholder reward of 0;
    # the reward for each transition is stored with the state it leads to
    for t in range(len(state_reward)-1):
        s,_ = state_reward[t]
        s2,r=state_reward[t+1]
        V[s]=V[s]+ALPHA*(r+GAMMA*V[s2]-V[s])
print("values")
print("")
print_values(V,grid)
print("policy")
print_policy(policy,grid)