The methods used to estimate V and Q require $|S|$ or $|S| \times |A|$ values, and approximation can be used for estimates with fewer values.  We do feature extraction to convert states s to features x ($x=\phi(s)$), then approximate V with a function with parameter theta ($f(x,\theta)\approx V(s)$).  These methods require differentiable models (linear regression and gradient descent).  Start with linear function approximation applied to MC prediction.  Feature engineering is important for this type of thing.

We will be using sqaured error as our cost function.

We replace the expected value of V with the sample mean in the error function:
$$
\text{Error} = \left[ \frac{1}{N} \sum_{i=1}^N G_{i,s} - \hat V(s) \right]^2
$$
With this error function, we can use stochastic gradient descent.  V has $\theta$ as a parameter, and in gradient descent we make use of the fact that the negative of the gradient points in the direction of steepest descent.  So to go to a local min starting at a:
$$
a_{k+1}=a_k-\gamma \nabla f(a_k)
$$

So to miniize error, we change theta by taking steps to minimize the gradient at rate alpha 
$$
\theta \leftarrow \theta - \alpha \frac{\partial E}{\partial \theta}
$$
In general, 
$$
\theta \leftarrow \theta + \alpha \left( G- \hat V \right) \frac{\partial E}{\partial \theta}
$$

For linear models, 
$$
\hat V(s,\theta) = \theta^T \phi (s)= \theta^T x
$$
$$
\frac{\partial \hat V}{\partial \theta}  = x
$$
thus
$$
\theta \leftarrow \theta + \alpha \left( G- \hat V \right) x
$$
Back in the MC case, $V(s)$ was itself the parameter (with grad 1) leading to the update equation we actually used
$$
V(s) \leftarrow  V(s) + \alpha \left( G_s - V(s) \right)
$$

Now consider finding the features.  We can treat states as categorical vectors and use one hot encoding.  Then the dimension of the resulting vector would be the number of states.  The problem is this uses the same number of parameters as the old way $|S|,$ therefore we do not use it except for debugging and testing.

For our gridworld example, we are going to let (i,j) be the 2 dimensional vector in x-y space, potentially scaled to mean 0 and variance 1.  However, we will use a polynomial model to allow for more expression.

We will start with a fixed policy (the prediction problem)

In [16]:
import numpy as np


class Grid: # Environment
    def __init__(self, width, height, start):
        self.width = width
        self.height = height
        self.i = start[0]
        self.j = start[1]

    def set(self, rewards, actions):
        # rewards should be a dict of: (i, j): r (row, col): reward
        # actions should be a dict of: (i, j): A (row, col): list of possible actions
        self.rewards = rewards
        self.actions = actions

    def set_state(self, s):
        self.i = s[0]
        self.j = s[1]

    def current_state(self):
        return (self.i, self.j)

    def is_terminal(self, s):
        return s not in self.actions

    def move(self, action):
        # check if legal move first
        if action in self.actions[(self.i, self.j)]:
            if action == 'U':
                self.i -= 1
            elif action == 'D':
                self.i += 1
            elif action == 'R':
                self.j += 1
            elif action == 'L':
                self.j -= 1
        # return a reward (if any)
        return self.rewards.get((self.i, self.j), 0)

    def undo_move(self, action):
    # these are the opposite of what U/D/L/R should normally do
        if action == 'U':
            self.i += 1
        elif action == 'D':
            self.i -= 1
        elif action == 'R':
            self.j -= 1
        elif action == 'L':
            self.j += 1
        # raise an exception if we arrive somewhere we shouldn't be
        # should never happen
        assert(self.current_state() in self.all_states())

    def game_over(self):
        # returns true if game is over, else false
        # true if we are in a state where no actions are possible
        return (self.i, self.j) not in self.actions

    def all_states(self):
        # possibly buggy but simple way to get all states
        # either a position that has possible next actions
        # or a position that yields a reward
        return set(list(self.actions.keys()) + list(self.rewards.keys()))


def standard_grid():
    # define a grid that describes the reward for arriving at each state
    # and possible actions at each state
    # the grid looks like this
    # x means you can't go there
    # s means start position
    # number means reward at that state
    # .  .  .  1
    # .  x  . -1
    # s  .  .  .
    g = Grid(3, 4, (2, 0))
    rewards = {(0, 3): 1, (1, 3): -1}
    actions = {
        (0, 0): ('D', 'R'),
        (0, 1): ('L', 'R'),
        (0, 2): ('L', 'D', 'R'),
        (1, 0): ('U', 'D'),
        (1, 2): ('U', 'D', 'R'),
        (2, 0): ('U', 'R'),
        (2, 1): ('L', 'R'),
        (2, 2): ('L', 'R', 'U'),
        (2, 3): ('L', 'U'),
      }
    g.set(rewards, actions)
    return g


def negative_grid(step_cost=-0.1):
    # in this game we want to try to minimize the number of moves
    # so we will penalize every move
    g = standard_grid()
    g.rewards.update({
    (0, 0): step_cost,
    (0, 1): step_cost,
    (0, 2): step_cost,
    (1, 0): step_cost,
    (1, 2): step_cost,
    (2, 0): step_cost,
    (2, 1): step_cost,
    (2, 2): step_cost,
    (2, 3): step_cost,
    })
    return g

In [17]:
SMALL_ENOUGH=10**-4
def print_values(V,g):
    for i in range(g.width):
        print("-------------------------")
        for j in range(g.height):
            v=V.get((i,j),0)
            if v>=0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")
        
def print_policy(P,g):
    for i in range(g.width):
        print("")
        print("----------------")
        for j in range(g.height):
            p=P.get((i,j),' ')
            print(" %s |" % p,end="")

In [5]:
def random_action(a):
    p=np.random.random()
    if p<.5:
        return a
    else:
        tmp=list(ALL_POSSIBLE_ACTIONS)
        tmp.remove(a)
        return np.random.choice(tmp)
    

ALL_POSSIBLE_ACTIONS=('U','D','L','R')
GAMMA=0.9
def play_game(grid,policy):
    #return states and returns
    #reset to start at random posn,
    
    start_states=list(grid.actions.keys())
    start_idx=np.random.choice(len(start_states))
    grid.set_state(start_states[start_idx])
    
    s=grid.current_state()
    state_reward=[(s,0)] #state reward tuple
    while not grid.game_over():
        a=policy[s]
        a=random_action(a) 
        r=grid.move(a)
        s=grid.current_state()
        state_reward.append((s,r))
    G=0
    state_return=[]
    first = True
#     print(state_reward)
    for s,r in reversed(state_reward):
        if first:
            first = False
        else:
            #ignore first state bc value for terminal state is 0
            state_return.append((s,G))
        G=r+GAMMA*G
#     print(state_return)
    state_return.reverse()
    
    return state_return

In [12]:
grid = standard_grid()

# print rewards
print("rewards:")
print_values(grid.rewards, grid)
LEARNING_RATE=0.001


# state -> action
policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'U',
    (2, 1): 'L',
    (2, 2): 'U',
    (2, 3): 'L',
  }

#initialize theta for our model V=theta.dot(x) with
#x=[row,col,row*col,1]-1 for bias term
theta=np.random.randn(4)/2
def s2x(s):
    return np.array([s[0]-1,s[1]-1,s[0]*s[1]-3,1])#why minus 3?

#repeat until converge
deltas=[]
t=1.0
for it in range(20000):
    if it % 100 ==0:
        t+=0.01
    alpha=LEARNING_RATE
    biggest_change=0
    state_return=play_game(grid,policy)
    seen_states=set()
    for s,G in state_return:
        if s not in seen_states:
            old_theta=theta.copy()
            x=s2x(s)
            V_hat=theta.dot(x)
            theta+=alpha*(G-V_hat)*x
            biggest_change=max(biggest_change,np.abs(old_theta-theta).sum())
            seen_states.add(s)
        deltas.append(biggest_change)
#now get the values to print out
V = {}
states = grid.all_states()
for s in states:
    if s in grid.actions.keys():
        V[s]=theta.dot(s2x(s))
    else:
  # terminal state or state we can't otherwise get to
        V[s] = 0
            
print("values")
print("")
print_values(V,grid)
print("policy")
print_policy(policy,grid)

rewards:
-------------------------
 0.00| 0.00| 0.00| 1.00|
-------------------------
 0.00| 0.00| 0.00|-1.00|
-------------------------
 0.00| 0.00| 0.00| 0.00|
values

-------------------------
 0.44| 0.54| 0.64| 0.00|
-------------------------
 0.34| 0.00| 0.29| 0.00|
-------------------------
 0.24| 0.09|-0.06|-0.21|
policy

----------------
 R | R | R |   |
----------------
 U |   | U |   |
----------------
 U | L | U | L |

Approximation for TD(0)

In TD we update before the end, using the reward and value for the next state instead of G.  So we are using the model output as a target to fix model parameters.  This is a semi-gradient method, because the target we are using is not a true target, and thus the gradient is not a true gradient.

In [22]:
#here we need the play game function from TD0
def random_action(a, eps=0.1):
    # we'll use epsilon-soft to ensure all states are visited
    # what happens if you don't do this? i.e. eps=0
    p = np.random.random()
    if p < (1 - eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)

def play_game_td(grid, policy):
    # returns a list of states and corresponding rewards (not returns as in MC)
    # start at the designated start state
    s = (2, 0)
    grid.set_state(s)
    states_and_rewards = [(s, 0)] # list of tuples of (state, reward)
    while not grid.game_over():
        a = policy[s]
        a = random_action(a)
        r = grid.move(a)
        s = grid.current_state()
        states_and_rewards.append((s, r))
    return states_and_rewards

In [23]:
class Model:
    def __init__(self):
        self.theta=np.random.randn(4)/2
    def s2x(self,s):
        return np.array([s[0]-1,s[1]-1,s[0]*s[1]-3,1])
    def predict(self,x):
        x=self.s2x(s)
        return self.theta.dot(x)
    def grad(self,s):
        return self.s2x(s)


In [24]:
grid = standard_grid()
ALPHA=.001
# print rewards
print("rewards:")
print_values(grid.rewards, grid)
# state -> action
policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'R',
    (2, 1): 'R',
    (2, 2): 'R',
    (2, 3): 'U',
  }
model=Model()
deltas=[]
k=1.0
for it in range(20000):
    if it % 10 ==0:
        k+=0.01
    alpha=ALPHA/k
    biggest_change=0
    
    state_reward = play_game_td(grid,policy)
    for t in range(len(state_reward)-1):
        s,_ = state_reward[t]
        s2,r=state_reward[t+1]
        old_theta = model.theta.copy()
        if grid.is_terminal(s2):
            target=r
        else: #here is where we use an estimate as target
            target = r+GAMMA*model.predict(s2)
        model.theta+=alpha*(target-model.predict(s))*model.grad(s)
        biggest_change=max(biggest_change,np.abs(old_theta-model.theta).sum())
    deltas.append(biggest_change)

V = {}
states = grid.all_states()
for s in states:
    if s in grid.actions.keys():
        V[s]=theta.dot(s2x(s))
    else:
  # terminal state or state we can't otherwise get to
        V[s] = 0
            
print("values")
print("")
print_values(V,grid)
print("policy")
print_policy(policy,grid)


rewards:
-------------------------
 0.00| 0.00| 0.00| 1.00|
-------------------------
 0.00| 0.00| 0.00|-1.00|
-------------------------
 0.00| 0.00| 0.00| 0.00|
values

-------------------------
 0.44| 0.54| 0.64| 0.00|
-------------------------
 0.34| 0.00| 0.29| 0.00|
-------------------------
 0.24| 0.09|-0.06|-0.21|
policy

----------------
 R | R | R |   |
----------------
 U |   | R |   |
----------------
 U | R | R | U |