# Introduction to TD and SARSA-Learning 

## CSCI E-82A

## Stephen Elston


In the previous lesson we explored Monte Carlo **reinforcement learning**. MC RL required that the returns for an entire episode be computed before any values are available for use. The disadvantage of this approach is the the full set of returns are required for state value or action value estimates. But, how can we get state values or action-values in fewer time steps? It turns out there are algorithms which compute estimates in as few as one step. In this lesson will focus on a state-value estimation algorithm known as **time difference learning** or **TD-learning** and control algorithm known as **SARSA**. 

Recall that reinforcement learning has several distinctive characteristics, which differentiate this method from other machine learning and dynamic programming:
- **No Markov model** needs to be specified for reinforcement learning, in contrast to dynamic programming.
- Like dynamic programming, reinforcement learning **optimizes a reward function**. This is in contrast to supervised and unsupervised learning which use an error or objective function.  
- Reinforcement learning algorithms learn by **experience**. Over time, the algorithm learns a model of the environment and these results are used to optimize the expected reward. Learning from experience is in contrast to supervised learning which uses known marked cases. 
- Reinforcement learning agents take **actions** and only receive **state** and **rewards** from the environment. These are the only interaction between the RL agent and the environment.    

The interaction between a reinforcement learning agent and the environment are illustrated in the figure below. Notice that the only feedback the agent receives from the environment is reward and state.   

<img src="img/RL_AgentModel.JPG" alt="Drawing" style="width:500px; height:300px"/>
<center> **Reinforcement Learning Agent and Environment** </center>  

The ability to learn from experience is an attractive concept. This method of learning seems to mimic human learning. However, reinforcement learning has proven difficult to use in real-world applications. For a review of successes and problems arising when applying RL to robotics see [Kobler et. al.](https://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf). At the present time, RL has mostly succeeded in cases where simulations can be used to gain experience. 

**Suggested readings** for TD and Q reinforcement learning, Chapters 6 and 7 of Sutton and Barto, second edition, provides a good introductions, including many alternative algorithms and details not discussed here.   

## TD Prediction Model

In a previous lesson we examined a general update model:

$$NewValue = OldValue + LearningRate * ErrorTerm$$

Following this general formulation the update for TD state value can be written as:

$$V_{t+1}(S_{t}) = V_t(S_t) + \alpha \big[ G_t - V_t(S_t) \big]$$

Here,   
$\alpha = $ the learning rate,      
$\big[ G_t - V_t(S_t) \big] = $ the TD error term is the difference between the return $G_t$ and the value, $V(S_t)$. The return is identical to the state value at convergence, making the error 0.   

   
Using the same general formulation we can create and single time step update model know as the **one step time difference** or **TD(0)** algorithm.

$$V_{t+1}(S_{t}) = V_t(S_t) + \alpha \big[ R_{t+1} + V_{t+1}(S_{t+1}) - V_t(S_t) \big]$$

Where,  
$\delta_t =  R_{t+1} + V_{t+1}(S_{t+1}) - V_t(S_t) = $ the one-step **TD error**,  
$R_{t+1} = $ the return for the next time step,   
$V_t(S_t) = $ is the state-value at time step t,   
$V_{t+1}(S_{t+1}) = $ the bootstrap state-value for the successor state, $S_{t+1}$.

Like dynamic programming algorithms, the TD algorithm **bootstraps**. The return and estimated value at the next time step, $V_{t+1}(S_{t+1})$, are from previous samples. However, unlike MC RL which does not bootstrap, the TD(0) algorithm produces an estimate of $V_{t+1}(S_t)$ in only one time step, rather than waiting to reach the terminal state at the end of the episode. This fact can be seen by examining the backup diagram shown below:

<img src="img/TD0.JPG" alt="Drawing" style="width:60px; height:150px"/>
<center> **Backup Diagram of TD(0)** </center>

## Example of Time Difference RL

With this short introduction TD RL in mind, let's try an example. We will sample the value function using a basic TD(0) algorithm here. 

As discussed in other labs, **Navigation** to a goal is a significant problem in robotics. Real-world navigation is rather complex. Therefore, in this example we will use a simple analog called a **grid world**. The grid world for this problem is shown below. 

<img src="img/GridWorld.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **A 4x4 Grid World with Terminal State** </center>

The grid world consists of a 4x4 set of positions the robot can occupy. Each position is considered a state. The goal is to navigate to state 0, the goal, in the minimum steps. We will explore methods to find policies which reach this goal and achieve maximum reward. 

Grid position 0 is the goal and a **terminal state**. There are no possible state transitions out of this position. The presence of a terminal state makes this an **episodic Markov random process**. For each episode sampled the robot can start in any other random position, $\{ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \}$. This random selection process makes this a **random start** TD algorithm. The episode terminates when the robot enters the terminal position (state 0).  

In reality, an RL agent may need to explore to find the possible actions when it is in some particular state. To simplify our example, we encode, or represent, these possibilities in a dictionary as shown in the code block below. We use a dictionary of dictionaries to perform the lookup. The keys of the outer dictionary are the identifiers (numbers) of the states. The keys of the inner dictionary are the possible actions and the values are the **successor state**, $s'$, for that transition.  

In each state, there are four possible actions the robot can take:
- up, u
- down, d,
- left, l
- right, r

The TD RL agent has no model for the environment. Therefore, beyond these allowed actions, all other information is encapsulated in the environment and is unobservable by the agent. This is the key difference between reinforcement learning and dynamic programming. 

In [1]:
## import numpy for latter
import numpy as np
import numpy.random as nr
import pandas as pd

## Define the transition dictonary of dictionaries:
neighbors = {0:{'u':0, 'd':0, 'l':0, 'r':0},
          1:{'u':1, 'd':5, 'l':0, 'r':2},
          2:{'u':2, 'd':6, 'l':1, 'r':3},
          3:{'u':3, 'd':7, 'l':2, 'r':3},
          4:{'u':0, 'd':8, 'l':4, 'r':5},
          5:{'u':1, 'd':9, 'l':4, 'r':6},
          6:{'u':2, 'd':10, 'l':5, 'r':7},
          7:{'u':3, 'd':11, 'l':6, 'r':7},
          8:{'u':4, 'd':12, 'l':8, 'r':9},
          9:{'u':5, 'd':13, 'l':8, 'r':10},
          10:{'u':6, 'd':14, 'l':9, 'r':11},
          11:{'u':7, 'd':15, 'l':10, 'r':11},
          12:{'u':8, 'd':12, 'l':12, 'r':13},
          13:{'u':9, 'd':13, 'l':12, 'r':14},
          14:{'u':10, 'd':14, 'l':13, 'r':15},
          15:{'u':11, 'd':15, 'l':14, 'r':15}}

To simulate the environment, we need a reward structure. In this case, the robot receives the following rewards:   

- 10 for entering position 0. 
- -1 for attempting to leave the grid. In other words, we penalize the robot for hitting the edges of the grid.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the TD RL agent**. The agent must **learn** the rewards by sampling the environment. Here the rewards are in the form of action values.    

We encode these rewards in the same type of dictionary structure used for the foregoing structures. 

In [2]:
rewards = {0:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0},
          1:{'u':-1, 'd':-0.1, 'l':10.0, 'r':-0.1},
          2:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          4:{'u':10.0, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          5:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          6:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          7:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          8:{'u':-0.1, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          9:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          10:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          11:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          12:{'u':-0.1, 'd':-1.0, 'l':-1.0, 'r':-0.1},
          13:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          14:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          15:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-1.0}}

As was done previously, we will use an environment simulator. The function is called with a state and action. It returns the next state, 's_prime' and 'reward'. 

To simplify the rest of the code in this notebook we are treating the dictionaries as global. In general, this would be considered poor programming practice. 

Execute the code in the cell below and observe the results from the test cases. 

In [3]:
def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 0):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime, reward, is_terminal(s_prime, terminal))

def is_terminal(state, terminal = 0):
    return state == terminal

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(1, a))

(1, -1, False)
(5, -0.1, False)
(2, -0.1, False)
(0, 10.0, True)


### TD(0) Policy Evaluation

We have everything in place to perform TD(0) policy evaluation for the grid world. The code in the cell below implements the TD(0) algorithm and applies it to the policy in the grid world. Notice that the algorithm makes calls to the aforementioned `state_values` function to find information on successor state and rewards for a transition given a current state and action. Execute this code and examine the results. 

In [4]:
initial_policy = {0:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}, 
                        2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        6:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        7:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        8:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}

The code in the cell below performs a random start for the beginning of an episode. The random start cannot be in the terminal state. 

In [5]:
def start_episode(n_states):
    '''Function to find a random starting value for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state)):
         state = nr.choice(range(n_states))
    return state

## test the function to make sure never starting in terminal state
[start_episode(15) for _ in range(10)]

[6, 1, 4, 13, 6, 12, 10, 8, 5, 7]

The function in the cell below finds and action given the state. The probability of the action is determined by the policy. Given the action the next state, reward and a terminal flag are found.     

In [6]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(16):
    print(take_action(s, initial_policy))

('l', 0, 10.0, True)
('u', 1, -1, False)
('u', 2, -1.0, False)
('u', 3, -1.0, False)
('d', 8, -0.1, False)
('d', 9, -0.1, False)
('u', 2, -0.1, False)
('u', 3, -0.1, False)
('l', 8, -1.0, False)
('d', 13, -0.1, False)
('d', 14, -0.1, False)
('l', 10, -0.1, False)
('r', 13, -0.1, False)
('l', 12, -0.1, False)
('u', 10, -0.1, False)
('r', 15, -1.0, False)


The function in the cell below computes the state-value using the one-step TD algorithm. The loop in the function iterates over the samples to update the state-values. You can see additional details by reading the code comments.    

In [8]:
def td_0_state_values(policy, n_samps, alpha = 0.05, gamma = 1.0):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    current_state = start_episode(n_states)
    terminal = False
    
    ## Array for state values
    v = np.zeros((n_states,1))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action, s_prime, reward, terminal = take_action(current_state, policy)
        ## Compute the TD error
        delta = reward + gamma*v[s_prime] - v[current_state]
        ## Update the state value
        v[current_state] = v[current_state] + alpha*delta
        current_state = s_prime
        if(terminal): ## start new episode when terminal
            current_state = start_episode(n_states)
    return(v)

td_0_state_values(initial_policy, 20000).reshape((4,4))        

array([[ 0.        ,  1.31869042, -2.11880104, -4.93612976],
       [ 1.43272279, -0.90332588, -3.50227127, -4.89284636],
       [-2.00494501, -3.19573127, -4.17354972, -5.39737062],
       [-4.23582102, -4.75961541, -5.39976187, -6.42133071]])

## One Step SARSA Algorithm

Now that we have examined the one step TD(0) algorithm for policy (value) evaluation, we need to define a one step algorithm for **control** or **policy improvement**.     

The **state action reward state action** or **SARSA** algorithm is an **on policy** method which uses time differencing to evaluate action values. As you might imagine from the name, this algorithm starts with an **on policy** action from the current state which results in a state transition and a reward. The backup diagram for one step SARSA (SARSA(0)) is shown in the figure below. 

<img src="img/SARSA.JPG" alt="Drawing" style="width:75px; height:150px"/>
<center> **Backup Diagram for SARSA(0)** </center>

Compare the backup diagram for SARSA(0) to the one for TD(0) notice that the order of state and action are reversed between the two algorithms. This realization is a good way to understand the difference. 

The update equation for SARSA(0) is as follows:

$$Q_{t+1}(S_{t},A_{t}) = Q_{t}(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q_{t}(S_{t+1},A_{t+1}) - Q_t(S_t,A_t) \big]$$  

Where,   
$Q_t(S_t,A_t) = $ is the action value in state S given action A at step t,   
$R_{t+1} = $ is the reward for the next time step,    
$\delta_t = R_{t+1} + \gamma Q_{t}(S_{t+1},A_{t+1}) - Q_t(S_t,A_t) = $ TD error,   
$Q_{t}(S_{t+1},A_{t+1}) = $ action-value of successor action, $A_{t+1}'$, from the successor state, $S_{t+1}$,   
$\alpha = $ the learning rate,   
$\gamma = $ discount factor.  

### SARSA(0) Example

The code in the cell below implements the SARSA(0) algorithm to compute the action values for the Grid world given an policy.   

1. The `print_Q` function is a helper function to neatly print policies.    
2. ........................................

Additional details on this algorithm can be seen by reading the code comments.  

Execute this code to compute and print the action values for the random walk policy.     

In [9]:
def print_Q(Q):
    Q = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
    print(Q)

def new_episode(n_states, policy):
    '''This function provides a start for a TD
    episode making sure the first transition is not 
    the termnal state'''
    current_state = start_episode(n_states)
    ## Find fist action and reward
    action, s_prime, reward, terminal = take_action(current_state, policy)
    return(current_state, action, s_prime, reward, terminal)    


def SARSA_0(policy, n_samps, alpha = 0.02, gamma = 1.0, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)
    action_idx = action_index[action]
    
    ## Array for state values
    q = np.zeros((n_states, len(policy[0])))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(s_prime, policy)
        action_idx_prime = action_index[action_prime]
        ## Compute the TD error
        delta = reward + gamma*q[s_prime, action_idx_prime] - q[current_state, action_idx]
        ## Update the action values
        q[current_state, action_idx] = q[current_state, action_idx] + alpha*delta
        ## Update the state, action and reward for the next time step
        current_state = s_prime
        s_prime = s_prime_prime
        action = action_prime
        reward = reward_prime
        terminal = terminal_prime
        action_idx = action_idx_prime

        ## Check if end of episode
        if(terminal): 
            ## start new episode
            current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)        
    return(q)


Q = SARSA_0(initial_policy, 20000, alpha = 0.2, gamma = 0.99)
print_Q(Q)

          up       down      left      right
0   0.000000   0.000000  0.000000   0.000000
1  -2.946151  -4.679119  1.766825  -7.055880
2  -9.304375  -7.531620 -2.673374  -9.243154
3  -9.855525  -8.549178 -7.550794  -9.700567
4   2.192268  -7.065908 -3.856339  -4.963840
5  -4.924289  -7.354114 -2.681596  -6.955821
6  -6.809831  -8.078507 -5.905157  -8.315024
7  -8.603230  -8.648365 -7.046689  -9.218676
8  -3.040000  -7.967481 -6.049473  -7.777612
9  -5.853649  -8.709268 -6.332785  -8.253191
10 -6.920516  -8.407800 -7.477266  -9.500826
11 -9.315121  -9.722783 -7.789940 -10.088669
12 -4.538828  -8.300985 -8.376000  -8.821630
13 -7.847931  -9.131746 -7.287943  -8.803079
14 -8.076055  -9.221894 -8.093935  -9.553308
15 -9.419925 -10.999737 -8.306114 -10.519434


The arrays printed above display the action values for each state. The four arrays represent the four possible actions, up, down, left and right. 

### GPI with SARSA(0)

The code in the cell below performs **general policy improvement (GPI)** using SARSA(0) to evaluate action values. Several cycles of SARSA(0) and policy improvement are performed with this code. Notice that policy improvement is $\epsilon$-greedy. The probability of any state transition will never go to zero, allowing continued exploration. Additional details on this algorithm can be seen by reading the code comments.  

Execute this code and examine the results.  

In [10]:
def update_policy(policy, Q, epsilon, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    '''Updates the policy based on estiamtes of Q using 
    an epslion greedy algorithm. The action with the highest
    action value is used.'''
    
    ## Find the keys for the actions in the policy
    keys = list(policy[0].keys())
    
    ## Iterate over the states and find the maximm action value.
    for state in range(len(policy)):
        ## First find the index of the max Q values  
        q = Q[state,:]
        max_action_index = np.where(q == max(q))[0]

        ## Find the probabilities for the transitions
        n_transitions = float(len(q))
        n_max_transitions = float(len(max_action_index))
        p_max_transitions = (1.0 - epsilon *(n_transitions - n_max_transitions))/(n_max_transitions)
  
        ## Now assign the probabilities to the policy as epsilon greedy.
        for key in keys:
            if(action_index[key] in max_action_index): policy[state][key] = p_max_transitions
            else: policy[state][key] = epsilon
    return(policy)                

update_policy(initial_policy, Q, 0.01)    

{0: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 1: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 2: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 3: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 4: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 5: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 6: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 7: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 8: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 9: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 10: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 11: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 12: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 13: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 14: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 15: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01}}

In [10]:
def SARSA_GPI(policy, n_samples, n_cycles, epsilon = 0.1, n_actions = 4):
    '''Function perfoms GPI using Monte Carlo value estimation.
    Updates to policy are epsilon greedy to prevent the algorithm
    from being trapped at some point.'''
    Q = np.zeros((len(policy), n_actions))
    ## Iterate over the required number of cycles
    for _ in range(n_cycles):
        Q = SARSA_0(policy, n_samples, alpha = 0.2, gamma = 0.99)
        policy = update_policy(policy, Q, epsilon = epsilon)
    return(policy)

improved_policy = SARSA_GPI(initial_policy, 100, 50, epsilon = 0.1)  
for state in range(16):
    print(improved_policy[state])

{'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.3, 'd': 0.3, 'l': 0.1, 'r': 0.3}
{'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.3, 'd': 0.1, 'l': 0.3, 'r': 0.3}
{'u': 0.3, 'd': 0.3, 'l': 0.1, 'r': 0.3}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7}
{'u': 0.3, 'd': 0.3, 'l': 0.3, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}


## N-Step Time Differencing

In the previous lesson we examined Monte Carlo RL which requires that each episode reach a terminal state before a value or action value estimate can be made. In the first part of this lesson we have focused on one-step time differencing, TD, algorithms which update estimates at each time step. Now, we will look at algorithms between these two extreme cases, n-step time differencing algorithms. 

We can generalize the concept of the one-step bootstrapping of the return starting with TD(0):

$$G_{t} = R_{t+1} + \gamma V_{t}(S_{t+1})$$

Here, $V_{t}(S_{t+1})$ is being bootstrapped. 

We can extend this formulation to two-step TD as follows:

$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

In the above it is $V_{t+1}(S_{t+2})$ being bootstrapped. 

For n-step TD the return is computed by the following bootstrap formulation. 

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$

Regardless of home many steps are used to compute the return, the value update is:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big],\ 0 \leq t < T]$$

Where $\delta_t = G_{t:t+n} - V_{t+n-1}(S_t) = $ the n-step TD error.

The resulting backup diagrams for general TD(n) is shown in the diagram below. 

<img src="img/TDN.JPG" alt="Drawing" style="width:400px; height:350px"/>
<center> **Backup Diagram for n-step TD** </center>

From Left to right you can see how the backup diagram adds more states and actions. On the extreme right the backup diagram for the MC algorithm is shown. Notice that MC RL does not bootstrap, but samples until the terminal state is reached. 

N-step TD methods are generally considered to have two advantages over single-step methods:
1. N-step methods generally have better convergence properties. 
2. N-step methods break the link between time between actions and time when bootstraping occurs.   

### Example of TD(n) for Policy Evaluation

The code in the cell below implements TD(n) policy evaluation. As you can see, the bookkeeping is a bit involved. We must account for the cases where fewer than n steps have been executed at the start of an episode or when the terminal state has been reached. You can find further details of the algorithm by reading the code comments. 

Execute this code for n=4 and examine the results.

In [11]:
def TD_n(policy, episodes, n, alpha = 0.2, gamma = 0.9, epsilon = 0.1, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
#    action_index = list(range(len(list(policy[0].keys()))))
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    v = np.zeros((n_states))
    
    for _ in range(episodes):
        ## Initialize variables
        T = float("inf")
        tau = 0
        t = 0
        rewards = []
       
        ## Get the random initial state
        current_state = start_episode(n_states)   
        state = [current_state]
        ## Initial action
        action, s_prime, reward, terminal = take_action(current_state, policy)
        state.append(s_prime)
        
        while(not (tau == T - 1)):
            if(t < T):
                ## Append the reward to the list
                rewards.append(reward)             
                if(terminal): 
                    ## update T if at terminal state
                    T = t + 1
                else: 
                    ## Get the next action state and rewards
                    action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(current_state, policy)
                    state.append(s_prime_prime)
                      
            ## Update tau
            tau = t - n + 1
            
            if(tau > 0):
                G = 0.0
                for i in range(tau + 1, min(tau + n, T)):
                    exponent = i + tau - 1
                    G = G + gamma**exponent * rewards[i]
                if(tau + n < T): G += gamma**n * v[state[tau + n]] 
                v[state[tau]] = v[state[tau]] + alpha * (G - v[state[tau]])    
                
            
            ## Update variables for the next step
            t += 1
            current_state = s_prime
            if(not terminal):
                action = action_prime
                s_prime = s_prime_prime
                reward = reward_prime
                terminal = terminal_prime
    return(v)     

TD_n(initial_policy, 5000, 2).reshape((4,4))

array([[ 0.        ,  0.64723536,  0.05213313, -0.11663536],
       [ 0.24939185,  0.20241218, -0.00685404, -0.07291205],
       [ 0.05925891, -0.04600807, -0.1356987 , -0.07551572],
       [-0.19113176, -0.09674084, -0.32695795, -0.1806441 ]])

## N-step SARSA for control

Just as we used SARSA(0) for the control or policy improvement, we can generalize this algorithm to the n-step case. This is the SARSA(n) algorithm.  

The n-step return is computed by the following bootstrap formulation. 

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}),\ n \geq 1, 0 \leq t < T-n$$

Using this return, the action value update becomes:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \big],\ 0 \leq t < T]$$

Where $\delta_t =  G_{t:t+n} - Q_{t+n-1}(S_t,A_t) = $ the n-step error.

Backup diagrams for n-step SARSA are shown in the figure below. These backups reverse the order of state and action when compared to the n-step TD prediction algorithm discussed above. 

<img src="img/SARSAN.JPG" alt="Drawing" style="width:400px; height:350px"/>
<center> **Backup Diagram for n-step SARSA** </center>

### Example of N-Step SARSA

The code in the cell below implements the n-step SARSA algorithm. As is the case for the n-step TD algorithm, the bookkeeping is a bit involved. We must account for the cases where fewer than n steps have been executed at the start of an episode or when the terminal state has been reached. You can find further details of the algorithm by reading the code comments.

Execute this code for n=4 and examine the results.

In [12]:
def SARSA_n(policy, episodes, n, alpha = 0.2, gamma = 0.9, epsilon = 0.1, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
#    action_index = list(range(len(list(policy[0].keys()))))
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    q = np.zeros((n_states, n_actions))
    
    for _ in range(episodes):
        ## Initialize variables
        T = float("inf")
        tau = 0
        t = 0
        rewards = []
       
        ## Get the random initial state
        current_state = start_episode(n_states)   
        state = [current_state]
        ## Initial action
        action, s_prime, reward, terminal = take_action(current_state, policy)
        state.append(s_prime)
        
        while(not (tau == T - 1)):
            if(t < T):
                ## Append the reward to the list
                rewards.append(reward)             
                if(terminal): 
                    ## update T if at terminal state
                    T = t + 1
                else: 
                    ## Get the next action state and rewards
                    action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(current_state, policy)
                    state.append(s_prime_prime)
                      
            ## Update tau
            tau = t - n + 1
            
            if(tau > 0):
                G = 0.0
                for i in range(tau + 1, min(tau + n, T)):
                    exponent = i + tau - 1
                    G = G + gamma**exponent * rewards[i]
                if(tau + n < T): G += gamma**n * q[state[tau + n], action_index[action_prime]] 
                q[state[tau], action_index[action]] = q[state[tau], action_index[action]] + alpha * (G - q[state[tau], action_index[action]])    
                
            
            ## Update variables for the next step
            t += 1
            current_state = s_prime
            if(not terminal):
                action = action_prime
                s_prime = s_prime_prime
                reward = reward_prime
                terminal = terminal_prime
    return(q)              
            
Q = SARSA_n(initial_policy, 1000, 2)
print_Q(Q)

          up      down      left     right
0   0.000000  0.000000  0.000000  0.000000
1   0.118093  0.024812  0.587167 -0.014276
2   0.264104 -0.017211  0.512032 -0.024566
3   0.067824 -0.023504  0.264647  0.001974
4   0.419833 -0.051450  0.153674 -0.021796
5   0.093856 -0.110238  0.079718  0.022047
6   0.183314 -0.083767  0.032039 -0.038164
7   0.002154 -0.075237  0.014428 -0.024871
8   0.174928 -0.272355 -0.167314 -0.079980
9   0.000043 -0.170337 -0.010919 -0.073485
10  0.013197 -0.132439 -0.097850 -0.077194
11 -0.029898 -0.129589 -0.067561 -0.136060
12 -0.083455 -0.502835 -0.460308 -0.200731
13 -0.076160 -0.409107 -0.218631 -0.172191
14 -0.075450 -0.440874 -0.222348 -0.133579
15 -0.066220 -0.190607 -0.136956 -0.212432


In [13]:
improved_policy = update_policy(initial_policy, Q, 0.1) 
for state in range(16):
    print(improved_policy[state])

{'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}
{'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}


In [14]:
TD_n(improved_policy, 2000, 4).reshape((4,4))

array([[0.        , 1.84495053, 2.21512632, 0.7417259 ],
       [2.97997195, 2.27434712, 1.09352897, 0.18825758],
       [3.28518268, 1.82703452, 0.52822306, 1.28544781],
       [2.47730771, 0.99785037, 0.87810141, 0.16149905]])

### GPI with N-step SARSA

The code in the cell below executes the GPI algorithm using n-step SARSA to evaluate the action values. Details of the algorithm can be found by reading the code comments. 

Execute the code using 50 cycles of 100 episodes and 4-step SARSA and examine the results. 

In [15]:
def SARSA_n_GPI(policy, n, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_n(policy, episodes, n, goal = goal, alpha = alpha, gamma = gamma, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

SARSA_N_Policy = SARSA_n_GPI(policy, n = 4, cycles = 5, episodes = 1000, goal = 0, alpha = 0.2, epsilon = 0.1)
SARSA_N_Policy

NameError: name 'policy' is not defined

Now, execute the code in the cell below to evaluate the policy you have just computed using the n-step TD algorithm with n = 4. 

In [None]:
np.array(TD_n(SARSA_N_Policy, episodes = 1000, n = 4, goal = 0, alpha = 0.2, gamma = 0.9)).reshape((4,4))

## N-Step Off-Policy Learning with Importance Sampling

For n-step off-policy learning we update a target policy $\pi(A_t|S_t)$ using samples from a behavior policy $b(A_t|S_t)$. Since the two policies differ, the probabilities of an action given the state will undoubtedly differ. For example, the behavior policy can be exploratory whereas, the target policy is greedy. 

To account for the different probabilities of sampling we reweight by the **importance sampling ratio**. For an n-step algorithm at time step $t$ the importance sampling ratio can be expressed as:

$$\rho_{t:t + n -1} = \prod_{k=\tau}^{min(t + n -1,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

The n-step TD update then becomes:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\ \rho_{t:t+n-1} \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big],\ 0 \leq t < T]$$

And the SARSA update becomes:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\ \rho_{t:t+n-1} \big[ G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \big],\ 0 \leq t < T]$$

For both of the above update equations consider the effect of importance sampling ratio. If the action given state is more likely under the target policy that the behavior policy, more weight is given to updating with the error term. However, If the action given state is less likely under the target policy that the behavior policy, less weight is given to updating with the error term. In this way, the weighting by the importance sampling ratio gives the correct updates for the target policy regardless of the transition probabilities of the behavior policy. 

> **NOte:** Considerably more detail on n-step off-policy RL algorithms can be found in Sutton and Barto, second edition, Sections 7.3, 7.4 and 7.5. 

#### Copyright 2018, 2019, Stephen F Elston. All rights reserved. 