#  TD($\lambda$)

In this notebook, we study TD($\lambda$), a model-free RL algorithm that interpolates between TD$(0)$ and Monte Carlo (MC) methods. We first explain the mathematical concept behind it and then implement it via eligibility traces.

In the theory part, we follow the lucid [monograph by Richard Sutton and Andrew Barto](http://incompleteideas.net/book/the-book-2nd.html), whereas the implementation draws on the following two sources:

https://github.com/mpatacchiola/dissecting-reinforcement-learning/blob/master/src/3/temporal_differencing_prediction_trace.py

https://github.com/jiexunsee/Deep-Watkins-Q-and-Actor-Critic/blob/master/DeepTDLambdaLearner.py

## Code

First, we load the `FrozenLake-v0` environment, take one random action, and render the grid before and after the step.

In [1]:
import gym

%matplotlib inline 

env = gym.make('FrozenLake-v0')

s = env.reset()
env.render()
env.step(env.action_space.sample())
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG


Then, we define a single step of the $\mathsf{TD}(\lambda)$ scheme. The goal in this toy problem is to estimate the state-value function of the policy that picks every action uniformly at random.
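
For reference, the step implemented in the next cell can be written as tabular TD($\lambda$) with accumulating eligibility traces. The notation is our own shorthand and follows the usual textbook convention: $\alpha$ stands for `LR`, $e_t$ for `trace`, and $S_t$, $R_{t+1}$ for the state and reward sequence.

$$
\begin{aligned}
e_t(s) &= \gamma\lambda\, e_{t-1}(s) + \mathbb{1}\{s = S_t\},\\
\delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t),\\
V(s) &\leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s.
\end{aligned}
$$

With $\lambda = 0$ only the current state $S_t$ is updated, which is exactly TD$(0)$. As $\lambda$ approaches $1$, the trace decays only through $\gamma$, so the TD error is credited to all states visited earlier in the episode and the update behaves increasingly like MC.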

In [14]:
import numpy as np 
LR = .8
GAMMA = .99
LAMBDA = .5

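def q_step(s, i):
    """One TD(lambda) prediction step from state s (the episode index i is unused)."""
    #V and trace are module-level arrays created by the training loop
    global trace
    global V

    #sample an action uniformly at random
    a = np.random.randint(env.nA)

    #observe next state, reward and termination flag
    ss, r, d, _ = env.step(a)

    #decay all eligibility traces, then bump the trace of the current state
    trace *= GAMMA * LAMBDA
    trace[s] += 1

    #TD error, applied to every state in proportion to its trace
    delta = r + GAMMA * V[ss] - V[s]
    V += LR * delta * trace

    return ss, r, d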

Now, we let the algorithm run and print the learned state-value function.

In [29]:
import numpy as np

NEPS = int(5e3)
rList = []
V = np.zeros(env.nS)

for i in range(NEPS):
    #Reset environment and get first new observation
    s = env.reset()
    trace = np.zeros(env.nS)
    rAll = 0
    
    #TD(lambda) prediction: step until the episode terminates
    while True:
        s, r, d = q_step(s, i)
        rAll += r
        if d:
            break
    rList.append(rAll)

In [30]:
print(V)

[ 0.00708151  0.01044591  0.00465903  0.00310812  0.00056628  0.
  0.00196414  0.          0.00329716  0.02576031  0.00958781  0.
  0.         -0.00784054  0.36166087  0.        ]
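
Because FrozenLake is stochastic, it is hard to judge from the printed numbers alone whether the update is correct. As a sanity check, the same update can be run on a problem with known values. The sketch below is not part of the original notebook: it assumes a five-state symmetric random walk (reward 1 when exiting on the right, 0 otherwise, no discounting), for which the true values are $1/6, \dots, 5/6$. The function name `td_lambda_random_walk` and all its parameters are hypothetical.

```python
import numpy as np

def td_lambda_random_walk(n_states=5, episodes=10_000, alpha=0.01,
                          gamma=1.0, lam=0.5, seed=0):
    """Tabular TD(lambda) with accumulating traces on a symmetric random walk.

    States 1..n_states are non-terminal; 0 and n_states + 1 are terminal.
    Exiting on the right gives reward 1, every other transition gives 0.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)           # terminal entries stay at 0
    for _ in range(episodes):
        trace = np.zeros_like(V)         # traces are reset every episode
        s = (n_states + 1) // 2          # start in the middle
        while 0 < s < n_states + 1:
            ss = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if ss == n_states + 1 else 0.0

            # same three lines as in the notebook's step function
            trace *= gamma * lam
            trace[s] += 1.0
            delta = r + gamma * V[ss] - V[s]
            V += alpha * delta * trace

            s = ss
    return V[1:-1]                       # values of the non-terminal states

estimate = td_lambda_random_walk()
truth = np.arange(1, 6) / 6.0
print(np.round(estimate, 3))
print(np.round(truth, 3))
```

The small step size `alpha=0.01` is a deliberate choice for this check: with a constant step size the estimates keep fluctuating around the true values, and the fluctuation grows with the step size.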


## $\mathsf{TD}(\lambda)$ for Optimal Policy

Now, we modify the $\mathsf{TD}(\lambda)$ algorithm so that it learns an action-value table $Q$ and thereby estimates the optimal policy instead of evaluating a fixed one.
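
In the notation assumed here ($\alpha$ for `LR`, $e_t(s,a)$ for `q_trace`), the update used below is the $\mathsf{TD}(\lambda)$ step with traces kept per state-action pair and the bootstrap target replaced by a maximum over actions:

$$
\begin{aligned}
e_t(s,a) &= \gamma\lambda\, e_{t-1}(s,a) + \mathbb{1}\{s = S_t,\ a = A_t\},\\
\delta_t &= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t),\\
Q(s,a) &\leftarrow Q(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \text{for all } s, a.
\end{aligned}
$$

This form is sometimes called naive Q($\lambda$). Watkins's Q($\lambda$) differs in that it resets all traces to zero after an exploratory (non-greedy) action; the code in this notebook does not do that.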

In [6]:
import numpy as np 
LR = .8
GAMMA = .99
LAMBDA = .5
EPS  = .05

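def q_step_opt(s, i):
    """One Q-learning step with eligibility traces from state s in episode i."""
    #Q and q_trace are module-level arrays created by the training loop
    global q_trace
    global Q

    #select the greedy action w.r.t. noise-perturbed Q-values; the noise scale
    #EPS + 1 / (i + 1) shrinks with the episode index (not epsilon-greedy in the strict sense)
    a = np.argmax(Q[s, :] + np.random.randn(env.nA) * (EPS + 1 / (i + 1)))

    #observe next state, reward and termination flag
    ss, r, d, _ = env.step(a)

    #decay all eligibility traces, then bump the trace of the visited pair
    q_trace *= GAMMA * LAMBDA
    q_trace[s, a] += 1

    #TD error with a greedy bootstrap target, applied in proportion to the traces
    delta = r + GAMMA * np.max(Q[ss, :]) - Q[s, a]
    Q += LR * delta * q_trace

    return ss, r, d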

Again, we run the learning over many episodes.

In [10]:
import numpy as np

NEPS = int(1e4)
rList = []
Q = np.zeros((env.nS, env.nA))

for i in range(NEPS):
    #Reset environment and get first new observation
    s = env.reset()
    q_trace = np.zeros((env.nS, env.nA))
    rAll = 0
    
    #Q-learning with eligibility traces: step until the episode terminates
    while True:
        s, r, d = q_step_opt(s, i)
        rAll += r
        if d:
            break
    rList.append(rAll)
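
The loop above fills `Q` and `rList` but does not inspect them. One possible way to do so is sketched below. The helper names `greedy_policy` and `running_success_rate` are hypothetical, and the toy inputs are made up for illustration. The action labels assume FrozenLake's usual encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up).

```python
import numpy as np

# assumed FrozenLake action encoding: 0 = Left, 1 = Down, 2 = Right, 3 = Up
ACTION_NAMES = np.array(['Left', 'Down', 'Right', 'Up'])

def greedy_policy(Q):
    """Return, for every state, the index of the action with the largest Q-value."""
    return np.argmax(Q, axis=1)

def running_success_rate(rewards, window=100):
    """Moving average of episode returns; assumes len(rewards) >= window."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode='valid')

# toy inputs, for illustration only
Q_demo = np.array([[0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.2]])
print(ACTION_NAMES[greedy_policy(Q_demo)])         # ['Down' 'Right']
print(running_success_rate([0, 0, 1, 1], window=2))  # [0.  0.5 1. ]
```

Applied to the notebook's variables, `ACTION_NAMES[greedy_policy(Q)].reshape(4, 4)` would lay the greedy actions out on the 4x4 grid, and `running_success_rate(rList)` would show how often recent episodes reached the goal. Entries for holes and the goal carry no meaning, since their Q-values are never updated.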