### Off-Policy TD Learning: Q-Learning

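**Key Difference from SARSA**: The update target uses $\max_{a'} Q(s',a')$, the value of the greedy action in the next state, instead of $Q(s',a')$ for the action $a'$ that the current policy actually selects.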

**Q-Learning** is an off-policy temporal difference algorithm that learns the optimal action-value function regardless of the policy being followed, provided that policy keeps visiting every state-action pair.

**Key Formula**: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
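
To make the update rule concrete, here is a minimal sketch of a single update on a toy Q-table. The table values, the transition `(s, a, r, s_next)`, and the hyperparameters are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical toy Q-table: 3 states x 2 actions (illustrative values only)
Q = np.array([[0.0, 0.5],
              [0.2, 0.8],
              [0.0, 0.0]])
alpha, gamma = 0.1, 0.99

# Hypothetical transition: in state 0, action 1 gives reward 1.0 and leads to state 1
s, a, r, s_next = 0, 1, 1.0, 1

td_target = r + gamma * np.max(Q[s_next])  # 1.0 + 0.99 * 0.8 = 1.792
td_error = td_target - Q[s, a]             # 1.792 - 0.5 = 1.292
Q[s, a] += alpha * td_error                # 0.5 + 0.1 * 1.292 = 0.6292

print(Q[s, a])
```

Only the entry `Q[s, a]` changes; the max over `Q[s_next]` is read but never written.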

In [1]:
import numpy as np
import gymnasium as gym

#### Epsilon-Greedy Action Selection

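Q-learning uses the same $\epsilon$-greedy behavior policy as SARSA: with probability $\epsilon$ take a random action (exploration), otherwise take the action with the highest current Q-value (exploitation).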

In [2]:
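def epsilon_greedy(Q, state, epsilon, env):
    """Return a random action with probability epsilon, else the greedy action."""
    if np.random.rand() < epsilon:
        # Explore: sample uniformly from the action space
        return env.action_space.sample()
    else:
        # Exploit: pick the action with the highest Q-value (first index on ties)
        return np.argmax(Q[state])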
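
As a rough sanity check on this selection rule, the sketch below estimates how often the greedy action is chosen. It does not use the environment: `sample_action`, the Q-values in `Q_row`, and the seed are hypothetical stand-ins, and the random branch is assumed to be uniform over actions. Under that assumption the greedy action should be picked with probability $1 - \epsilon + \epsilon/|A|$, because the random branch can also land on it.

```python
import numpy as np

rng = np.random.default_rng(0)
Q_row = np.array([0.1, 0.9, 0.3, 0.2])  # hypothetical Q-values for one state
epsilon = 0.1
n_actions = len(Q_row)

def sample_action(q_row, epsilon, rng):
    """Epsilon-greedy over a single row of Q-values (stand-alone variant)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # uniform random action
    return int(np.argmax(q_row))              # greedy action

draws = np.array([sample_action(Q_row, epsilon, rng) for _ in range(20000)])
greedy_freq = np.mean(draws == np.argmax(Q_row))
expected = 1 - epsilon + epsilon / n_actions  # 0.925 for epsilon=0.1, 4 actions

print(f"empirical: {greedy_freq:.3f}, expected: {expected:.3f}")
```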

#### Q-Learning Algorithm

**Off-policy learning**: Updates Q-values using the greedy action (max) regardless of the action actually taken by the behavior policy.
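
The difference between the two targets can be seen on a single transition. In this sketch the row `Q_next`, the reward, and the exploratory `next_action` are hypothetical values, not taken from a real run.

```python
import numpy as np

Q_next = np.array([0.2, 0.8, 0.1, 0.4])  # hypothetical Q[s'] row
reward, gamma = 0.0, 0.99
next_action = 3  # hypothetical exploratory action chosen by the behavior policy

# SARSA (on-policy): bootstrap from the action the behavior policy actually takes
sarsa_target = reward + gamma * Q_next[next_action]  # 0.99 * 0.4 = 0.396

# Q-learning (off-policy): bootstrap from the greedy action, whatever is taken next
q_learning_target = reward + gamma * np.max(Q_next)  # 0.99 * 0.8 = 0.792

print(sarsa_target, q_learning_target)
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ only on exploratory steps.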

In [3]:
def q_learning(env, 
               num_episodes=25000, 
               alpha=0.1, 
               gamma=0.99, 
               epsilon=0.1):

    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in range(num_episodes):

        state, _ = env.reset()
        done = False

        while not done:
            # Behavior policy: epsilon-greedy on the current Q-table
            action = epsilon_greedy(Q, state, epsilon, env)

            # Gymnasium returns separate terminated and truncated flags
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Target policy: bootstrap from the greedy (max) action in next_state,
            # and do not bootstrap past a terminal state
            td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])

            state = next_state

    return Q

#### Train Q-Learning Agent

Run Q-learning on the stochastic (slippery) FrozenLake environment and extract the greedy policy from the learned Q-table.

In [4]:
env = gym.make("FrozenLake-v1", is_slippery=True)
Q_qlearning = q_learning(env)

policy = np.argmax(Q_qlearning, axis=-1)
print(policy)

[0 3 3 3 0 0 2 0 3 1 0 0 0 2 1 0]
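
The flat array is easier to read as a grid. The sketch below reshapes the policy printed above into the 4x4 layout of the default FrozenLake map, using FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). The `arrows` lookup is an illustrative helper, not part of Gymnasium.

```python
import numpy as np

# Policy copied from the output above
policy = np.array([0, 3, 3, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])

# Illustrative lookup: index = action id (0=LEFT, 1=DOWN, 2=RIGHT, 3=UP)
arrows = np.array(["←", "↓", "→", "↑"])

grid = arrows[policy].reshape(4, 4)
for row in grid:
    print(" ".join(row))
```

Holes and the goal are terminal, so their Q-values are never updated and `np.argmax` returns action 0 for them; the left arrows shown in those cells carry no meaning.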


#### Evaluate Learned Policy

Estimate the success rate of the greedy policy derived from Q-learning by running it for 500 test episodes, so that it can be compared with the SARSA results.

In [5]:
def test_policy(policy, env, num_episodes=500):
    success_count = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

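        while not done:
            action = policy[state]
            # Gymnasium returns separate terminated and truncated flags
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            if terminated and reward == 1.0:  # Reached the goal
                success_count += 1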

    success_rate = success_count / num_episodes
    print(f"Policy Success Rate: {success_rate * 100:.2f}%")

test_policy(policy, env)

Policy Success Rate: 81.20%
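
Because the environment is stochastic and only 500 episodes are sampled, this figure is an estimate. As a rough guide to its precision, the sketch below computes a normal-approximation 95% interval from the reported rate. Treating the episodes as independent Bernoulli trials is an assumption made here for illustration.

```python
import math

p_hat, n = 0.812, 500  # success rate and episode count reported above

# Standard error of a proportion under a Bernoulli assumption
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se  # approximate 95% margin

ci_low, ci_high = p_hat - margin, p_hat + margin
print(f"{p_hat:.3f} +/- {margin:.3f}  ({ci_low:.3f} to {ci_high:.3f})")
```

Under that assumption, differences of two or three percentage points between runs, or between Q-learning and SARSA, fall within sampling noise at this episode count.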
