# Q-Learning: Off-Policy Temporal Difference Control

Q-Learning is a powerful **model-free**, **off-policy** reinforcement learning algorithm.

It learns the **optimal action-value function** $Q^*(s, a)$: the expected return (the discounted sum of future rewards) from taking action `a` in state `s` and following the optimal policy thereafter.
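
For reference, $Q^*$ satisfies the Bellman optimality equation. This is a standard result, written here in the same notation as the update rule in this notebook, with $a'$ ranging over the actions available in $s_{t+1}$:

$$Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\right]$$

The Q-Learning update can be read as a sampled, incremental step toward satisfying this equation.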

---
### 🔍 Q-Learning Update Rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$$

Where:
- $\alpha$ → Learning rate
- $\gamma$ → Discount factor
- $r_{t+1}$ → Reward received from the environment after taking $a_t$ in $s_t$
- $\max_a Q(s_{t+1}, a)$ → Current estimate of the best Q-value available in the next state (the greedy action's value)
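
To make the rule concrete, here is a single update worked through in plain Python. The numbers are hypothetical and chosen only for illustration; they do not come from the grid example in this notebook.

```python
# Hypothetical numbers, chosen only to illustrate one update step.
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

q_sa = 0.5            # current estimate Q(s_t, a_t)
reward = 1.0          # r_{t+1}
q_next = [0.2, 0.8]   # Q(s_{t+1}, a) for each action a

td_target = reward + gamma * max(q_next)   # r + gamma * max_a Q(s', a)
td_error = td_target - q_sa                # gap between target and current estimate
q_sa_new = q_sa + alpha * td_error         # move a fraction alpha toward the target

print(f"TD target = {td_target:.3f}, TD error = {td_error:.3f}, new Q = {q_sa_new:.3f}")
# prints: TD target = 1.720, TD error = 1.220, new Q = 0.622
```

The estimate moves only 10% of the way toward the target, which is what keeps a single noisy sample from overwriting what has already been learned.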

---
### 🧩 Key Concepts
| Term | Meaning |
|------|----------|
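| **Model-free** | Doesn’t need a model of the environment’s transition or reward dynamics |
| **Off-policy** | Learns the value of the greedy (target) policy while acting with a different, exploratory (behavior) policy |
| **Bootstrapping** | Updates an estimate using other current estimates (here, the next state's Q-values) |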
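
The off-policy row can be illustrated by comparing two bootstrapped targets for the same transition. This is a minimal sketch with hypothetical values; the on-policy target is included only for contrast (it is the form used by SARSA-style methods, which this notebook does not otherwise cover).

```python
gamma = 0.9
reward = 0.0
q_next = [0.3, 0.7]     # hypothetical Q(s', a) for actions 0 and 1
next_action_taken = 0   # suppose exploration picks the non-greedy action in s'

# Off-policy (Q-Learning) target: bootstraps from the greedy action in s',
# regardless of which action the agent actually takes next.
q_learning_target = reward + gamma * max(q_next)

# On-policy target, for contrast: bootstraps from the action actually taken in s'.
on_policy_target = reward + gamma * q_next[next_action_taken]

print(f"Q-Learning target: {q_learning_target:.2f}")   # prints 0.63
print(f"On-policy target:  {on_policy_target:.2f}")    # prints 0.27
```

The two targets differ only when the behavior policy explores, which is exactly the case in which Q-Learning keeps evaluating the greedy policy.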

---
### 🧠 Example: Q-Learning in a Simple Grid Environment

In [None]:
import numpy as np
import random

# Define environment parameters
n_states = 6
actions = [0, 1]  # 0 = left, 1 = right
Q = np.zeros((n_states, len(actions)))
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.2  # Exploration rate

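def step(state, action):
    """Return (next_state, reward) for a 1-D corridor with terminal states 0 and 5."""
    # Terminal states are absorbing: no movement, no reward
    if state == 0 or state == 5:
        return state, 0
    if action == 1:  # Move right
        next_state = state + 1
    else:  # Move left
        next_state = state - 1
    # Reward of 1 only for reaching the goal state 5
    reward = 1 if next_state == 5 else 0
    return next_state, reward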

# Training loop
for episode in range(200):
    state = 2  # Start from the middle
    while state not in [0, 5]:
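        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            # Break ties at random: with Q initialized to zeros, np.argmax
            # always returns action 0 (left), so the goal is rarely reached
            best_actions = np.flatnonzero(Q[state] == Q[state].max())
            action = int(np.random.choice(best_actions))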

        next_state, reward = step(state, action)

        # Q-learning update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        state = next_state

print("Learned Q-Table:")
print(Q)
print("\nDerived Policy (0=Left, 1=Right):")
print(np.argmax(Q, axis=1))

---
### 📈 Visualizing Learning

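The plot below shows the final learned Q-values for each action across states. If training has moved toward the optimal policy, the curve for action 1 (right) should lie above the curve for action 0 (left) in the non-terminal states, with values rising toward the goal at state 5. The terminal states 0 and 5 are never updated, so their Q-values stay at 0.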

In [None]:
import matplotlib.pyplot as plt

# Plot the learned Q-value of each action as a function of state
plt.figure(figsize=(8, 5))
for a in range(len(actions)):
    plt.plot(Q[:, a], label=f'Action {a}')
plt.title('Learned Q-values by Action')
plt.xlabel('State')
plt.ylabel('Q-value')
plt.legend()
plt.grid(True)
plt.show()
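
To see how an estimate changes during training rather than only its final value, one option is to record a snapshot after every episode. The sketch below re-implements the same corridor with the standard library only so that it runs on its own. The names `history`, `START`, and `GOAL`, the fixed seed, the 500-episode budget, and the random tie-breaking are assumptions of this sketch, not part of the example above.

```python
import random

random.seed(0)  # assumption: fixed seed so the run is repeatable

N_STATES, GOAL, START = 6, 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def step(state, action):
    """Corridor dynamics: action 1 moves right, action 0 moves left.
    Only called on non-terminal states, so no terminal check is needed here."""
    next_state = state + 1 if action == 1 else state - 1
    return next_state, (1 if next_state == GOAL else 0)

q = [[0.0, 0.0] for _ in range(N_STATES)]
history = []  # Q(START, right), recorded after every episode

for episode in range(500):
    state = START
    while state not in (0, GOAL):
        if random.random() < EPSILON:
            action = random.choice([0, 1])      # explore
        else:
            best = max(q[state])                # exploit, random tie-break
            action = random.choice([a for a in (0, 1) if q[state][a] == best])
        next_state, reward = step(state, action)
        q[state][action] += ALPHA * (
            reward + GAMMA * max(q[next_state]) - q[state][action]
        )
        state = next_state
    history.append(q[START][1])

print(f"Q(2, right) after episode 1:   {history[0]:.3f}")
print(f"Q(2, right) after episode 500: {history[-1]:.3f}")
```

In this deterministic corridor the optimal value is $Q^*(2, \text{right}) = \gamma^2 \cdot 1 = 0.81$, so the recorded values should approach 0.81 from below. Passing `history` to `plt.plot` gives the corresponding learning curve.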

---
### 🧩 Epsilon-Greedy Exploration
To balance **exploration** and **exploitation**, Q-Learning is commonly paired with an *ε-greedy* behavior policy:
- With probability `ε`, select an action uniformly at random (exploration).
- With probability `1-ε`, select the action with the highest current Q-value (exploitation).
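
The two cases above can be sketched as a small helper. The function name `epsilon_greedy`, its `rng` parameter, and the random tie-breaking are illustrative choices for this sketch rather than a fixed part of the algorithm.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given one state's list of Q-values."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (this can also pick the greedy one)
        return rng.randrange(len(q_values))
    # Exploit: highest Q-value, breaking ties at random
    best = max(q_values)
    ties = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(ties)

# Usage: estimate how often the greedy action is chosen
random.seed(1)
picks = [epsilon_greedy([0.1, 0.5], epsilon=0.2) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"Greedy action chosen {greedy_fraction:.1%} of the time")
```

Because the random branch can also land on the greedy action, the greedy action is selected with probability `1 - ε + ε/|A|`. With two actions and `ε = 0.2`, the printed fraction should be close to 90%.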

---
### ⚙️ Q-Learning Algorithm Summary
1. Initialize Q(s, a) arbitrarily, with Q = 0 for terminal states.
2. For each episode:
   - Initialize state `s`.
   - For each step, until `s` is terminal:
     - Choose `a` in `s` using ε-greedy on Q.
     - Take action `a` → observe `r`, `s'`.
     - Update `Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]`.
     - `s ← s'`.
3. Derive policy: `π(s) = argmax_a Q(s, a)`.
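
The summary maps onto a short generic function. This is a minimal sketch under stated assumptions: the signature of `q_learning`, the `step_fn(state, action) -> (next_state, reward)` convention, and the `corridor_step` helper are hypothetical names introduced here, and states and actions are assumed to be small integer indices.

```python
import random

def q_learning(step_fn, n_states, n_actions, start_state, terminal_states,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize Q (zeros here, which also gives Q = 0 at terminal states)
    q = [[0.0] * n_actions for _ in range(n_states)]
    # Step 2: run episodes
    for _ in range(episodes):
        s = start_state
        while s not in terminal_states:
            # Choose a using epsilon-greedy, breaking ties at random
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(q[s])
                a = rng.choice([i for i in range(n_actions) if q[s][i] == best])
            # Take the action, observe r and s'
            s_next, r = step_fn(s, a)
            # TD update toward r + gamma * max_a' Q(s', a')
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    # Step 3: derive the greedy policy
    policy = [max(range(n_actions), key=lambda i: q[s][i]) for s in range(n_states)]
    return q, policy

def corridor_step(s, a):
    """Hypothetical 6-state corridor: action 1 moves right, reward 1 at state 5."""
    s_next = s + 1 if a == 1 else s - 1
    return s_next, (1 if s_next == 5 else 0)

q, policy = q_learning(corridor_step, n_states=6, n_actions=2,
                       start_state=2, terminal_states={0, 5})
print("Greedy policy per state:", policy)
```

Entries of `policy` for the terminal states 0 and 5 carry no meaning, since their Q-values are never updated.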

---
### 🚀 Advantages
- Works without a model of the environment.
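- In the tabular setting, converges to the optimal action-value function (and hence an optimal greedy policy), provided every state-action pair keeps being visited and the learning rate decays appropriately (e.g., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$).
- Extends to large or continuous state spaces when combined with function approximation (e.g., Deep Q-Networks), although the tabular convergence guarantee no longer applies there.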

---
### 📚 References
- Sutton & Barto, *Reinforcement Learning: An Introduction*
- OpenAI Spinning Up: Q-Learning
- David Silver, *UCL RL Course* (Lecture 6)