In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds (this cell installs nothing)

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("ℹ️ No GPU detected. That is fine: the tabular Q-Learning in this notebook runs on the CPU.")
    print("   (Optional) Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Q-Learning from Scratch: Learning Without a Map

*Part 3 of the Vizuara series on Value Functions and Q-Learning*

*Estimated time: 45 minutes*

## 1. Why Does This Matter?

In the previous notebook, we solved the Bellman equations using value iteration. But there was a catch: we needed complete knowledge of the environment -- the transition dynamics, the reward function, everything.

In the real world, you rarely have this luxury. A robot does not know the exact physics of every surface it might walk on. A game-playing agent does not know the rules until it tries things and sees what happens.

**Q-Learning** solves this problem. It learns the optimal Q-values directly from experience -- by interacting with the environment, observing rewards, and updating its estimates one step at a time. No model required.

By the end of this notebook, you will:
- Implement Q-Learning from scratch
- Train an agent to solve FrozenLake (a classic RL benchmark)
- Understand exploration vs exploitation (epsilon-greedy)
- Visualize how Q-values evolve during training
- Watch your agent go from random stumbling to purposeful navigation

## 2. Building Intuition

Think of learning to navigate a new city without a map. You start by wandering randomly. Sometimes you find a great restaurant (positive reward). Sometimes you end up in a dead end (negative reward).

Over time, you build a mental map: "If I am at the train station and I go left, good things tend to happen." This mental map is your Q-table -- it stores, for every location and every direction, how good that combination has been in the past.

The key insight of Q-Learning is that you update this mental map after every single step, not just at the end of the trip. And crucially, when you update, you always ask: "What is the BEST thing I could do from where I ended up?" -- even if you did not actually do the best thing (maybe you explored instead).

### Think About This

If you always go to your favorite restaurant (exploitation), you might miss discovering an even better one. But if you always try random places (exploration), you waste many meals on bad food. How do you balance the two? This is the exploration-exploitation dilemma, and the epsilon-greedy policy introduced in the next section is a simple, widely used way to handle it.

## 3. The Mathematics

### The Q-Learning Update Rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, \max_{a'} Q(s', a') - Q(s, a) \right]$$

Let us break this down piece by piece:

- $Q(s, a)$: our current estimate of the value of taking action $a$ in state $s$
- $\alpha$: the learning rate -- how much we trust the new information (typically 0.01 to 0.5)
- $r + \gamma \max_{a'} Q(s', a')$: the **TD target** -- what we think the value should be based on what just happened
- $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$: the **TD error** -- the gap between our target and our current estimate

In words: "I just took action $a$ in state $s$, got reward $r$, and landed in state $s'$. The best I can do from $s'$ is $\max_{a'} Q(s', a')$. So a reasonable estimate of $Q(s, a)$ is $r + \gamma \max_{a'} Q(s', a')$. I nudge my old estimate toward this new one by a fraction $\alpha$ of the gap."
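
To make the arithmetic concrete, here is a single update worked through in code. The numbers (a current estimate of 0.5, a best next-state value of 0.8, zero reward) are made up for illustration; only the formula comes from the update rule above.

```python
# One Q-Learning update with illustrative (made-up) numbers.
alpha = 0.1        # learning rate
gamma = 0.99       # discount factor

q_sa = 0.5         # current estimate Q(s, a)
reward = 0.0       # reward r observed after taking a in s
max_q_next = 0.8   # max over a' of Q(s', a')

td_target = reward + gamma * max_q_next   # about 0.792
td_error = td_target - q_sa               # about 0.292
q_sa_new = q_sa + alpha * td_error        # about 0.5292

print(f"TD target: {td_target:.4f}")
print(f"TD error:  {td_error:.4f}")
print(f"New Q:     {q_sa_new:.4f}")
```

The estimate moves only 10% of the way toward the target, so a single lucky or unlucky transition does not overwrite what was learned before.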

### Epsilon-Greedy Policy

$$a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon \end{cases}$$

Start with a high $\epsilon$ (lots of exploration) and decay it over time to shift toward exploitation.
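
A multiplicative decay with a floor is one common way to implement such a schedule, and it is the one the agent in Section 4.1 uses. The sketch below counts how many episodes that schedule needs to reach its floor, using the same values as the agent's defaults (start 1.0, decay 0.995, floor 0.01). The helper name `episodes_until_floor` is made up for this illustration.

```python
def episodes_until_floor(epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Count per-episode decays until epsilon is clamped at epsilon_min."""
    episodes = 0
    while epsilon > epsilon_min:
        epsilon = max(epsilon_min, epsilon * decay)
        episodes += 1
    return episodes

n = episodes_until_floor()
print(f"Epsilon reaches its floor after {n} episodes")  # 919 with these values
```

With 10,000 training episodes, the agent therefore explores heavily for roughly the first 900 episodes and acts almost greedily (1% random actions) for the rest. A decay closer to 1, such as 0.999, would stretch the exploratory phase over several thousand episodes.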

## 4. Let's Build It -- Component by Component

### 4.1 The Q-Learning Agent

In [None]:
import numpy as np
import matplotlib.pyplot as plt

class QLearningAgent:
    """A tabular Q-Learning agent."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha        # Learning rate
        self.gamma = gamma        # Discount factor
        self.epsilon = epsilon    # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

        # Initialize Q-table with zeros
        self.Q = np.zeros((n_states, n_actions))

        # Tracking
        self.td_errors = []

    def choose_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        else:
            return np.argmax(self.Q[state])            # Exploit (ties go to the lowest action index)

    def update(self, state, action, reward, next_state, done):
        """
        Q-Learning update: Q(s,a) <- Q(s,a) + alpha * [TD_error]
        where TD_error = r + gamma * max_a' Q(s',a') - Q(s,a)
        """
        # TD target: what we think Q(s,a) should be
        if done:
            td_target = reward  # No future from terminal state
        else:
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        # TD error: how wrong our current estimate is
        td_error = td_target - self.Q[state, action]

        # Update Q-value
        self.Q[state, action] += self.alpha * td_error

        self.td_errors.append(abs(td_error))

    def decay_epsilon(self):
        """Reduce exploration rate."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


# Create agent
agent = QLearningAgent(n_states=16, n_actions=4)
print(f"Q-table shape: {agent.Q.shape}")
print(f"Initial epsilon: {agent.epsilon}")
print(f"Learning rate: {agent.alpha}")
print(f"Discount factor: {agent.gamma}")

### 4.2 The FrozenLake Environment
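
Before creating the environment, it helps to see what "slippery" means. According to the Gymnasium documentation, with `is_slippery=True` the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The sketch below imitates that rule without Gymnasium; `slip` is a hypothetical helper written for this illustration, not part of the library.

```python
import random

# Action indices as in FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def slip(intended, rng):
    """Return the direction actually moved: the intended one or a perpendicular one."""
    # In the cyclic order Left -> Down -> Right -> Up, neighbours are perpendicular.
    candidates = [(intended - 1) % 4, intended, (intended + 1) % 4]
    return rng.choice(candidates)

rng = random.Random(0)
counts = [0, 0, 0, 0]
for _ in range(9000):
    counts[slip(2, rng)] += 1   # always intend to go Right

for name, count in zip(ACTION_NAMES, counts):
    print(f"{name:>5}: {count}")
```

The agent never moves opposite to its intended direction, so near a hole it can be safer to pick the action that points away from the hole than the one that points toward the goal.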

In [None]:
import gymnasium as gym

# FrozenLake: 4x4 grid, slippery ice
# S = Start, F = Frozen (safe), H = Hole (fall, episode ends), G = Goal
env = gym.make("FrozenLake-v1", is_slippery=True)
env.reset(seed=SEED)  # Seed the environment once (SEED comes from the setup cell)

print("FrozenLake-v1 (4x4, slippery)")
print(f"States: {env.observation_space.n}")
print(f"Actions: {env.action_space.n} (0=Left, 1=Down, 2=Right, 3=Up)")
print()
print("Map:")
print("S F F F")
print("F H F H")
print("F F F H")
print("H F F G")
print()
print("The ice is slippery! The agent may not move in the intended direction.")
print("Goal: reach G. Reward: +1 at goal, 0 everywhere else.")

### 4.3 Training Loop
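
One detail in the loop below deserves a note: the update receives `terminated`, not `done`. Gymnasium reports `terminated` when the episode reaches a true end state (a hole or the goal) and `truncated` when a time limit cuts it short. Only the first case justifies dropping the bootstrap term. The sketch below isolates that logic; `td_target` is a hypothetical helper and the numbers are made up.

```python
def td_target(reward, max_q_next, gamma, terminated):
    """TD target that bootstraps unless the next state is truly terminal."""
    if terminated:
        return reward                      # no future value after a hole or the goal
    return reward + gamma * max_q_next     # time-limit cut-offs still bootstrap

gamma = 0.99
# Reached the goal: the target is just the reward.
print(td_target(1.0, 0.0, gamma, terminated=True))
# Cut off by a time limit in a state whose best action is worth 0.6:
# the future value is kept.
print(td_target(0.0, 0.6, gamma, terminated=False))
```

Passing `done` instead would treat a timed-out state as if it had no future, pulling its Q-value toward zero.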

In [None]:
def train_q_learning(env, agent, n_episodes=10000, verbose_every=1000):
    """Train the Q-Learning agent."""
    rewards_per_episode = []
    epsilon_history = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        steps = 0

        while not done:
            # Choose action
            action = agent.choose_action(state)

            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-Learning update
            agent.update(state, action, reward, next_state, terminated)

            total_reward += reward
            state = next_state
            steps += 1

        # Decay epsilon
        agent.decay_epsilon()

        rewards_per_episode.append(total_reward)
        epsilon_history.append(agent.epsilon)

        if (episode + 1) % verbose_every == 0:
            recent_success = np.mean(rewards_per_episode[-100:])
            print(f"Episode {episode+1:5d} | "
                  f"Success Rate (last 100): {recent_success:.2%} | "
                  f"Epsilon: {agent.epsilon:.4f}")

    return rewards_per_episode, epsilon_history


# Create fresh agent and train
agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01,
)

rewards, epsilons = train_q_learning(env, agent, n_episodes=10000)
print(f"\nFinal success rate (last 100): {np.mean(rewards[-100:]):.2%}")

In [None]:
# Visualization: Training progress
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Moving average of success rate
window = 100
moving_avg = [np.mean(rewards[max(0, i - window + 1):i + 1]) for i in range(len(rewards))]

ax1.plot(moving_avg, color='#2171b5', linewidth=1.5)
ax1.fill_between(range(len(moving_avg)), moving_avg, alpha=0.2, color='#2171b5')
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Success Rate (100-ep avg)', fontsize=12)
ax1.set_title('Q-Learning Training Progress on FrozenLake', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 1)

# Epsilon decay
ax2.plot(epsilons, color='#d94701', linewidth=1.5)
ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Epsilon', fontsize=12)
ax2.set_title('Exploration Rate Decay', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4.4 Visualize the Learned Q-Table
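
FrozenLake numbers its 16 states row by row, so state `s` sits at row `s // 4`, column `s % 4`. The visualization below relies on that mapping. A minimal sketch, using a made-up Q-table stored as plain lists rather than the trained NumPy array, shows how a greedy policy grid falls out of it; `greedy_policy_grid` is a hypothetical helper.

```python
def greedy_policy_grid(Q, n_cols=4):
    """Return rows of arrow characters for the greedy action in each state."""
    arrows = ['<', 'v', '>', '^']    # 0=Left, 1=Down, 2=Right, 3=Up
    grid = []
    for q_row in Q:
        # The first maximum wins on ties, matching np.argmax.
        best = max(range(len(q_row)), key=lambda a: q_row[a])
        grid.append(arrows[best])
    return [grid[i:i + n_cols] for i in range(0, len(grid), n_cols)]

# Made-up Q-table: every state prefers Right, except state 5, which prefers Down.
Q_demo = [[0.0, 0.0, 0.5, 0.0] for _ in range(16)]
Q_demo[5] = [0.0, 0.9, 0.5, 0.0]

for line in greedy_policy_grid(Q_demo):
    print(' '.join(line))

row, col = divmod(5, 4)
print(f"State 5 sits at row {row}, column {col}")
```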

In [None]:
def visualize_q_table(Q, title="Learned Q-Table"):
    """Visualize Q-values for FrozenLake."""
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    ax.set_xlim(-0.5, 3.5)
    ax.set_ylim(3.5, -0.5)
    ax.set_aspect('equal')

    # FrozenLake map
    lake_map = [
        ['S', 'F', 'F', 'F'],
        ['F', 'H', 'F', 'H'],
        ['F', 'F', 'F', 'H'],
        ['H', 'F', 'F', 'G'],
    ]

    for r in range(4):
        for c in range(4):
            state = r * 4 + c
            cell = lake_map[r][c]

            # Background color
            if cell == 'H':
                ax.add_patch(plt.Rectangle((c-0.5, r-0.5), 1, 1, facecolor='#ffcccc', edgecolor='black'))
                ax.text(c, r, 'HOLE', ha='center', va='center', fontsize=9, color='red', fontweight='bold')
            elif cell == 'G':
                ax.add_patch(plt.Rectangle((c-0.5, r-0.5), 1, 1, facecolor='#ccffcc', edgecolor='black'))
                ax.text(c, r, 'GOAL', ha='center', va='center', fontsize=9, color='green', fontweight='bold')
            elif cell == 'S':
                ax.add_patch(plt.Rectangle((c-0.5, r-0.5), 1, 1, facecolor='#cce5ff', edgecolor='black'))
            else:
                ax.add_patch(plt.Rectangle((c-0.5, r-0.5), 1, 1, facecolor='white', edgecolor='black'))

            if cell not in ['H', 'G']:
                # Draw arrow for best action
                best_a = np.argmax(Q[state])
                q_max = Q[state, best_a]
                if q_max > 0:
                    # Arrow direction mapping for FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
                    dx_map = [-0.3, 0, 0.3, 0]
                    dy_map = [0, 0.3, 0, -0.3]
                    ax.annotate('', xy=(c + dx_map[best_a], r + dy_map[best_a]),
                               xytext=(c, r),
                               arrowprops=dict(arrowstyle='->', color='#2171b5', lw=2.5))

                # Show each action's Q-value near the matching cell edge
                for a in range(4):
                    q_val = Q[state, a]
                    if q_val != 0:
                        offset_x = [-0.35, 0, 0.35, 0][a]
                        offset_y = [0, 0.35, 0, -0.35][a]
                        ax.text(c + offset_x, r + offset_y, f'{q_val:.2f}',
                               ha='center', va='center', fontsize=6, color='gray')

    ax.set_xticks(range(4))
    ax.set_yticks(range(4))
    ax.set_xticklabels(['Col 0', 'Col 1', 'Col 2', 'Col 3'])
    ax.set_yticklabels(['Row 0', 'Row 1', 'Row 2', 'Row 3'])
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.grid(True, linewidth=1, color='lightgray')
    plt.tight_layout()
    plt.show()


visualize_q_table(agent.Q, "Learned Q-Table: FrozenLake (Slippery)")

## 5. Your Turn

### TODO: Implement Q-Learning with Different Exploration Strategies

Compare epsilon-greedy with a Boltzmann (softmax) exploration strategy.

In [None]:
class BoltzmannQLearningAgent:
    """Q-Learning agent with Boltzmann (softmax) exploration."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 temperature=1.0, temp_decay=0.995, temp_min=0.01):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.temperature = temperature
        self.temp_decay = temp_decay
        self.temp_min = temp_min
        self.Q = np.zeros((n_states, n_actions))

    def choose_action(self, state):
        """
        Boltzmann exploration: probability of choosing action a is proportional
        to exp(Q(s,a) / temperature).

        Higher temperature -> more random (like high epsilon)
        Lower temperature -> more greedy (like low epsilon)
        """
        # ============ TODO ============
        # Step 1: Compute logits = Q[state] / self.temperature
        # Step 2: For numerical stability, subtract max(logits)
        # Step 3: Compute exp_logits = np.exp(logits)
        # Step 4: Compute probabilities = exp_logits / sum(exp_logits)
        # Step 5: Return np.random.choice(self.n_actions, p=probabilities)
        # ==============================

        return 0  # YOUR CODE HERE

    def update(self, state, action, reward, next_state, done):
        """Same Q-Learning update as before."""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

    def decay_temperature(self):
        self.temperature = max(self.temp_min, self.temperature * self.temp_decay)

In [None]:
# Verification: Train and compare
boltzmann_agent = BoltzmannQLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
)

boltz_rewards = []
for ep in range(10000):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = boltzmann_agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        boltzmann_agent.update(state, action, reward, next_state, terminated)
        total_reward += reward
        state = next_state
    boltzmann_agent.decay_temperature()
    boltz_rewards.append(total_reward)

# Compare
window = 100
eps_avg = [np.mean(rewards[max(0, i - window + 1):i + 1]) for i in range(len(rewards))]
boltz_avg = [np.mean(boltz_rewards[max(0, i - window + 1):i + 1]) for i in range(len(boltz_rewards))]

plt.figure(figsize=(12, 5))
plt.plot(eps_avg, label='Epsilon-Greedy', color='#2171b5', alpha=0.8)
plt.plot(boltz_avg, label='Boltzmann', color='#d94701', alpha=0.8)
plt.xlabel('Episode', fontsize=12)
plt.ylabel('Success Rate (100-ep avg)', fontsize=12)
plt.title('Epsilon-Greedy vs Boltzmann Exploration', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Epsilon-greedy final: {np.mean(rewards[-100:]):.2%}")
print(f"Boltzmann final:      {np.mean(boltz_rewards[-100:]):.2%}")

## 6. Putting It All Together

Let us test the trained agent and watch it navigate FrozenLake.

In [None]:
def test_agent(env, agent, n_episodes=100):
    """Test the trained agent without exploration."""
    successes = 0
    trajectories = []

    for _ in range(n_episodes):
        state, _ = env.reset()
        trajectory = [state]
        done = False

        while not done:
            action = np.argmax(agent.Q[state])  # Greedy
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            trajectory.append(state)

            if reward > 0:
                successes += 1

        trajectories.append(trajectory)

    print(f"Test Results ({n_episodes} episodes):")
    print(f"  Success rate: {successes / n_episodes:.2%}")
    return trajectories


trajectories = test_agent(env, agent, n_episodes=1000)

## 7. Training and Results

Let us study the learning dynamics more carefully.

In [None]:
# TD error evolution
window = 500
td_moving_avg = [np.mean(agent.td_errors[max(0, i - window + 1):i + 1])
                 for i in range(len(agent.td_errors))]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# TD errors
axes[0].plot(td_moving_avg[::100], color='#756bb1', linewidth=1)
axes[0].set_xlabel('Step (x100)', fontsize=11)
axes[0].set_ylabel('Mean |TD Error|', fontsize=11)
axes[0].set_title('TD Error Over Training', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Q-value distribution
q_nonzero = agent.Q[agent.Q != 0]
axes[1].hist(q_nonzero, bins=30, color='#2171b5', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Q-value', fontsize=11)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].set_title('Distribution of Learned Q-values', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# State visit frequency from test trajectories
visit_counts = np.zeros(16)
for traj in trajectories[:100]:
    for s in traj:
        visit_counts[s] += 1

visit_grid = visit_counts.reshape(4, 4)
im = axes[2].imshow(visit_grid, cmap='YlOrRd')
axes[2].set_title('State Visit Frequency (Test)', fontsize=13, fontweight='bold')
for r in range(4):
    for c in range(4):
        axes[2].text(c, r, f'{int(visit_grid[r,c])}', ha='center', va='center',
                    fontsize=11, fontweight='bold')
axes[2].set_xticks(range(4))
axes[2].set_yticks(range(4))
plt.colorbar(im, ax=axes[2], shrink=0.8)

plt.suptitle('Q-Learning Analysis', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Final Output

In [None]:
# Print the learned policy as a visual map
print("=" * 50)
print("  LEARNED POLICY FOR FROZENLAKE")
print("=" * 50)
print()

action_arrows = ['<', 'v', '>', '^']
lake_map = [
    ['S', 'F', 'F', 'F'],
    ['F', 'H', 'F', 'H'],
    ['F', 'F', 'F', 'H'],
    ['H', 'F', 'F', 'G'],
]

for r in range(4):
    row = ""
    for c in range(4):
        state = r * 4 + c
        cell = lake_map[r][c]
        if cell == 'H':
            row += " [XX] "
        elif cell == 'G':
            row += " [GG] "
        else:
            best = np.argmax(agent.Q[state])
            row += f" [{action_arrows[best]} ] "
    print(row)

print()
print(f"Training success rate (last 100 episodes): {np.mean(rewards[-100:]):.1%}")
print("The agent learned to navigate the slippery frozen lake!")
print()

# Final comprehensive plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: Q-table heatmap
im1 = ax1.imshow(agent.Q, cmap='RdYlBu', aspect='auto')
ax1.set_xlabel('Action (0=L, 1=D, 2=R, 3=U)', fontsize=11)
ax1.set_ylabel('State', fontsize=11)
ax1.set_title('Complete Q-Table', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=ax1)

# Right: Training curve
moving_avg = [np.mean(rewards[max(0, i - 99):i + 1]) for i in range(len(rewards))]
ax2.plot(moving_avg, color='#2171b5', linewidth=1.5)
ax2.fill_between(range(len(moving_avg)), moving_avg, alpha=0.15, color='#2171b5')
ax2.set_xlabel('Episode', fontsize=11)
ax2.set_ylabel('Success Rate', fontsize=11)
ax2.set_title('Learning Curve', fontsize=13, fontweight='bold')
ax2.set_ylim(0, 1)
ax2.grid(True, alpha=0.3)

plt.suptitle('Q-Learning on FrozenLake -- Complete Results', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("Congratulations! You have trained a Q-Learning agent from scratch!")

## 9. Reflection and Next Steps

### Reflection Questions
1. Q-Learning is called "off-policy." Why? What is the behavior policy, and what is the target policy?
2. The slippery environment makes FrozenLake hard. How would the success rate change with `is_slippery=False`?
3. What happens if alpha is too large (e.g., 0.9)? What if it is too small (e.g., 0.001)?

### Optional Challenges
1. Implement Double Q-Learning to reduce overestimation bias. Compare with vanilla Q-Learning.
2. Try the 8x8 FrozenLake (`FrozenLake-v1` with `map_name="8x8"`). Does tabular Q-Learning still work? How many episodes does it need?
3. Implement SARSA (on-policy TD learning) and compare its behavior with Q-Learning on slippery FrozenLake.
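
As a starting point for the third challenge, the only change SARSA makes to the update is the bootstrap term: it uses the value of the action actually chosen in the next state instead of the maximum. The sketch below contrasts the two targets on a made-up row of Q-values. The function names are hypothetical, and wiring SARSA into a full training loop is left as the exercise.

```python
def q_learning_target(reward, q_next_row, gamma):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(q_next_row)

def sarsa_target(reward, q_next_row, next_action, gamma):
    """On-policy target: bootstrap from the next action the agent actually takes."""
    return reward + gamma * q_next_row[next_action]

gamma = 0.99
q_next_row = [0.2, 0.7, 0.1, 0.4]   # made-up Q(s', .) for actions 0..3
explored_action = 2                  # suppose epsilon-greedy explored here

print(q_learning_target(0.0, q_next_row, gamma))
print(sarsa_target(0.0, q_next_row, explored_action, gamma))
```

When the next action is the greedy one, the two targets coincide; they differ only on exploratory steps.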