# Q-Learning Algorithm

Welcome to the Q-Learning assignment! In this notebook, you'll implement one of the most fundamental algorithms in Reinforcement Learning. By the end of this assignment, you'll be able to:

* Understand the Q-Learning algorithm and its core update rule
* Implement the epsilon-greedy exploration strategy
* Build a complete Q-Learning agent from scratch
* Train and evaluate your agent on classic RL environments
* Visualize the learning progress

Q-Learning is an **off-policy** temporal difference (TD) control algorithm that learns the optimal action-value function $Q^*(s,a)$ directly, without requiring a model of the environment.

## The Q-Learning Update Rule

The core of Q-Learning is its update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

Where:
- $s$ = current state
- $a$ = action taken
- $r$ = reward received
- $s'$ = next state
- $\alpha$ = learning rate (how much to update)
- $\gamma$ = discount factor (how much to value future rewards)
- $\max_{a'} Q(s',a')$ = maximum Q-value for the next state

The term in brackets is called the **TD error** (temporal difference error).

<img src="https://miro.medium.com/max/1400/1*QeoQEqWYYPs1P8yUwyaJVQ.png" style="width:600px;height:300px;">

Let's get started!

## Important Note on Submission to the AutoGrader

Before submitting your assignment to the AutoGrader, please make sure that:

1. You have not added any _extra_ `print` statement(s) in the assignment.
2. You have not added any _extra_ code cell(s) in the assignment.
3. You have not changed any of the function parameters.
4. You are not using any global variables inside your graded exercises. Unless specifically instructed otherwise, use local variables instead.
5. You are not changing the assignment code where it is not required, like creating _extra_ variables.

## Table of Contents
- [1 - Packages](#1)
- [2 - Q-Table Initialization](#2)
    - [Exercise 1 - initialize_q_table](#ex-1)
- [3 - Epsilon-Greedy Policy](#3)
    - [Exercise 2 - epsilon_greedy_action](#ex-2)
- [4 - Q-Learning Update](#4)
    - [Exercise 3 - q_learning_update](#ex-3)
- [5 - Training Loop](#5)
    - [Exercise 4 - train_q_learning](#ex-4)
- [6 - Testing the Complete Agent](#6)
    - [6.1 - Train on FrozenLake](#6-1)
    - [6.2 - Visualize Results](#6-2)
    - [6.3 - Evaluate Agent](#6-3)

<a name='1'></a>
## 1 - Packages

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import defaultdict
from q_learning_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Set random seed for reproducibility
np.random.seed(42)

<a name='2'></a>
## 2 - Q-Table Initialization

The Q-table stores the expected cumulative reward for each state-action pair. We don't know these values before training, so we start every entry at a chosen initial value and refine it from experience.

There are several initialization strategies:
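- **Zeros**: Simple and conservative; can be slow to explore because every action initially looks equally good
- **Random small values**: Breaks ties between actions, so the initial greedy choice varies from state to state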
- **Optimistic initialization**: Initialize with high values to encourage exploration
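
One concrete reason a zero-initialized table can be slow to explore: `np.argmax` returns the first index when several entries tie, so a purely greedy choice on an untrained row always picks action 0. The sketch below shows this, together with one possible way to break ties at random. `argmax_random_tie` is a hypothetical helper written for this illustration; it is not part of the assignment or the grader.

```python
import numpy as np

def argmax_random_tie(q_row, rng):
    """Return the index of a maximal entry, choosing uniformly among ties.

    Hypothetical helper for illustration; the graded exercises do not need it.
    """
    best = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_row = np.zeros(4)  # Q-values of one state in a zero-initialized table

# np.argmax always reports the first maximal index when entries tie.
first_choice = int(np.argmax(q_row))

# Random tie-breaking spreads greedy choices across all tied actions.
tie_broken = {argmax_random_tie(q_row, rng) for _ in range(200)}

print(first_choice)
print(sorted(tie_broken))
```

With plain `np.argmax`, early exploration comes only from the random branch of the ε-greedy policy introduced in Section 3.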

<a name='ex-1'></a>
### Exercise 1 - initialize_q_table

Implement a function that initializes the Q-table. For discrete environments, use a 2D numpy array of shape `(n_states, n_actions)`. For environments with many states, you might want to use a dictionary (defaultdict).
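
For reference, the dictionary alternative mentioned above might look like the following. This is a minimal sketch, separate from the graded function (which should return a NumPy array); the names `Q_dict` and `state` and the tuple-valued state are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

n_actions = 4  # assumed action count, matching the 4-action example in this notebook

# Each unseen state gets a fresh zero vector the first time it is accessed.
Q_dict = defaultdict(lambda: np.zeros(n_actions))

state = (3, 7)            # any hashable state works, e.g. a tuple
Q_dict[state][2] += 0.5   # update one action value in place

print(len(Q_dict))        # only visited states are stored
print(Q_dict[state])
```

The trade-off is that memory grows with the number of states actually visited rather than with the size of the full state space.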

In [None]:
# GRADED FUNCTION: initialize_q_table

def initialize_q_table(n_states, n_actions, init_value=0.0):
    """
    Initialize Q-table with given value.
    
    Arguments:
    n_states -- number of states in the environment
    n_actions -- number of actions in the environment
    init_value -- initial value for Q-table entries (default: 0.0)
    
    Returns:
    Q -- numpy array of shape (n_states, n_actions) initialized with init_value
    """
    # (approx. 1 line)
    # Q = 
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return Q

In [None]:
# Test your implementation
Q = initialize_q_table(16, 4, init_value=0.0)
print("Q-table shape:", Q.shape)
print("Q-table sample:\n", Q[:4, :])

# Run the grader
initialize_q_table_test(initialize_q_table)

**Expected output:**
```
Q-table shape: (16, 4)
Q-table sample:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
```

<a name='3'></a>
## 3 - Epsilon-Greedy Policy

The **exploration-exploitation dilemma** is fundamental in RL:
- **Exploitation**: Use current knowledge to maximize reward
- **Exploration**: Try new actions to discover better strategies

The **ε-greedy policy** balances these:
- With probability $\epsilon$: choose a random action (explore)
- With probability $1-\epsilon$: choose the best known action (exploit)

$$
a = \begin{cases}
\arg\max_{a'} Q(s, a') & \text{with probability } 1-\epsilon \\
\text{random action} & \text{with probability } \epsilon
\end{cases}
$$

Typically, we start with a high $\epsilon$ (e.g., 1.0) and decay it over time.
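
To get a feel for how fast a multiplicative schedule shrinks $\epsilon$, the sketch below counts how many episodes it takes to reach the floor. It assumes the rule `epsilon = max(epsilon_min, epsilon * epsilon_decay)` and the default values used later in this notebook; `episodes_until_floor` is a hypothetical helper name.

```python
def episodes_until_floor(epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Count episodes of multiplicative decay until epsilon reaches epsilon_min."""
    episodes = 0
    while epsilon > epsilon_min:
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        episodes += 1
    return episodes

n = episodes_until_floor()
print(n)
```

With these defaults the floor is reached after 919 episodes, so in a 2000-episode run the agent explores at the minimum rate for a little over half of training.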

<a name='ex-2'></a>
### Exercise 2 - epsilon_greedy_action

Implement the epsilon-greedy action selection.

In [None]:
# GRADED FUNCTION: epsilon_greedy_action

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """
    Select action using epsilon-greedy policy.
    
    Arguments:
    Q -- Q-table, numpy array of shape (n_states, n_actions)
    state -- current state (integer)
    n_actions -- number of possible actions
    epsilon -- exploration rate (0 to 1)
    
    Returns:
    action -- selected action (integer)
    """
    # (approx. 4-5 lines)
    # With probability epsilon, choose random action
    # Otherwise, choose the action with highest Q-value for current state
    # Hint: use np.random.random() to generate random number in [0,1)
    # Hint: use np.argmax() to find action with highest Q-value
    
    # YOUR CODE STARTS HERE
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return action

In [None]:
# Test your implementation
np.random.seed(42)
Q_test = np.array([[0.1, 0.5, 0.2, 0.3],
                   [0.4, 0.1, 0.6, 0.2]])

# Test with epsilon=0 (always exploit)
action = epsilon_greedy_action(Q_test, state=0, n_actions=4, epsilon=0.0)
print(f"Epsilon=0.0, State=0, Action selected: {action} (should be 1, highest Q-value)")

# Test with epsilon=1 (always explore)
actions = [epsilon_greedy_action(Q_test, state=0, n_actions=4, epsilon=1.0) for _ in range(100)]
print(f"Epsilon=1.0: Actions are random: {len(set(actions)) > 1}")

# Run the grader
epsilon_greedy_action_test(epsilon_greedy_action)

<a name='4'></a>
## 4 - Q-Learning Update

Now we implement the core Q-Learning update rule. After taking action $a$ in state $s$ and observing reward $r$ and next state $s'$, we update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

Let's break this down:
1. **Current estimate**: $Q(s,a)$
2. **TD target**: $r + \gamma \max_{a'} Q(s',a')$ (immediate reward + discounted best future value)
3. **TD error**: $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$ (difference between target and current)
4. **Update**: Move $Q(s,a)$ toward the target by a fraction $\alpha$

**Special case**: If $s'$ is a terminal state (the episode has ended), there is no future value to bootstrap from, so the $\gamma \max_{a'} Q(s',a')$ term is dropped:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r - Q(s,a)\right]$$

<a name='ex-3'></a>
### Exercise 3 - q_learning_update

Implement the Q-Learning update rule.

In [None]:
# GRADED FUNCTION: q_learning_update

def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
    """
    Update Q-table using Q-Learning rule.
    
    Arguments:
    Q -- Q-table, numpy array of shape (n_states, n_actions)
    state -- current state
    action -- action taken
    reward -- reward received
    next_state -- next state after taking action
    done -- boolean, True if next_state is terminal
    alpha -- learning rate
    gamma -- discount factor
    
    Returns:
    Q -- updated Q-table
    td_error -- TD error (for tracking learning progress)
    """
    # (approx. 5-7 lines)
    # Step 1: Get current Q-value
    # Step 2: Calculate TD target
    #         If done: target = reward
    #         Else: target = reward + gamma * max Q-value of next_state
    # Step 3: Calculate TD error = target - current Q-value
    # Step 4: Update Q-value: Q[state, action] += alpha * td_error
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return Q, td_error

In [None]:
# Test your implementation
Q_test = np.zeros((4, 2))
Q_test[1] = [0.5, 0.3]  # Q-values for next_state

# Test non-terminal update
Q_updated, td_error = q_learning_update(
    Q_test.copy(), state=0, action=0, reward=1.0, 
    next_state=1, done=False, alpha=0.1, gamma=0.9
)
print(f"Non-terminal update:")
print(f"  Q[0,0] before: 0.0")
print(f"  Q[0,0] after: {Q_updated[0, 0]:.4f}")
print(f"  TD error: {td_error:.4f}")

# Test terminal update
Q_updated, td_error = q_learning_update(
    Q_test.copy(), state=0, action=0, reward=1.0,
    next_state=1, done=True, alpha=0.1, gamma=0.9
)
print(f"\nTerminal update:")
print(f"  Q[0,0] after: {Q_updated[0, 0]:.4f}")
print(f"  TD error: {td_error:.4f}")

# Run the grader
q_learning_update_test(q_learning_update)

**Expected output (approximately):**
```
Non-terminal update:
  Q[0,0] before: 0.0
  Q[0,0] after: 0.1450
  TD error: 1.4500

Terminal update:
  Q[0,0] after: 0.1000
  TD error: 1.0000
```
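
These values can be checked by hand. In the test cell, $Q(s,a) = 0$, $Q(s',\cdot) = [0.5,\ 0.3]$, $r = 1.0$, $\alpha = 0.1$ and $\gamma = 0.9$, so:

$$
\begin{aligned}
\text{Non-terminal:}\quad & \text{TD target} = 1.0 + 0.9 \cdot \max(0.5,\ 0.3) = 1.45 \\
& \text{TD error} = 1.45 - 0 = 1.45 \\
& Q(s,a) \leftarrow 0 + 0.1 \cdot 1.45 = 0.145 \\[4pt]
\text{Terminal:}\quad & \text{TD target} = 1.0 \\
& \text{TD error} = 1.0 - 0 = 1.0 \\
& Q(s,a) \leftarrow 0 + 0.1 \cdot 1.0 = 0.1
\end{aligned}
$$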

<a name='5'></a>
## 5 - Training Loop

Now let's put it all together! The training loop follows this structure:

```
For each episode:
    Initialize state s
    For each step in episode:
        Choose action a using ε-greedy policy
        Take action a, observe reward r and next state s'
        Update Q(s,a) using Q-Learning rule
        s ← s'
    Decay ε
```
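
In Gymnasium, `env.reset()` returns `(observation, info)` and `env.step(action)` returns `(observation, reward, terminated, truncated, info)`. The sketch below shows the shape of the interaction loop with a purely random policy. To stay self-contained it uses `LineWorld`, a hypothetical stand-in class invented for this illustration (not a Gymnasium environment) that mimics the same call shape. Action selection and the Q update are deliberately left out, since those are the exercise.

```python
import random

class LineWorld:
    """Hypothetical 1-D corridor: states 0..4, start at 0, goal at 4.

    Mimics the call shape of Gymnasium's reset()/step() for illustration only.
    """
    n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        terminated = self.state == self.n_states - 1  # reached the goal
        reward = 1.0 if terminated else 0.0
        truncated = False  # this toy environment has no built-in time limit
        return self.state, reward, terminated, truncated, {}

def run_random_episode(env, max_steps, rng):
    """Roll out one episode with uniformly random actions (max_steps >= 1)."""
    state, _ = env.reset()
    episode_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.randrange(env.n_actions)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        steps += 1
        state = next_state
        if terminated or truncated:
            break
    return episode_reward, steps, terminated

rng = random.Random(0)
reward, steps, reached_goal = run_random_episode(LineWorld(), max_steps=100, rng=rng)
print(reward, steps, reached_goal)
```

Note the difference between the two ways an episode can stop: if it ends only because the step cap was hit, the last state is not terminal, so the bootstrap term $\gamma \max_{a'} Q(s',a')$ still applies to that final update.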

<a name='ex-4'></a>
### Exercise 4 - train_q_learning

Implement the complete Q-Learning training loop.

In [None]:
# GRADED FUNCTION: train_q_learning

def train_q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, 
                     epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                     max_steps=100):
    """
    Train Q-Learning agent.
    
    Arguments:
    env -- Gymnasium environment
    n_episodes -- number of episodes to train
    alpha -- learning rate
    gamma -- discount factor
    epsilon -- initial exploration rate
    epsilon_decay -- decay rate for epsilon after each episode
    epsilon_min -- minimum value for epsilon
    max_steps -- maximum steps per episode
    
    Returns:
    Q -- trained Q-table
    rewards_history -- list of total rewards per episode
    epsilon_history -- list of epsilon values per episode
    """
    # Initialize Q-table
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = initialize_q_table(n_states, n_actions)
    
    rewards_history = []
    epsilon_history = []
    
    # Training loop (approx. 20-25 lines)
    # For each episode:
    #   1. Reset environment to get initial state (env.reset() returns (state, info))
    #   2. Initialize episode_reward = 0
    #   3. For each step (up to max_steps):
    #      a. Select action using epsilon_greedy_action
    #      b. Take action in environment
    #         (env.step returns next_state, reward, terminated, truncated, info)
    #      c. Update Q-table using q_learning_update
    #      d. Add reward to episode_reward
    #      e. Update state
    #      f. If the episode ended (terminated or truncated), break
    #   4. Decay epsilon: epsilon = max(epsilon_min, epsilon * epsilon_decay)
    #   5. Store episode_reward and epsilon in histories
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return Q, rewards_history, epsilon_history

<a name='6'></a>
## 6 - Testing the Complete Agent

Now let's test your complete Q-Learning implementation on the FrozenLake environment!

**FrozenLake Environment:**
- 4x4 grid world
- Start at top-left (S)
- Goal: reach bottom-right (G)
- Holes (H) cause failure
- Frozen tiles (F) are safe

```
SFFF
FHFH
FFFH
HFFG
```
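
FrozenLake reports the agent's position as a single integer, `row * n_cols + col`, so the 4x4 grid has states 0 to 15. The sketch below converts between the two views using the map above; `GRID`, `state_to_cell` and `tile_at` are hypothetical names used only for this illustration.

```python
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = len(GRID[0])

def state_to_cell(state):
    """Convert a state index into (row, col) grid coordinates."""
    return divmod(state, N_COLS)

def tile_at(state):
    """Return the map letter (S, F, H or G) for a state index."""
    row, col = state_to_cell(state)
    return GRID[row][col]

holes = [s for s in range(16) if tile_at(s) == "H"]

print(state_to_cell(5), tile_at(5))   # (1, 1) H
print(holes)                          # [5, 7, 11, 12]
print(tile_at(0), tile_at(15))        # S G
```

This mapping is handy when reading the learned Q-table: row `Q[5]`, for example, belongs to a hole and is never updated, because episodes end on entering it.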

<a name='6-1'></a>
### 6.1 - Train on FrozenLake

In [None]:
# Create environment (is_slippery=False makes it deterministic)
env = gym.make('FrozenLake-v1', is_slippery=False)

print("Training Q-Learning agent on FrozenLake...")
print(f"States: {env.observation_space.n}")
print(f"Actions: {env.action_space.n}")
print()

# Train the agent
Q, rewards_history, epsilon_history = train_q_learning(
    env,
    n_episodes=2000,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01,
    max_steps=100
)

print("Training completed!")
print(f"Final epsilon: {epsilon_history[-1]:.4f}")
print(f"Average reward (last 100 episodes): {np.mean(rewards_history[-100:]):.4f}")

<a name='6-2'></a>
### 6.2 - Visualize Results

In [None]:
# Plot training progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Rewards
ax1.plot(rewards_history, alpha=0.3, label='Episode reward')
window = 50
if len(rewards_history) >= window:
    moving_avg = np.convolve(rewards_history, np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(rewards_history)), moving_avg, 
             label=f'Moving average ({window} episodes)', linewidth=2)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Training Progress')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Epsilon decay
ax2.plot(epsilon_history)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Epsilon (ε)')
ax2.set_title('Exploration Rate Decay')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Visualize learned Q-values
print("\nLearned Q-table (first 8 states):")
print(Q[:8])
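
A raw Q-table is hard to read at a glance. One optional way to inspect it is to draw the greedy action for each cell on the grid. The sketch below assumes FrozenLake's action order (0 = left, 1 = down, 2 = right, 3 = up); `render_greedy_policy` is a hypothetical helper, and `Q_demo` is a hand-made table for illustration, not a training result.

```python
import numpy as np

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = "<v>^"  # FrozenLake action order: 0 = left, 1 = down, 2 = right, 3 = up

def render_greedy_policy(Q, grid=GRID):
    """Return one string per grid row showing the greedy action in each cell."""
    n_cols = len(grid[0])
    lines = []
    for r, row in enumerate(grid):
        cells = []
        for c, tile in enumerate(row):
            q_row = Q[r * n_cols + c]
            if tile in "HG":
                cells.append(tile)   # holes and the goal end the episode
            elif np.all(q_row == q_row[0]):
                cells.append(".")    # all actions tied: no preference learned
            else:
                cells.append(ARROWS[int(np.argmax(q_row))])
        lines.append("".join(cells))
    return lines

# Hand-made Q-table (not a training result) that prefers one safe path.
Q_demo = np.zeros((16, 4))
for s, a in [(0, 1), (4, 1), (8, 2), (9, 1), (13, 2), (14, 2)]:
    Q_demo[s, a] = 1.0

policy_lines = render_greedy_policy(Q_demo)
print("\n".join(policy_lines))
# v...
# vH.H
# >v.H
# H>>G
```

Calling `render_greedy_policy(Q)` on your trained table shows whether the greedy policy traces a hole-free route from S to G.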

<a name='6-3'></a>
### 6.3 - Evaluate Agent

In [None]:
def evaluate_agent(env, Q, n_episodes=100):
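    """
    Evaluate trained agent greedily (no exploration).
    
    Arguments:
    env -- Gymnasium environment
    Q -- trained Q-table, numpy array of shape (n_states, n_actions)
    n_episodes -- number of evaluation episodes
    
    Returns:
    mean and standard deviation of the total reward per episode
    """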
    total_rewards = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        done = False
        
        while not done:
            # Always exploit (epsilon=0)
            action = np.argmax(Q[state])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
            state = next_state
        
        total_rewards.append(episode_reward)
    
    return np.mean(total_rewards), np.std(total_rewards)

# Evaluate
mean_reward, std_reward = evaluate_agent(env, Q, n_episodes=100)
print(f"\nEvaluation over 100 episodes:")
print(f"Average reward: {mean_reward:.4f} ± {std_reward:.4f}")
print(f"Success rate: {mean_reward * 100:.1f}%")

env.close()

## Congratulations!

You've successfully implemented Q-Learning from scratch! Here's what you've learned:

✅ How to initialize and maintain a Q-table

✅ How to balance exploration and exploitation with ε-greedy

✅ How to apply the Q-Learning update rule

✅ How to train a complete RL agent

✅ How to evaluate and visualize learning progress

### Key Takeaways:

1. **Q-Learning is off-policy**: It learns about the optimal (greedy) policy while following a different exploration policy, provided that policy keeps visiting every state-action pair
2. **Exploration is crucial**: Without proper exploration (ε-greedy), the agent might not find the optimal policy
3. **Hyperparameters matter**: α (learning rate), γ (discount), and ε (exploration) significantly affect learning
4. **Convergence takes time**: Q-Learning needs many episodes to converge, especially with high exploration

### Next Steps:

- Try Q-Learning on other environments (Taxi, CliffWalking, or CartPole with a discretized state space)
- Experiment with different hyperparameters
- Compare Q-Learning with SARSA (on-policy alternative)
- Learn about Deep Q-Networks (DQN) for continuous state spaces
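
As a starting point for the SARSA comparison, the two algorithms differ only in the TD target. The sketch below covers the non-terminal case and reuses the numbers from the Exercise 3 test cell; `q_learning_target`, `sarsa_target` and `Q_demo` are hypothetical names for illustration, not part of the graded code.

```python
import numpy as np

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in next_state."""
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward + gamma * Q[next_state, next_action]

Q_demo = np.zeros((4, 2))
Q_demo[1] = [0.5, 0.3]  # Q-values for next_state

ql = q_learning_target(Q_demo, reward=1.0, next_state=1, gamma=0.9)
sa = sarsa_target(Q_demo, reward=1.0, next_state=1, next_action=1, gamma=0.9)
print(f"Q-Learning target: {ql:.2f}")  # 1.45
print(f"SARSA target:      {sa:.2f}")  # 1.27
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps, which is where the on-policy and off-policy views diverge.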