# **Q-Learning Model Theory**


## Q-learning

---

## Theory
Q-learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for an agent interacting with an environment. It is designed to learn the value of taking a specific action in a specific state, which is represented by a Q-value. Q-learning uses a Q-table to store the Q-values for state-action pairs and iteratively updates these values based on the agent's experiences.

The main idea is to:
- Initialize a Q-table with arbitrary values.
- Use an exploration strategy to interact with the environment.
- Update the Q-values based on the received rewards.
- Gradually converge to the optimal Q-values that represent the best action-selection policy.

---

## Mathematical Foundation
- **Q-value**:
  The Q-value \( Q(s, a) \) represents the expected cumulative discounted reward of taking action \( a \) in state \( s \) and following the optimal policy thereafter.

- **Bellman Equation**:
  After each step, the Q-value is updated with a temporal-difference rule derived from the Bellman optimality equation:
  $$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) $$
  where:
  - \( s \) is the current state.
  - \( a \) is the current action.
  - \( r \) is the reward received after taking action \( a \).
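  - \( s' \) is the next state, and \( a' \) ranges over the actions available in \( s' \).
  - \( \alpha \) is the learning rate (0 < \( \alpha \) ≤ 1).
  - \( \gamma \) is the discount factor (0 ≤ \( \gamma \) ≤ 1; use \( \gamma < 1 \) for tasks that never terminate so the return stays finite).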

- **Q-table**:
  A table that stores the Q-values for all state-action pairs. The table is updated iteratively based on the agent's experiences.

- **Exploration vs. Exploitation**:
  The agent balances exploration (trying new actions) and exploitation (using the best-known actions) to learn the optimal policy:
  - **Exploration**: Choose a random action with probability \( \epsilon \).
  - **Exploitation**: Choose the action with the highest Q-value with probability \( 1 - \epsilon \).
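
A minimal sketch of the two ingredients above, \( \epsilon \)-greedy selection and a single Q-value update, using plain Python lists as the Q-table. The function names `epsilon_greedy` and `td_update` and the example numbers are illustrative assumptions, not part of any library.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit (first max on ties)

def td_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update to q[s][a] in place and return the new value."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]

# Worked example with 2 states and 2 actions (values chosen arbitrarily).
q = [[0.0, 0.0], [1.0, 2.0]]
new_value = td_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(round(new_value, 6))
```

With \( \alpha = 0.5 \), the estimate moves halfway from its old value (0) to the target \( 1 + 0.9 \times 2 = 2.8 \), giving 1.4.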

---

## Algorithm Steps
1. **Initialize Q-table**:
   - Initialize the Q-table with arbitrary values for all state-action pairs.

2. **Repeat for each episode**:
   - Initialize the starting state.

3. **Repeat for each step of the episode**:
   - Choose an action \( a \) based on the exploration strategy (e.g., \( \epsilon \)-greedy).
   - Take action \( a \), observe the reward \( r \) and the next state \( s' \).
   - Update the Q-value using the Bellman equation.
   - Update the current state to the next state \( s' \).

4. **End of Episode**:
   - When the episode ends (e.g., the agent reaches a terminal state or the step limit), reset the environment and start the next episode from step 2.
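
The steps above can be sketched end to end on a toy problem. The five-cell corridor environment, the `step`, `greedy`, and `train` helpers, and the hyperparameter values are assumptions chosen to keep the example small. Ties between equal Q-values are broken at random so the untrained agent does not get stuck on one action.

```python
import random

N_STATES, N_ACTIONS = 5, 2     # corridor cells 0..4; action 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Toy environment (assumed): reward 1 for reaching the right end, else 0."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    terminated = next_state == GOAL
    return next_state, (1.0 if terminated else 0.0), terminated

def greedy(q_row, rng):
    """Best action for one state, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def train(num_episodes=500, max_steps=50, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # 1. initialize Q-table
    for _episode in range(num_episodes):                    # 2. for each episode
        state = 0                                           #    starting state
        for _step in range(max_steps):                      # 3. for each step
            if rng.random() < epsilon:                      #    epsilon-greedy choice
                action = rng.randrange(N_ACTIONS)
            else:
                action = greedy(q[state], rng)
            next_state, reward, terminated = step(state, action)
            # A terminal state has no future value, so the bootstrap term is dropped.
            target = reward if terminated else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if terminated:                                  # 4. end of episode
                break
    return q

q = train()
greedy_policy = [row.index(max(row)) for row in q[:GOAL]]
print(greedy_policy)
```

After training, the greedy policy should choose action 1 (right) in every non-terminal cell.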

---

## Key Parameters
- **Learning Rate (\(\alpha\))**: Determines the extent to which new information overrides old information.
- **Discount Factor (\(\gamma\))**: Determines the importance of future rewards.
- **Exploration Rate (\(\epsilon\))**: Determines the probability of choosing a random action.

---

## Advantages
- Simple and easy to implement.
- Model-free: it does not require knowledge of the environment's dynamics.
- Can handle stochastic environments.

---

## Disadvantages
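- Can be slow to converge for large state-action spaces, because the Q-table needs one entry per state-action pair and each entry must be visited many times.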
- Requires careful tuning of hyperparameters.
- May not perform well in environments with sparse rewards.

---

## Implementation Tips
- Use **exploration strategies** like \( \epsilon \)-greedy, decayed \( \epsilon \)-greedy, or softmax to balance exploration and exploitation.
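- Scale or clip rewards to a consistent range to help keep learning stable.
- Consider **experience replay** (storing past transitions and replaying them for extra updates) to improve sample efficiency; it is most common when Q-values are approximated by a neural network, as in deep Q-networks.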
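
As one example of the strategies listed above, softmax (Boltzmann) exploration samples actions with probability proportional to \( e^{Q(s,a)/\tau} \). This is a sketch; the `temperature` parameter \( \tau \) and the function names are assumptions for illustration.

```python
import math
import random

def softmax_probs(q_row, temperature=1.0):
    """Boltzmann action probabilities; a lower temperature gives greedier choices."""
    m = max(q_row)                                   # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_row, temperature=1.0, rng=random):
    """Sample one action index according to the softmax probabilities."""
    return rng.choices(range(len(q_row)), weights=softmax_probs(q_row, temperature))[0]

q_row = [1.0, 2.0, 3.0]
print(softmax_probs(q_row, temperature=1.0))   # spread across all actions
print(softmax_probs(q_row, temperature=0.1))   # concentrated on the best action
```

A high temperature behaves like near-uniform exploration, while a temperature close to zero approaches purely greedy selection.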

---

## Applications
- Game playing (e.g., grid worlds and simple board games; Atari games when combined with deep Q-networks)
- Robotics (e.g., navigation, manipulation)
- Autonomous driving
- Recommendation systems

Q-learning is a powerful reinforcement learning algorithm that helps agents learn optimal policies through interactions with the environment. Its simplicity and effectiveness make it a popular choice for various applications.


## Model Evaluation for Q-learning

### 1. Q-values

**Description:**
- Evaluate the learned Q-values to ensure they converge to the optimal values.

**Interpretation:**
- Higher Q-values indicate better expected rewards for the corresponding state-action pairs.
- Compare Q-values across different episodes to observe convergence.

---

### 2. Cumulative Reward

**Formula:**
$$
\text{Cumulative Reward} = \sum_{t=0}^{T-1} r_t
$$
where:
- \( r_t \) is the reward received at time step \( t \).
- \( T \) is the total number of time steps.

**Description:**
- Measures the total reward accumulated by the agent over an episode.

**Interpretation:**
- Higher cumulative rewards indicate better performance.
- Plot cumulative rewards across episodes to observe learning progress.
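
A small sketch of how per-episode cumulative rewards might be computed and smoothed before plotting. The helper names, the window size, and the reward sequences are made up for illustration; a trailing moving average is one common smoothing choice, not the only one.

```python
def episode_return(rewards):
    """Undiscounted cumulative reward of one episode."""
    return sum(rewards)

def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out

# Three hypothetical episodes, each a list of per-step rewards.
episodes = [[-1, -1, 20], [-1, 20], [-1, -1, -1, -10]]
returns = [episode_return(r) for r in episodes]
smoothed = moving_average(returns, window=2)
print(returns, smoothed)
```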

---

### 3. Convergence

**Description:**
- Evaluate the convergence of the Q-values and cumulative rewards over time.

**Interpretation:**
- Convergence indicates that the agent has learned a stable policy.
- Plot Q-values and cumulative rewards across episodes to check for convergence.
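
One simple convergence signal is the largest absolute change in the Q-table between two snapshots, for example taken a fixed number of episodes apart. The sketch below assumes list-of-lists Q-tables; the function names and the tolerance of `1e-4` are illustrative assumptions.

```python
def max_abs_change(q_old, q_new):
    """Largest absolute difference between two Q-tables of the same shape."""
    return max(
        abs(new - old)
        for row_old, row_new in zip(q_old, q_new)
        for old, new in zip(row_old, row_new)
    )

def has_converged(q_old, q_new, tol=1e-4):
    """True when no entry moved by tol or more between the two snapshots."""
    return max_abs_change(q_old, q_new) < tol

# Hypothetical snapshots; take real ones with copy.deepcopy(q_table).
q_before = [[0.0, 0.5], [1.0, 2.0]]
q_after = [[0.0, 0.50004], [1.0, 2.0]]
print(max_abs_change(q_before, q_after), has_converged(q_before, q_after))
```

A small change can also result from a small learning rate or from states that are rarely visited, so this check is best read alongside the reward curve.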

---

### 4. Policy Evaluation

**Description:**
- Evaluate the learned policy by observing the actions taken by the agent.

**Interpretation:**
- A good policy should result in optimal actions being taken in each state.
- Compare the learned policy with the optimal policy (if known) to assess performance.
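
A sketch of extracting the greedy policy from a Q-table and measuring how often it matches a reference policy. The Q-values and the reference policy here are invented for the example.

```python
def greedy_policy(q):
    """Index of the highest-valued action in each state (first one on ties)."""
    return [row.index(max(row)) for row in q]

def policy_agreement(policy, reference):
    """Fraction of states in which the two policies pick the same action."""
    matches = sum(p == r for p, r in zip(policy, reference))
    return matches / len(reference)

q = [[0.1, 0.7], [0.4, 0.2], [0.0, 0.3]]   # 3 states, 2 actions
policy = greedy_policy(q)
agreement = policy_agreement(policy, reference=[1, 1, 1])
print(policy, agreement)
```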

---

### 5. Exploration vs. Exploitation

**Description:**
- Evaluate the balance between exploration and exploitation during training.

**Interpretation:**
- A good balance ensures that the agent explores the environment sufficiently while also exploiting the learned policy.
- Plot the exploration rate (\(\epsilon\)) over time to observe the balance.
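
To see what a multiplicative decay does to \( \epsilon \), the sketch below generates the per-episode values. The `epsilon_min` floor is an assumption added here (the parameters in this document do not include one); it keeps a small amount of exploration late in training.

```python
def epsilon_schedule(epsilon_start, decay, num_episodes, epsilon_min=0.01):
    """Exploration rate used in each episode under multiplicative decay with a floor."""
    eps, schedule = epsilon_start, []
    for _ in range(num_episodes):
        schedule.append(eps)
        eps = max(epsilon_min, eps * decay)
    return schedule

schedule = epsilon_schedule(epsilon_start=1.0, decay=0.99, num_episodes=1000)
print(schedule[0], schedule[100], schedule[-1])
```

With a decay of 0.99 per episode, \( \epsilon \) falls to about 0.37 after 100 episodes and reaches the 0.01 floor after roughly 460 episodes.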

---

### 6. Learning Rate (\(\alpha\))

**Description:**
- Evaluate the impact of the learning rate on the agent's performance.

**Interpretation:**
- A higher learning rate may lead to faster convergence but can cause instability.
- A lower learning rate may result in slower convergence but more stable learning.
- Experiment with different learning rates to find the optimal value.
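
The stability trade-off can be illustrated in isolation with the same incremental update, \( q \leftarrow q + \alpha (\text{sample} - q) \), applied to noisy samples of a fixed target. This is a toy experiment under assumed settings (Gaussian noise, 5000 updates, fixed seed), not a Q-learning run.

```python
import random
import statistics

def track_noisy_target(alpha, num_updates=5000, true_value=1.0, noise_std=1.0, seed=0):
    """Run q += alpha * (sample - q) on noisy samples and return every intermediate q."""
    rng = random.Random(seed)
    q, history = 0.0, []
    for _ in range(num_updates):
        sample = true_value + rng.gauss(0.0, noise_std)
        q += alpha * (sample - q)
        history.append(q)
    return history

# Standard deviation of the estimate after the first 1000 updates, per learning rate.
spread = {
    alpha: statistics.pstdev(track_noisy_target(alpha)[1000:])
    for alpha in (0.5, 0.05)
}
print(spread)
```

The larger learning rate leaves a visibly wider spread around the target once the estimate has settled.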

---

### 7. Discount Factor (\(\gamma\))

**Description:**
- Evaluate the impact of the discount factor on the agent's performance.

**Interpretation:**
- A higher discount factor places more importance on future rewards.
- A lower discount factor places more importance on immediate rewards.
- Experiment with different discount factors to find the optimal value.
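
A short sketch of how \( \gamma \) weights a delayed reward. The reward sequence is an invented example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 10.0]   # a single reward arriving three steps in the future
for gamma in (1.0, 0.9, 0.5):
    print(gamma, round(discounted_return(rewards, gamma), 4))
```

A reward of 10 arriving three steps in the future is worth \( 10 \times 0.9^3 = 7.29 \) at \( \gamma = 0.9 \) but only \( 10 \times 0.5^3 = 1.25 \) at \( \gamma = 0.5 \).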

---

### 8. Training Time

**Description:**
- Measure the time taken to train the agent.

**Interpretation:**
- Lower training times indicate more efficient learning.
- Evaluate training time in conjunction with other metrics to assess overall performance.

---

### 9. Success Rate

**Description:**
- Measure the success rate of the agent in achieving its goal.

**Interpretation:**
- Higher success rates indicate better performance.
- Plot success rates across episodes to observe learning progress.

---

### 10. Episode Length

**Description:**
- Measure the average length of episodes.

**Interpretation:**
- Shorter episodes may indicate better performance (e.g., reaching the goal faster).
- Plot episode lengths across episodes to observe learning progress.
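
Success rate (metric 9) and average episode length can be computed from the same per-episode log. The sketch assumes each evaluation episode is recorded as a `(num_steps, reached_goal)` tuple; that log format and the sample data are assumptions.

```python
def summarize_episodes(episodes):
    """Return (success_rate, mean_length) for a list of (num_steps, reached_goal) tuples."""
    n = len(episodes)
    success_rate = sum(1 for _, reached_goal in episodes if reached_goal) / n
    mean_length = sum(num_steps for num_steps, _ in episodes) / n
    return success_rate, mean_length

# Four hypothetical evaluation episodes; the second one hit the step limit.
log = [(12, True), (100, False), (9, True), (15, True)]
success_rate, mean_length = summarize_episodes(log)
print(success_rate, mean_length)
```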

---


## Q-learning

### Custom Implementation

| **Parameter**   | **Description**                                                                 |
|-----------------|-------------------------------------------------------------------------------|
| alpha           | Learning rate (0 < alpha ≤ 1).                                                |
| gamma           | Discount factor (0 ≤ gamma ≤ 1).                                              |
| epsilon         | Exploration rate (probability of choosing a random action).                   |
| epsilon_decay   | Decay rate for the exploration rate.                                          |
| num_episodes    | Number of training episodes.                                                  |
| max_steps       | Maximum number of steps per episode.                                          |


| **Attribute**           | **Description**                                                                 |
|-------------------------|-------------------------------------------------------------------------------|
| Q-table                 | A table storing Q-values for state-action pairs.                             |
| policy                  | The learned policy (mapping from states to actions).                        |


| **Method**              | **Description**                                                                 |
|-------------------------|-------------------------------------------------------------------------------|
| choose_action(state)    | Choose an action based on the exploration strategy.                           |
| update_q_table(state, action, reward, next_state) | Update the Q-table using the Bellman equation.         |
| train(env)              | Train the Q-learning agent in the given environment.                         |
| evaluate(env)           | Evaluate the performance of the Q-learning agent in the environment.          |
| decay_epsilon()         | Decay the exploration rate over time.                                         |
| reset()                 | Reset the Q-learning agent for a new training run.                            |


| **Usage Example (Python)**  | **Code**                                                                 |
|-----------------------------|------------------------------------------------------------------------|
| **Initialize Q-learning agent** | `agent = QLearningAgent(alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_decay=0.99, num_episodes=1000, max_steps=100)` |
| **Train the agent** | `agent.train(env)` |
| **Evaluate the agent** | `agent.evaluate(env)` |
| **Choose an action** | `action = agent.choose_action(state)` |
| **Update Q-table** | `agent.update_q_table(state, action, reward, next_state)` |
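
The tables above describe an interface rather than a concrete class. One possible sketch follows, assuming a discrete environment whose `reset()` returns `(state, info)` and whose `step(action)` returns `(next_state, reward, terminated, truncated, info)`, as in Gymnasium. The constructor arguments `n_states`, `n_actions`, and `seed`, the optional `terminated` flag, and the `CorridorEnv` toy environment are additions made for this example.

```python
import random

class QLearningAgent:
    """Tabular agent mirroring the tables above (illustrative sketch)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0,
                 epsilon_decay=0.99, num_episodes=1000, max_steps=100, seed=None):
        # n_states, n_actions, and seed are assumptions; the tables do not list them.
        self.n_states, self.n_actions = n_states, n_actions
        self.alpha, self.gamma = alpha, gamma
        self.initial_epsilon, self.epsilon_decay = epsilon, epsilon_decay
        self.num_episodes, self.max_steps = num_episodes, max_steps
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Forget everything learned: zero the Q-table and restore epsilon."""
        self.epsilon = self.initial_epsilon
        self.q_table = [[0.0] * self.n_actions for _ in range(self.n_states)]

    def _greedy(self, state):
        row = self.q_table[state]
        best = max(row)
        # Break ties at random so an untrained agent does not favor action 0.
        return self.rng.choice([a for a, v in enumerate(row) if v == best])

    @property
    def policy(self):
        """Greedy action for every state under the current Q-table."""
        return [self._greedy(s) for s in range(self.n_states)]

    def choose_action(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)   # explore
        return self._greedy(state)                      # exploit

    def update_q_table(self, state, action, reward, next_state, terminated=False):
        # No bootstrap term when next_state is terminal.
        future = 0.0 if terminated else self.gamma * max(self.q_table[next_state])
        td_error = reward + future - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

    def decay_epsilon(self):
        self.epsilon *= self.epsilon_decay

    def train(self, env):
        for _episode in range(self.num_episodes):
            state, _info = env.reset()
            for _step in range(self.max_steps):
                action = self.choose_action(state)
                next_state, reward, terminated, truncated, _info = env.step(action)
                self.update_q_table(state, action, reward, next_state, terminated)
                state = next_state
                if terminated or truncated:
                    break
            self.decay_epsilon()

    def evaluate(self, env, num_episodes=100):
        """Average undiscounted return of the greedy policy."""
        total = 0.0
        for _episode in range(num_episodes):
            state, _info = env.reset()
            for _step in range(self.max_steps):
                state, reward, terminated, truncated, _info = env.step(self._greedy(state))
                total += reward
                if terminated or truncated:
                    break
        return total / num_episodes


class CorridorEnv:
    """Hypothetical 5-cell corridor: action 0 = left, 1 = right, reward 1 at the right end."""

    def __init__(self, length=5):
        self.length, self.state = length, 0

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = -1 if action == 0 else 1
        self.state = min(self.length - 1, max(0, self.state + move))
        terminated = self.state == self.length - 1
        return self.state, (1.0 if terminated else 0.0), terminated, False, {}


env = CorridorEnv()
agent = QLearningAgent(n_states=5, n_actions=2, gamma=0.9,
                       num_episodes=300, max_steps=50, seed=0)
agent.train(env)
avg_reward = agent.evaluate(env, num_episodes=20)
print(avg_reward, agent.policy[:4])
```

Keeping `terminated` optional means the four-argument call shown in the usage table still works; passing it simply avoids bootstrapping from a terminal state.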

---




# Q-learning - Example

```python
import numpy as np
# Gymnasium is the maintained successor to gym; legacy gym raises
# "module 'numpy' has no attribute 'bool8'" on NumPy versions that removed that alias.
import gymnasium as gym
import matplotlib.pyplot as plt

# Environment setup
env = gym.make("Taxi-v3")
state_size = env.observation_space.n
action_size = env.action_space.n

# Q-table initialization: one row per state, one column per action
q_table = np.zeros((state_size, action_size))

# Parameters
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 1.0  # Exploration rate
epsilon_decay = 0.99  # Decay rate for exploration
num_episodes = 1000  # Number of training episodes
max_steps = 100  # Maximum steps per episode

# Training variables
rewards = []

# Model Definition and Training
for episode in range(num_episodes):
    state, info = env.reset()  # reset() returns (observation, info)
    total_rewards = 0
    for step in range(max_steps):
        # Choose action (exploration vs. exploitation)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state, :]))

        # Take action and observe the outcome
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-table using the Bellman equation.
        # A terminal state has no future value, so the bootstrap term is dropped there.
        best_next_value = 0.0 if terminated else np.max(q_table[next_state, :])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])

        # Move to the next state
        state = next_state
        total_rewards += reward

        if done:
            break

    # Decay the exploration rate
    epsilon *= epsilon_decay
    rewards.append(total_rewards)

# Data Plotting
plt.plot(range(num_episodes), rewards)
plt.xlabel('Episodes')
plt.ylabel('Rewards')
plt.title('Training Rewards over Episodes')
plt.show()

# Model Evaluation
num_eval_episodes = 100
total_eval_rewards = 0

for episode in range(num_eval_episodes):
    state, info = env.reset()
    episode_rewards = 0
    for step in range(max_steps):
        action = int(np.argmax(q_table[state, :]))  # Exploitation only
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_rewards += reward
        state = next_state
        if done:
            break
    total_eval_rewards += episode_rewards

avg_eval_reward = total_eval_rewards / num_eval_episodes
print(f"Average Evaluation Reward: {avg_eval_reward}")
```