# **DQN Model Theory**


## Deep Q-Network (DQN)

---

## Theory
Deep Q-Network (DQN) is a model-free, off-policy reinforcement learning algorithm that combines Q-learning with deep neural networks. It learns an action-selection policy for an agent interacting with an environment by approximating the Q-value function with a deep neural network. DQN is particularly effective for environments with large or high-dimensional state spaces and a discrete set of actions, where a tabular Q-function would be infeasible.

The main idea is to:
- Use a deep neural network to approximate the Q-value function.
- Utilize experience replay to store and sample past experiences.
- Employ a target network to stabilize the training process.

---

## Mathematical Foundation
- **Q-value Function**:
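  The Q-value function \( Q(s, a) \) represents the expected cumulative discounted reward of taking action \( a \) in state \( s \) and following the optimal policy thereafter. Strictly speaking, this is the optimal action-value function \( Q^*(s, a) \), which DQN tries to approximate.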

- **Bellman Equation**:
  Tabular Q-learning updates the Q-value with a rule derived from the Bellman optimality equation:
  $$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) $$
  where:
  - \( s \) is the current state.
  - \( a \) is the current action.
  - \( r \) is the reward received after taking action \( a \).
  - \( s' \) is the next state.
  - \( a' \) ranges over the actions available in \( s' \).
  - \( \alpha \) is the learning rate (0 < \( \alpha \) ≤ 1).
  - \( \gamma \) is the discount factor (0 ≤ \( \gamma \) ≤ 1).

- **Deep Neural Network**:
  A neural network is used to approximate the Q-value function. The network takes the state as input and outputs Q-values for all possible actions.

- **Experience Replay**:
  A replay buffer is used to store past experiences (state, action, reward, next state, done). During training, random samples from the buffer are used to update the Q-network, breaking the correlation between consecutive experiences and improving training stability.

- **Target Network**:
  A separate target network is used to compute the target Q-values. The target network is periodically updated to match the weights of the main Q-network, reducing the risk of divergence and improving training stability.
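
As a worked instance of the update rule above, the following minimal sketch applies a single tabular Q-learning update. The dictionary-based table, the state and action names, and all numbers are illustrative assumptions, not values from any particular environment.

```python
def q_learning_update(q_table, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    best_next = max(q_table[s_next].values())   # max over a' of Q(s', a')
    td_target = r + gamma * best_next           # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[s][a]        # difference to the current estimate
    q_table[s][a] += alpha * td_error
    return q_table[s][a]


# Hypothetical two-state, two-action table (illustrative numbers only).
q_table = {
    "s0": {"left": 0.0, "right": 0.5},
    "s1": {"left": 1.0, "right": 2.0},
}
new_value = q_learning_update(
    q_table, "s0", "right", r=1.0, s_next="s1", alpha=0.1, gamma=0.9
)
print(round(new_value, 4))  # 0.73
```

Here the target is \( 1.0 + 0.9 \times 2.0 = 2.8 \), so the estimate moves from 0.5 toward 2.8 by a fraction \( \alpha = 0.1 \) of the gap.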

---

## Algorithm Steps
1. **Initialize Q-network and Target Network**:
   - Initialize the Q-network with random weights.
   - Initialize the target network with the same weights as the Q-network.

2. **Initialize Replay Buffer**:
   - Create a replay buffer to store past experiences.

3. **Repeat for each episode**:
   - Initialize the starting state.

4. **Repeat for each step of the episode**:
   - Choose an action \( a \) based on the exploration strategy (e.g., \( \epsilon \)-greedy).
   - Take action \( a \), observe the reward \( r \) and the next state \( s' \).
   - Store the experience (state, action, reward, next state, done) in the replay buffer.
   - Sample a mini-batch of experiences from the replay buffer.
   - Compute the target Q-value for each experience using the target network (for terminal transitions, where done is true, the target is just \( y = r \)):
     $$ y = r + \gamma \max_{a'} Q_{\text{target}}(s', a') $$
   - Update the Q-network by minimizing the squared error between the predicted Q-value and the target Q-value, averaged over the mini-batch:
     $$ L = \left( y - Q(s, a) \right)^2 $$
   - Periodically update the target network to match the Q-network.

5. **End of Episode**:
   - If the episode ends (e.g., the agent reaches a terminal state), reset the environment.
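
The target and loss computation in step 4 can be sketched without any deep learning library. This is a minimal illustration, assuming the Q-values have already been produced by the two networks; the function names and the numbers are hypothetical. As in the example implementation later in this document, the bootstrap term is dropped when the transition is terminal.

```python
def dqn_targets(rewards, next_q_values, dones, gamma):
    """Return y = r for terminal transitions, else r + gamma * max_a' Q_target(s', a')."""
    targets = []
    for r, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(r + bootstrap)
    return targets


def mse_loss(targets, predicted_q):
    """Mean squared error between the targets y and the predictions Q(s, a)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predicted_q)) / len(targets)


# Illustrative mini-batch of three transitions (made-up numbers).
rewards = [1.0, 1.0, 0.0]
next_q_values = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]  # target-network outputs for s'
dones = [False, False, True]                           # the last transition is terminal
predicted_q = [2.0, 1.5, 0.5]                          # Q(s, a) from the Q-network

targets = dqn_targets(rewards, next_q_values, dones, gamma=0.5)
loss = mse_loss(targets, predicted_q)
print(targets, loss)
```

Only the third transition contributes to the loss here: its target is the bare reward because the episode ended, while the first two predictions already match their targets.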

---

## Key Components
- **Q-network**: A neural network that approximates the Q-value function.
- **Target Network**: A copy of the Q-network used to compute target Q-values.
- **Replay Buffer**: A buffer that stores past experiences for training.
- **Exploration Strategy**: A strategy to balance exploration and exploitation (e.g., \( \epsilon \)-greedy).
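
Of the components above, the replay buffer is the simplest to show in isolation. The sketch below is one possible layout, assuming uniform sampling and a fixed capacity; the class name `ReplayBuffer` and its method names are hypothetical and do not come from any library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a uniformly random mini-batch, sampled without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # States are plain integers here purely for illustration.
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), [experience[0] for experience in batch])
```

After five insertions into a buffer of capacity 3, only the three most recent experiences remain, so the sampled states always come from 2, 3, and 4.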

---

## Advantages
- Effective for environments with large or high-dimensional state spaces and a discrete set of actions.
- Utilizes deep learning to approximate the Q-value function.
- Stabilizes training using experience replay and target networks.

---

## Disadvantages
- Requires significant computational resources for training.
- Can be unstable if hyperparameters are not tuned properly.
- Sensitive to the choice of neural network architecture and learning parameters.

---

## Implementation Tips
- Use **experience replay** to improve sample efficiency and training stability.
- Employ **target networks** to reduce the risk of divergence and stabilize training.
- Use **exploration strategies** like \( \epsilon \)-greedy or decayed \( \epsilon \)-greedy to balance exploration and exploitation.
- Clip or normalize rewards to keep the scale of the targets bounded (the original DQN for Atari clipped rewards to the range [-1, 1]).
- Regularly monitor and adjust hyperparameters to optimize performance.
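
The decayed \( \epsilon \)-greedy tip can be sketched as two small functions. This is a minimal illustration, assuming multiplicative decay with a floor; the default values mirror the hyperparameters used elsewhere in this document, and the function names are hypothetical.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, epsilon_decay=0.995, epsilon_min=0.01):
    """Multiply epsilon by the decay rate, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


q_values = [0.1, 0.7, 0.3]
greedy_action = choose_action(q_values, epsilon=0.0)  # epsilon = 0 always exploits
epsilon = decay_epsilon(1.0)
print(greedy_action, epsilon)
```

With `epsilon=0.0` the function always returns the index of the largest Q-value, which is action 1 here; with `epsilon=1.0` every action is chosen uniformly at random.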

---

## Applications
- Game playing (e.g., Atari games)
- Robotics (e.g., navigation, manipulation)
- Autonomous driving
- Financial trading
- Industrial automation

Deep Q-Network (DQN) is a powerful reinforcement learning algorithm that leverages deep learning to handle complex environments with large state-action spaces. Its ability to learn from raw sensory input makes it a valuable tool for many real-world applications.


## Model Evaluation for Deep Q-Network (DQN)

### 1. Q-values

**Description:**
- Evaluate the learned Q-values to ensure they converge to the optimal values.

**Interpretation:**
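- Higher Q-values indicate higher expected returns for the corresponding state-action pairs. Because the max operator in the target tends to overestimate Q-values, read them alongside the rewards the agent actually obtains.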
- Compare Q-values across different episodes to observe convergence.

---

### 2. Cumulative Reward

**Formula:**
$$
\text{Cumulative Reward} = \sum_{t=0}^T r_t
$$
where:
- \( r_t \) is the reward received at time step \( t \).
- \( T \) is the total number of time steps.

**Description:**
- Measures the total reward accumulated by the agent over an episode.

**Interpretation:**
- Higher cumulative rewards indicate better performance.
- Plot cumulative rewards across episodes to observe learning progress.
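
Per-episode cumulative rewards are usually noisy, so a trailing average makes the trend easier to read. The sketch below computes both; the helper names, the window size, and the reward lists are illustrative assumptions.

```python
def cumulative_reward(rewards):
    """Sum of the rewards r_t received during one episode."""
    return sum(rewards)


def moving_average(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Three made-up episodes with a reward of 1.0 per step, as in CartPole.
episode_rewards = [[1.0] * 2, [1.0] * 4, [1.0] * 6]
returns = [cumulative_reward(rewards) for rewards in episode_rewards]
smoothed = moving_average(returns, window=2)
print(returns, smoothed)  # [2.0, 4.0, 6.0] [2.0, 3.0, 5.0]
```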

---

### 3. Convergence

**Description:**
- Evaluate the convergence of the Q-values and cumulative rewards over time.

**Interpretation:**
- Convergence indicates that the agent has learned a stable policy.
- Plot Q-values and cumulative rewards across episodes to check for convergence.

---

### 4. Policy Evaluation

**Description:**
- Evaluate the learned policy by observing the actions taken by the agent.

**Interpretation:**
- A good policy should result in optimal actions being taken in each state.
- Compare the learned policy with the optimal policy (if known) to assess performance.

---

### 5. Exploration vs. Exploitation

**Description:**
- Evaluate the balance between exploration and exploitation during training.

**Interpretation:**
- A good balance ensures that the agent explores the environment sufficiently while also exploiting the learned policy.
- Plot the exploration rate (\(\epsilon\)) over time to observe the balance.
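
With multiplicative decay, the exploration rate after \( n \) decay steps is \( \epsilon_n = \epsilon_0 \cdot d^n \), so the number of steps until it reaches the floor can be computed directly. The sketch below does this for the values used in this document (start 1.0, decay 0.995, minimum 0.01); the function name is hypothetical.

```python
import math


def steps_to_epsilon_min(epsilon_start, epsilon_decay, epsilon_min):
    """Smallest n such that epsilon_start * epsilon_decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))


n = steps_to_epsilon_min(epsilon_start=1.0, epsilon_decay=0.995, epsilon_min=0.01)
print(n)  # 919
```

Whether 919 decay steps is a long or a short exploration phase depends on how often the decay is applied: once per episode it spans most of a 1000-episode run, while once per training step it is over within the first few dozen episodes.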

---

### 6. Learning Rate (\(\alpha\))

**Description:**
- Evaluate the impact of the learning rate on the agent's performance.

**Interpretation:**
- A higher learning rate may lead to faster convergence but can cause instability.
- A lower learning rate may result in slower convergence but more stable learning.
- Experiment with different learning rates to find the optimal value.

---

### 7. Discount Factor (\(\gamma\))

**Description:**
- Evaluate the impact of the discount factor on the agent's performance.

**Interpretation:**
- A higher discount factor places more importance on future rewards.
- A lower discount factor places more importance on immediate rewards.
- Experiment with different discount factors to find the optimal value.
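
To see how the discount factor weights future rewards, the sketch below computes the discounted return \( \sum_t \gamma^t r_t \) of a constant reward stream for several values of \( \gamma \), next to the rough effective horizon \( 1 / (1 - \gamma) \). The reward sequence and the chosen \( \gamma \) values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0] * 100  # a made-up episode with reward 1.0 at every step
for gamma in (0.5, 0.9, 0.99):
    horizon = 1.0 / (1.0 - gamma)  # rough number of steps that matter
    print(gamma, round(discounted_return(rewards, gamma), 3), round(horizon, 1))
```

For a small \( \gamma \) the return is dominated by the first few rewards, while for \( \gamma \) close to 1 rewards far in the future still contribute noticeably.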

---

### 8. Training Time

**Description:**
- Measure the time taken to train the agent.

**Interpretation:**
- A lower training time to reach a given level of performance indicates more efficient learning.
- Evaluate training time in conjunction with other metrics to assess overall performance.

---

### 9. Success Rate

**Description:**
- Measure the success rate of the agent in achieving its goal.

**Interpretation:**
- Higher success rates indicate better performance.
- Plot success rates across episodes to observe learning progress.
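
What counts as a success depends on the task. One simple convention, assumed here purely for illustration, is to call an episode successful when its return reaches a threshold; the threshold of 475 and the returns below are hypothetical values.

```python
def success_rate(episode_returns, threshold):
    """Fraction of episodes whose return reaches the threshold."""
    successes = sum(1 for g in episode_returns if g >= threshold)
    return successes / len(episode_returns)


episode_returns = [120.0, 500.0, 480.0, 90.0]  # made-up evaluation returns
rate = success_rate(episode_returns, threshold=475.0)
print(rate)  # 0.5
```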

---

### 10. Episode Length

**Description:**
- Measure the average length of episodes.

**Interpretation:**
- Shorter episodes may indicate better performance in goal-reaching tasks (e.g., reaching the goal faster). In survival tasks such as CartPole, longer episodes are better.
- Plot episode lengths across episodes to observe learning progress.

---


## Deep Q-Network (DQN)

### TensorFlow/Keras Implementation

| **Parameter**          | **Description**                                                                 |
|------------------------|-------------------------------------------------------------------------------|
| state_size             | The size of the state space.                                                   |
| action_size            | The size of the action space.                                                  |
| learning_rate          | The learning rate for the Q-network.                                           |
| gamma                  | Discount factor for future rewards (0 ≤ gamma ≤ 1).                            |
| epsilon                | Exploration rate for the epsilon-greedy policy.                                |
| epsilon_decay          | Decay rate for the exploration rate.                                           |
| epsilon_min            | Minimum value of epsilon.                                                      |
| batch_size             | Size of the mini-batch used for training.                                       |
| memory_capacity        | Capacity of the replay buffer.                                                 |
| target_update_freq     | Frequency of updating the target network.                                      |

---

| **Attribute**          | **Description**                                                                 |
|------------------------|-------------------------------------------------------------------------------|
| q_network              | Neural network model to approximate the Q-value function.                      |
| target_network         | Neural network model to provide stable target Q-values.                        |
| replay_buffer          | Buffer to store past experiences for experience replay.                        |

---

| **Method**             | **Description**                                                                 |
|------------------------|-------------------------------------------------------------------------------|
| build_model()          | Build the Q-network and target network models.                                 |
| remember(state, action, reward, next_state, done) | Store experiences in the replay buffer.                |
| choose_action(state)   | Choose an action based on the epsilon-greedy policy.                           |
| replay()               | Train the Q-network using mini-batches from the replay buffer.                 |
| update_target_network()| Update the weights of the target network to match the Q-network.               |
| train(env, num_episodes)| Train the DQN agent in the given environment.                                  |
| evaluate(env, num_episodes)| Evaluate the performance of the DQN agent in the environment.               |

---

| **Usage Example (Python)**  | **Code**                                                                 |
|-----------------------------|------------------------------------------------------------------------|
| **Initialize DQN agent**    | `agent = DQNAgent(state_size, action_size, learning_rate=0.001, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01, batch_size=64, memory_capacity=100000, target_update_freq=10)` |
| **Train the agent**         | `agent.train(env, num_episodes=1000)`                                   |
| **Evaluate the agent**      | `agent.evaluate(env, num_episodes=100)`                                 |
| **Choose an action**        | `action = agent.choose_action(state)`                                   |
| **Store experiences**       | `agent.remember(state, action, reward, next_state, done)`               |

---


# Deep Q-Network (DQN) - Example

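```python
import random
from collections import deque

import gym
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import layers, models, optimizers


def reset_env(env, state_size):
    """Reset env and return the initial state with shape (1, state_size)."""
    result = env.reset()
    # gym >= 0.26 returns (observation, info); older versions return only the observation.
    observation = result[0] if isinstance(result, tuple) else result
    return np.reshape(observation, [1, state_size])


def step_env(env, action, state_size):
    """Step env and return (next_state, reward, terminated, done) for either gym API."""
    result = env.step(action)
    if len(result) == 5:  # gym >= 0.26: observation, reward, terminated, truncated, info
        observation, reward, terminated, truncated, _ = result
        done = terminated or truncated
    else:  # older gym: observation, reward, done, info
        observation, reward, done, _ = result
        terminated = done
    return np.reshape(observation, [1, state_size]), reward, terminated, done


# Model Definition
class DQNAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01, batch_size=64,
                 memory_capacity=100000, target_update_freq=10):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq  # episodes between target syncs
        self.replay_buffer = deque(maxlen=memory_capacity)
        self.q_network = self.build_model()
        self.target_network = self.build_model()
        self.update_target_network()

    def build_model(self):
        model = models.Sequential([
            layers.Input(shape=(self.state_size,)),
            layers.Dense(24, activation='relu'),
            layers.Dense(24, activation='relu'),
            layers.Dense(self.action_size, activation='linear'),
        ])
        model.compile(optimizer=optimizers.Adam(learning_rate=self.learning_rate), loss='mse')
        return model

    def remember(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.q_network.predict(state, verbose=0)
        return int(np.argmax(q_values[0]))

    def replay(self):
        if len(self.replay_buffer) < self.batch_size:
            return
        mini_batch = random.sample(self.replay_buffer, self.batch_size)
        states = np.vstack([m[0] for m in mini_batch])
        actions = np.array([m[1] for m in mini_batch])
        rewards = np.array([m[2] for m in mini_batch], dtype=np.float32)
        next_states = np.vstack([m[3] for m in mini_batch])
        dones = np.array([m[4] for m in mini_batch], dtype=np.float32)
        # y = r for terminal transitions, else r + gamma * max_a' Q_target(s', a').
        next_q = self.target_network.predict(next_states, verbose=0)
        targets = rewards + self.gamma * np.max(next_q, axis=1) * (1.0 - dones)
        # Only the taken action's output is replaced, so other actions add no loss.
        target_f = self.q_network.predict(states, verbose=0)
        target_f[np.arange(self.batch_size), actions] = targets
        self.q_network.train_on_batch(states, target_f)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def update_target_network(self):
        self.target_network.set_weights(self.q_network.get_weights())

    def train(self, env, num_episodes, max_steps=500):
        rewards = []
        for e in range(num_episodes):
            state = reset_env(env, self.state_size)
            total_reward = 0
            for _ in range(max_steps):
                action = self.choose_action(state)
                next_state, reward, terminated, done = step_env(env, action, self.state_size)
                # Store `terminated` so time-limit truncations still bootstrap.
                self.remember(state, action, reward, next_state, terminated)
                state = next_state
                total_reward += reward
                self.replay()
                if done:
                    break
            # Sync the target network every target_update_freq episodes.
            if (e + 1) % self.target_update_freq == 0:
                self.update_target_network()
            rewards.append(total_reward)
            print(f"Episode: {e+1}/{num_episodes}, Score: {total_reward}, Epsilon: {self.epsilon:.2f}")
        return rewards

    def evaluate(self, env, num_episodes=100, max_steps=500):
        total_reward = 0
        for _ in range(num_episodes):
            state = reset_env(env, self.state_size)
            for _ in range(max_steps):
                # Greedy action only: no exploration during evaluation.
                action = int(np.argmax(self.q_network.predict(state, verbose=0)[0]))
                state, reward, _terminated, done = step_env(env, action, self.state_size)
                total_reward += reward
                if done:
                    break
        avg_reward = total_reward / num_episodes
        print(f"Average Evaluation Reward: {avg_reward}")
        return avg_reward


# Data Loading
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
num_episodes = 1000

# Initialize and train the DQN agent
agent = DQNAgent(state_size, action_size, learning_rate=0.001, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01, batch_size=64,
                 memory_capacity=100000, target_update_freq=10)
rewards = agent.train(env, num_episodes=num_episodes)

# Data Plotting
plt.plot(range(len(rewards)), rewards)
plt.xlabel('Episodes')
plt.ylabel('Rewards')
plt.title('Training Rewards over Episodes')
plt.show()

# Evaluate the DQN agent
agent.evaluate(env, num_episodes=100)
```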



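Note: on gym 0.26 and later, `env.reset()` returns an `(observation, info)` tuple and `env.step()` returns five values `(observation, reward, terminated, truncated, info)`. Passing the raw return value of `env.reset()` to `np.reshape` fails with:

`ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.`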