# Homework 3 Function-based RL
Ponwalai Chalermwattanatrai 65340500042

## Part 1: Understanding the Algorithm

For each algorithm, describe whether it follows a value-based, policy-based, or Actor-Critic approach, specify the type of policy it learns (stochastic or deterministic), identify the type of observation space and action space (discrete or continuous), and explain how each advanced RL method balances exploration and exploitation.

#### 1. Linear Q-Learning

- **Approach**: Value-based
- **Policy type**: Deterministic (stochastic during training)
    - The learned policy always chooses the action with the maximum Q-value (argmax Q(s, a)) -> Deterministic
    - During training, epsilon-greedy selection mixes in random actions -> Stochastic
- **Observation space**: continuous
- **Action space**: discrete
- **Balances exploration and exploitation**: using epsilon-greedy
    - Selects a random action with probability epsilon (otherwise the greedy action) and decays epsilon over time

**Concept of algorithm**

Linear Q-learning is a Q-learning algorithm that uses linear function approximation instead of a Q-table. It stores a weight matrix in which each column holds the weights for one action, so the matrix has size [state_size x num_actions].

In Linear Q-learning, the Q-value is a linear function of the state, computed as the dot product between the state vector and the weight vector of the selected action:

$$Q(s, a) = \mathbf{w}_a^\top \mathbf{s}$$

After the selected action is applied to the environment, we update the weights. First, we calculate the Temporal Difference (TD) error using the Bellman equation:

$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$

We then define the loss as half the squared TD error (the constant 1/2 only simplifies the derivative and does not change the minimizer):

$$\mathcal{L} = \frac{1}{2} \delta^2$$

To minimize this loss, we apply gradient descent. Treating the TD target $r + \gamma \max_{a'} Q(s', a')$ as a constant (a semi-gradient), the gradient of the loss with respect to the weights is:

$$\nabla_{\mathbf{w}_a} \mathcal{L} = -\delta \cdot \mathbf{s}$$

Therefore, the weights are updated using the following rule:

$$\mathbf{w}_a \leftarrow \mathbf{w}_a + \alpha \cdot \delta \cdot \mathbf{s}$$

Which can be written as:

$$\boxed{
\mathbf{w}_a^{\text{new}} = \mathbf{w}_a^{\text{old}} + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \cdot \mathbf{s}
}$$
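
As a rough illustration of this update rule, the sketch below applies one TD update to a hypothetical toy problem with two state features and two actions. It uses plain Python instead of the NumPy code in Part 2. The names `q_values` and `td_update`, the toy numbers, the row-per-action weight layout, and the handling of terminal states are all assumptions made for the example.

```python
def q_values(w, s):
    """Return Q(s, a) for every action; w[a] is the weight vector of action a."""
    return [sum(wi * si for wi, si in zip(w_a, s)) for w_a in w]


def td_update(w, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Apply one gradient step to w[a] in place and return the TD error."""
    # Assumption of this sketch: no bootstrapping from a terminal next state.
    q_next = 0.0 if done else max(q_values(w, s_next))
    delta = r + gamma * q_next - q_values(w, s)[a]
    # w_a <- w_a + alpha * delta * s
    w[a] = [wi + alpha * delta * si for wi, si in zip(w[a], s)]
    return delta


# Hypothetical toy problem: 2 state features, 2 actions.
w = [[0.0, 0.0], [0.5, -0.5]]
s, s_next = [1.0, 2.0], [0.0, 1.0]
delta = td_update(w, s, a=0, r=1.0, s_next=s_next, done=False)
print(delta, w[0])
```

With these numbers the next-state Q-values are 0.0 and -0.5, so the TD error is 1 + 0.9 * 0 - 0 = 1. The weights of action 0 move by alpha * delta * s to approximately [0.1, 0.2], and the weights of action 1 are left unchanged.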

**Pseudocode for training Linear Q-learning:**

#### 2. DQN (Deep Q-Learning)

- **Approach**: Value-based
- **Policy type**: Deterministic (stochastic during training)
    - The learned policy always chooses the action with the maximum Q-value (argmax Q(s, a)) -> Deterministic
    - During training, epsilon-greedy selection mixes in random actions -> Stochastic
- **Observation space**: continuous
- **Action space**: discrete
- **Balances exploration and exploitation**: using epsilon-greedy
    - Selects a random action with probability epsilon (otherwise the greedy action) and decays epsilon over time

**Concept of algorithm**

DQN (Deep Q-Network) is conceptually similar to Linear Q-learning, but instead of using a linear function for function approximation, DQN uses a neural network to approximate the Q-value function.

DQN has 3 main components:
1. Replay memory
    - A buffer that stores the agent's past experiences as tuples: (state, action, reward, next state, done)
    - During training, mini-batches of experiences are sampled at random from the buffer to update the policy net.
    - Random sampling breaks the temporal correlation between consecutive experiences, which improves stability, and reusing stored experiences improves sample efficiency.

2. Policy net
    - A neural network that approximates the Q-value.
    - Similar to the weight matrix in Linear Q-learning, but with hidden layers that give it greater approximation capacity.
    - The network is updated using TD error and gradient descent:
    $$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$
    - The loss function is the smooth L1 loss (Huber loss):
    $$
    \mathcal{L} =
    \begin{cases}
    \frac{1}{2} \delta^2 & \text{if } |\delta| < 1 \\
    |\delta| - \frac{1}{2} & \text{otherwise}
    \end{cases}
    $$
    - This behaves like MSE for small TD errors and like MAE for large errors, making it less sensitive to outliers.

3. Target net
    - A copy of the policy network that is used to calculate the Q-value of the next state.
    - It prevents the TD target from shifting too quickly, which can destabilize training.
    - Rather than copying the policy weights into the target network all at once, this implementation uses a soft update mechanism:
    $$\theta_{\text{target}} \leftarrow \tau \cdot \theta_{\text{policy}} + (1 - \tau) \cdot \theta_{\text{target}}$$
    Where:
    - $\theta_{\text{policy}}$: weights of the policy network
    - $\theta_{\text{target}}$: weights of the target network
    - $\tau$: soft update rate, with $\tau \in [0, 1]$

    The soft update moves the target network only a small step toward the policy network each time, which prevents rapid oscillations in learning.
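
To make the Huber loss and the soft update concrete, here is a minimal plain-Python sketch of both formulas applied to scalars and short lists. The function names `huber` and `soft_update` and the example numbers are assumptions for illustration; the actual implementation in Part 2 relies on PyTorch.

```python
def huber(delta):
    """Smooth L1 (Huber) loss of a single TD error, with the threshold at 1."""
    if abs(delta) < 1.0:
        return 0.5 * delta ** 2   # quadratic regime (MSE-like)
    return abs(delta) - 0.5       # linear regime (MAE-like)


def soft_update(target, policy, tau):
    """Return target parameters moved a fraction tau toward the policy parameters."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target, policy)]


small, large = huber(0.5), huber(3.0)
new_target = soft_update(target=[0.0, 1.0], policy=[1.0, 1.0], tau=0.1)
print(small, large, new_target)
```

A TD error of 0.5 gives a loss of 0.125 (quadratic), while a TD error of 3.0 gives 2.5 instead of the 4.5 that the squared loss would produce. With tau = 0.1, the first target weight moves only a tenth of the way from 0.0 toward 1.0.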
    
**The relationship between the 3 components:**

![image3.png](img/image3.png)

source: https://www.theengineeringprojects.com/2024/01/deep-q-networks-dqn-reinforcement-learning.html

**Pseudocode for training DQN:**

#### 3. MC REINFORCE

- **Approach**: Policy-based
- **Policy type**: Stochastic
    - Learns a probability distribution over actions (Categorical distribution)
- **Observation space**: continuous
- **Action space**: discrete
- **Balances exploration and exploitation**: using a stochastic policy
    - In MC REINFORCE, actions are sampled from a probability distribution, so the agent explores naturally: every action has a non-zero probability of being selected.
    - Exploitation emerges as policy gradient updates increase the probability of high-reward actions.

**Concept of algorithm**

MC REINFORCE is a Monte Carlo method that uses the policy gradient to learn a stochastic policy. It uses the complete return of an entire episode, collected under the current policy, to update the policy.

MC REINFORCE directly models the policy using a neural network (policy net) and optimizes it to maximize returns.

Action Selection:
- MC REINFORCE is policy-based, meaning it directly models the policy $\pi(a \mid s)$ as a neural network (policy net).
- Given a state s, the policy net outputs a probability distribution over actions.
- A Categorical distribution is constructed from these probabilities, and an action is sampled from it.

Policy Update
- In the policy update, we want to maximize the expected return:
$$J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} r_t \right]$$
- Using the episodic policy gradient theorem, we get a formula for how to change the policy parameters $\theta$ to increase $J(\theta)$:
$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$$
- Since most optimizers minimize a loss function, we define a loss whose gradient is the negative of the sampled policy gradient:
$$\mathcal{L}(\theta) = - \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \cdot G_t, \qquad \nabla_\theta \mathcal{L}(\theta) \approx -\nabla_\theta J(\theta)$$
- So, minimizing this loss by gradient descent performs gradient ascent on the expected return.

From this equation, updating the policy requires:
1. Returns $G_t$ at each timestep:
    - Calculated as the total discounted future reward from time t:
    $$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = r_t + \gamma G_{t+1}$$
2. Log-probabilities of selected actions:
    - For each timestep, we store:
    $$\log \pi_{\theta}(a_t \mid s_t)$$
    - The gradient of this term is the score function, which tells us how the log-probability of the action changes with respect to the policy parameters.
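
The two ingredients above can be sketched in plain Python for a hypothetical three-step episode. The helper names `discounted_returns` and `reinforce_loss`, the rewards, and the fixed action probability of 0.5 are assumptions chosen for illustration. A real implementation would compute the log-probabilities with an autograd framework so that the loss can be differentiated.

```python
import math


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} by scanning the episode backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Loss = -sum_t log pi(a_t | s_t) * G_t."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


rewards = [1.0, 1.0, 1.0]                # hypothetical episode rewards
returns = discounted_returns(rewards, gamma=0.5)
log_probs = [math.log(0.5)] * 3          # each action was taken with probability 0.5
loss = reinforce_loss(log_probs, returns)
print(returns, loss)
```

With gamma = 0.5 the returns are [1.75, 1.5, 1.0]: earlier steps receive credit for later rewards, discounted by how far away they are.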


**Pseudocode for training MC REINFORCE:**

#### 4. A2C (Advantage Actor–Critic)

- **Approach**: Actor-Critic
- **Policy type**: Stochastic
- **Observation space**: continuous
- **Action space**: discrete / continuous (in this project we do discrete)
- **Balances exploration and exploitation**: 
    - using a stochastic policy
        - In A2C, as in MC REINFORCE, actions are sampled from a probability distribution, so the agent explores naturally: every action has a non-zero probability of being selected.
        - Exploitation emerges as policy gradient updates increase the probability of high-reward actions.
    - using entropy
        - Entropy measures the randomness of the policy.
        - If all actions have similar probabilities (explore), entropy is high; if the policy strongly prefers one action (exploit), entropy is low.
        - The entropy is subtracted from the actor loss (as a bonus), so minimizing the loss rewards higher entropy and keeps the agent from becoming overconfident in one action too early.
        - A high entropy coefficient makes the agent explore more (slows down exploitation), while a low coefficient lets the agent exploit more aggressively.

**Concept of Algorithm**

A2C (Advantage Actor-Critic) is an actor-critic reinforcement learning algorithm that uses the advantage function to compute the actor's loss and improve learning efficiency.
- The actor network selects actions based on the current policy and is updated using the advantage estimate.
- The critic network estimates the value of states, which is used to compute the advantage, and is updated using the temporal difference (TD) error.

Components of A2C
- Actor network: a neural network that takes the state as input and outputs a probability distribution over actions (the policy).
    - The actor network is updated using the advantage:
    $$A(s,a) = Q(s,a) - V(s)$$
    Where:
    - $Q(s,a)$: Expected return from taking action a in state s
    - $V(s)$: Expected return from state s following the policy
    - $A(s,a)$: Extra benefit of taking action a instead of the average action

    In A2C, the advantage is estimated as:
    $$A_t = r + \gamma V(s') - V(s)$$

    It is then used to calculate the actor loss:
    $$L_{\text{actor}} = -\log \pi(a_t \mid s_t) \cdot A_t$$

    - Entropy bonus
        - An entropy term is added to the actor loss to encourage exploration.
        - Entropy is calculated from the action probabilities:
            - If the actions have similar probabilities: entropy is high (more exploration)
            - If the policy strongly prefers one action: entropy is low (more exploitation)
    The total actor loss becomes:
    $$L_{\text{actor}} = -\log \pi(a_t \mid s_t) \cdot A_t - c_{\text{entropy}} \cdot H(\pi(\cdot \mid s_t))$$
    where $c_{\text{entropy}}$ is the entropy coefficient (entropy_coef) and $H$ is the entropy of the action distribution.

    Including entropy in the loss helps the agent explore by keeping its action choices more varied early in training. Over time, it naturally becomes more confident as it learns which actions are better.
    - This loss is minimized using gradient descent to improve the policy.

- Critic-network: A neural network that estimates the value function V(s) (the expected return from a given state).
    - Update critic net by using TD-error:
    $$\delta = r + \gamma V(s') - V(s)$$

    - The loss function is the squared TD error (scaled by 1/2):
    $$L_{\text{critic}} =\frac{1}{2} \delta^2$$

    - This loss is minimized using gradient descent to improve the value prediction accuracy.
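
The actor and critic losses described above can be put together for a single transition as in the sketch below. It is plain Python with hypothetical numbers; `entropy`, `a2c_losses`, the default `gamma` and `entropy_coef` values, and the terminal-state handling are assumptions for the example, not taken from the project code.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_losses(probs, action, reward, v_s, v_next, done,
               gamma=0.9, entropy_coef=0.01):
    """Return (advantage, actor_loss, critic_loss) for one transition."""
    # Assumption of this sketch: no bootstrapping from a terminal next state.
    td_target = reward + (0.0 if done else gamma * v_next)
    advantage = td_target - v_s  # one-step estimate, equal to the TD error
    # In an autograd framework the advantage would be treated as a constant here.
    actor_loss = -math.log(probs[action]) * advantage - entropy_coef * entropy(probs)
    critic_loss = 0.5 * advantage ** 2
    return advantage, actor_loss, critic_loss


advantage, actor_loss, critic_loss = a2c_losses(
    probs=[0.5, 0.5], action=0, reward=1.0, v_s=0.5, v_next=1.0, done=False
)
print(advantage, actor_loss, critic_loss)
```

Here the advantage is positive (about 1.4), so minimizing the actor loss raises the probability of the chosen action, while the entropy term slightly lowers the loss for a more uniform policy.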

**Pseudocode for training A2C:**

## Part 2: Setting up Cart-Pole Agent.

#### RL base class

**Replay Buffer Class**

This class collects experiences into a memory of size buffer_size. When the memory is full, the oldest experience is dropped first (FIFO, first in first out). Experiences are sampled from the memory for training. It has 3 methods:
- add(state, action_idx, reward, next_state, done): adds an experience to memory.
- sample(): returns batch_size randomly chosen experiences for use in training.
- `__len__()`: returns the number of experiences in memory.

In [None]:
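from collections import deque
import random

import torch

# Note: `device` is expected to be defined elsewhere in the notebook.


class ReplayBuffer: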
    def __init__(self, buffer_size, batch_size = 1):
        """
        Initializes the replay buffer.

        Args:
            buffer_size (int): Maximum number of experiences the buffer can hold.
            batch_size (int): Number of experiences to sample per batch.
        """
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action_idx, reward, next_state, done):
        """
        Adds an experience to the replay buffer.

        Args:
            state (Tensor): The current state of the environment.
            action_idx (int): The action index of action taken at this state.
            reward (float): The reward received after taking the action.
            next_state (Tensor): The next state resulting from the action.
            done (bool): Whether the episode has terminated.
        """
        self.memory.append((state, action_idx, reward, next_state, done))

    def sample(self):
        """
        Samples a batch of experiences from the replay buffer.

        Returns:
            tuple:
                - state_batch (Tensor): Batch of states.
                - action_batch (Tensor): Batch of actions.
                - reward_batch (Tensor): Batch of rewards.
                - next_state_batch (Tensor): Batch of next states.
                - done_batch (Tensor): Batch of terminal state flags.
        """
        experiences = random.sample(self.memory, k=self.batch_size)
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = zip(*experiences)

        return (
            torch.stack(state_batch).to(device),
            torch.tensor(action_batch, dtype=torch.long).to(device),
            torch.tensor(reward_batch, dtype=torch.float).to(device),
            torch.stack(next_state_batch).to(device),
            torch.tensor(done_batch, dtype=torch.bool).to(device),
        )

    def __len__(self):
        """
        Returns the current size of the replay buffer.

        Returns:
            int: The number of stored experiences.
        """
        return len(self.memory)
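
The FIFO behaviour and the batch unpacking can be seen in isolation with a small standard-library sketch. The transitions below are hypothetical integers rather than tensors, and the buffer size of 3 is chosen only to make the eviction visible.

```python
import random
from collections import deque

buffer = deque(maxlen=3)                 # plays the role of buffer_size = 3
for step in range(5):                    # add 5 hypothetical transitions
    buffer.append((step, 0, 1.0, step + 1, False))

oldest_state = buffer[0][0]              # transitions 0 and 1 have been evicted

# Sample a batch and regroup it field by field, as sample() does with zip(*...)
batch = random.sample(list(buffer), k=2)
states, actions, rewards, next_states, dones = zip(*batch)
print(len(buffer), oldest_state, states)
```

Because `deque(maxlen=...)` silently discards the oldest item on overflow, no explicit eviction logic is needed in `add`.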

**Base Algorithm**

This class is the base for the other algorithms. It
- initializes important variables such as learning_rate, discount_factor, initial_epsilon, epsilon_decay, final_epsilon, num_of_action, action_range
- creates an experience replay buffer memory
- has 3 methods:
    - scale_action(action_index):
        Maps a discrete action index in range [0, num_of_action - 1] to a value in range [action_min, action_max].
    - decay_epsilon():
        Decays the epsilon value to reduce exploration over time (used in Linear Q-learning and DQN).
    - extract_policy_state(obs):
        Extracts the policy state from the observation dict as a numpy array, then clips and normalizes it.

scale_action

In [None]:
def scale_action(self, action):
    """
    Maps a discrete action in range [0, num_of_action - 1] to a continuous value in [action_min, action_max].

    Args:
        action (int): Discrete action index in range [0, num_of_action - 1].
    
    Returns:
        torch.Tensor: Scaled action tensor.
    """
    min_action, max_action = self.action_range
    action_step = (max_action - min_action) / (self.num_of_action - 1)
    action_value = min_action + action * action_step

    return torch.tensor([[action_value]], dtype=torch.float32)
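
As a worked example of this mapping, the standalone sketch below applies the same arithmetic without the class or the tensor wrapper. The choice of 5 actions and an action range of (-10, 10) is hypothetical.

```python
def scale_action(action, num_of_action, action_range):
    """Map an action index in [0, num_of_action - 1] onto an evenly spaced grid."""
    min_action, max_action = action_range
    # Requires num_of_action >= 2, otherwise the step size divides by zero.
    action_step = (max_action - min_action) / (num_of_action - 1)
    return min_action + action * action_step


values = [scale_action(a, 5, (-10.0, 10.0)) for a in range(5)]
print(values)
```

Index 0 maps to the minimum, index 4 to the maximum, and the indices in between are spaced 5.0 apart, so the middle index corresponds to an action value of 0.0.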

decay_epsilon

In [None]:
def decay_epsilon(self):
    """
    Decay epsilon value to reduce exploration over time.
    """
    self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)
    return self.epsilon
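
To see how this multiplicative schedule behaves over several episodes, the sketch below replays the same rule as a free function. The name `epsilon_schedule` and the deliberately aggressive decay factor of 0.5 are assumptions chosen so the floor is reached quickly.

```python
def epsilon_schedule(initial_epsilon, epsilon_decay, final_epsilon, episodes):
    """Return the epsilon value after each of the given number of episodes."""
    epsilon, history = initial_epsilon, []
    for _ in range(episodes):
        # Same rule as decay_epsilon: multiply, but never drop below the floor
        epsilon = max(final_epsilon, epsilon * epsilon_decay)
        history.append(epsilon)
    return history


history = epsilon_schedule(1.0, 0.5, 0.1, episodes=5)
print(history)
```

Epsilon halves each episode (0.5, 0.25, 0.125) and then stays at the floor of 0.1, so a small amount of exploration remains for the rest of training.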

extract_policy_state

In [None]:
def extract_policy_state(self, obs):
    """
    Extract policy state from dict to numpy array and normalize.

    Args:
        obs (dict): State observation.
    
    Returns:
        np.ndarray: Normalized policy state.
    """
    policy = obs['policy']
    state = np.array(policy[:, :4].tolist(), dtype=np.float32)
    
    # Define bounds as arrays
    bound = np.array([ 3,  np.deg2rad(24),  5,  5], dtype=np.float32)
    
    # Clip to bounds
    state = np.clip(state, -1*bound, bound)
    
    return state / bound
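
The clip-then-divide step can be checked on a single hypothetical observation. This sketch uses plain Python lists instead of NumPy and reuses the same four bounds; the raw values and the name `normalize_state` are made up for illustration.

```python
import math

# Same bounds as in extract_policy_state
BOUND = [3.0, math.radians(24), 5.0, 5.0]


def normalize_state(raw):
    """Clip each feature to [-bound, bound], then scale it into [-1, 1]."""
    clipped = [max(-b, min(b, x)) for x, b in zip(raw, BOUND)]
    return [x / b for x, b in zip(clipped, BOUND)]


state = normalize_state([4.5, math.radians(12), -2.5, 0.0])
print(state)
```

The first feature exceeds its bound and saturates at 1.0, while the others scale to roughly 0.5, -0.5 and 0.0, so every feature reaches the learner on a comparable scale.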

#### Linear Q-learning

The Linear Q-learning code follows the pseudocode in Part 1 and is split into 6 functions.

q(state, action: optional): estimates the Q-value for a given state and (optionally) action as the dot product between the state and the weights.

In [None]:
def q(self, state, a=None):
    """
    Linearly estimates the Q-value for a given state and (optionally) action.

    Args:
        state (np.array): The current state observation, containing feature representations.
        a (int, optional): Action index. If None, returns Q-values for all actions.

    Returns:
        float or np.array: Q(s, a) if action is specified; otherwise, Q(s, :) for all actions.
    """
    if a is None:
        # Get q values from all action in state
        return state @ self.w
    else:
        # Get q values given action & state
        return state @ self.w[:, a]

select_action(state): Select an action from Q value or random based on an epsilon-greedy policy.

In [None]:
def select_action(self, state):
    """
    Select an action based on an epsilon-greedy policy.
    
    Args:
        state (np.array): The current state of the environment.
    
    Returns:
        tuple (int, Tensor):
            - int: Index of the selected action.
            - Tensor: The selected action.
    """
    if np.random.rand() < self.epsilon:
        action_index = np.random.randint(self.num_of_action)
    else:
        q_values = self.q(state)
        action_index = int(np.argmax(q_values))

    return action_index, self.scale_action(action_index)

update(state, action_idx, reward, next_state, terminated): Updates the weight vector using the Temporal Difference (TD) error and Gradient descent update.

In [None]:
def update(
    self,
    state,
    action_idx: int,
    reward: float,
    next_state,
    terminated: bool
):
    """
    Updates the weight vector using the Temporal Difference (TD) error 
    in Q-learning with linear function approximation.

    Args:
        state (np.array): The current state observation, containing feature representations.
        action_idx (int): The action index of action taken in the current state.
        reward (float): The reward received for taking the action.
        next_state (np.array): The next state observation.
        terminated (bool): Whether the episode has ended.

    Returns:
        float: Temporal Difference (TD) error
    """
    q_current = self.q(state, action_idx)
    # Do not bootstrap from the next state once the episode has ended
    q_next = 0.0 if terminated else np.max(self.q(next_state))
    td_target = reward + (self.discount_factor * q_next)
    td_error = td_target - q_current

    # Gradient descent update
    self.w[:, action_idx] += self.lr * td_error * state

    return td_error

learn(env): This is the main function. It trains the agent for 1 episode.

In [None]:
def learn(self, env):
    """
    Train the agent for 1 episode. (Also works with multiple environments.)

    Args:
        env: The environment in which the agent interacts.

    Returns:
        Tuple[List[float], List[int], List[float]]:
            - List[float]: Episode return for each environment.
            - List[int]: Alive time steps for each environment.
            - List[float]: Average TD error for each environment.
    """
    obs_list, _ = env.reset()
    state_list = self.extract_policy_state(obs_list)
    num_envs = len(state_list)
    dones = [False] * num_envs
    cumulative_rewards = [0.0] * num_envs
    steps = [0] * num_envs
    losses = [[] for _ in range(num_envs)]

    while not all(dones):
        # agent stepping
        actions_idx = []
        actions = []

        for i, state in enumerate(state_list):
            if dones[i]:
                actions_idx.append(0)
                actions.append(torch.tensor([[0.0]], dtype=torch.float32))
            else:
                a_idx, a_cont = self.select_action(state)
                actions_idx.append(a_idx)
                actions.append(a_cont)
        actions = torch.cat(actions, dim=0)

        # env stepping
        next_obs_list, rewards, terminations, truncations, _ = env.step(actions)
        next_state_list = self.extract_policy_state(next_obs_list)
        
        for i in range(num_envs):
            if not dones[i]:
                done = bool(terminations[i].item()) or bool(truncations[i].item())
                loss = self.update(state_list[i], actions_idx[i], rewards[i].item(), next_state_list[i], done)
                losses[i].append(loss)
                cumulative_rewards[i] += rewards[i].item()
                steps[i] += 1
                dones[i] = done
                state_list[i] = next_state_list[i]
        
    self.decay_epsilon()
    avg_losses = [np.mean(l) if l else 0.0 for l in losses]

    return cumulative_rewards, steps, avg_losses

save_model(path, filename), load_model(path, filename): save and load the weights at the given path and filename; the weights are stored as a JSON file.

In [None]:
def save_model(self, path, filename):
    """
    Save weight parameters.

    Args:
        path (str): Directory to save model.
        filename (str): Name of the file.
    """
    os.makedirs(path, exist_ok=True)
    full_path = os.path.join(path, filename)
    with open(full_path, 'w') as f:
        json.dump(self.w.tolist(), f)
        
def load_model(self, path, filename):
    """
    Load weight parameters.

    Args:
        path (str): Directory to load the model from.
        filename (str): Name of the file.
    """
    full_path = os.path.join(path, filename)
    with open(full_path, 'r') as f:
        self.w = np.array(json.load(f))

#### DQN

The DQN code follows the pseudocode in Part 1 and is split into 8 functions. There is also a DQN_network class, which defines the policy_net.

DQN_network: Neural network model for the Deep Q-Network algorithm. It calculates the expected Q-value Q(s,a) of each action for a given state.
- The network is built as Linear -> ReLU -> Dropout -> Linear
- The forward function passes data through the network
- DQN uses this architecture for 2 networks:
    - policy_net: estimates Q-values for action selection.
    - target_net: soft-updated from policy_net and used to compute the targets for updating policy_net.

In [None]:
class DQN_network(nn.Module):
    """
    Neural network model for the Deep Q-Network algorithm.
    
    Args:
        n_observations (int): Number of input features.
        hidden_size (int): Number of hidden neurons.
        n_actions (int): Number of possible actions.
        dropout (float): Dropout rate for regularization.
    """
    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(DQN_network, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_observations, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        """
        Forward pass through the network.
        
        Args:
            x (Tensor): Input state tensor.
        
        Returns:
            Tensor: Q-value estimates for each action.
        """
        return self.net(x)

select_action(state): Select an action from Q value or random based on an epsilon-greedy policy.

In [None]:
def select_action(self, state):
    """
    Select an action based on an epsilon-greedy policy.
    
    Args:
        state (np.array): The current state of the environment.
    
    Returns:
        int: action index
        Tensor: The selected action.
    """
    if np.random.rand() < self.epsilon:
        action_idx = np.random.randint(self.num_of_action)
    else:
        with torch.no_grad():
            q_values = self.policy_net(state)
            action_idx = q_values.argmax(1).item()

    return action_idx, self.scale_action(action_idx)

calculate_loss(non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch): Computes the loss for policy optimization.

In [None]:
def calculate_loss(self, non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch):
    """
    Computes the loss for policy optimization.

    Args:
        non_final_mask (Tensor): Mask indicating which states are non-final.
        non_final_next_states (Tensor): The next states that are not terminal.
        state_batch (Tensor): Batch of current states.
        action_batch (Tensor): Batch of actions taken.
        reward_batch (Tensor): Batch of received rewards.
    
    Returns:
        Tensor: Computed loss.
    """
    state_action_values = self.policy_net(state_batch).gather(1, action_batch.unsqueeze(1)) # shape: [batch_size, 1]
    next_state_values = torch.zeros(self.batch_size , device=self.device) # shape: [batch_size]

    if non_final_next_states.size(0) > 0:
        next_state_values[non_final_mask] = self.target_net(non_final_next_states).max(1)[0].detach() # shape: [num_non_final]
        
    expected_state_action_values = (reward_batch + (self.discount_factor * next_state_values)).unsqueeze(1)
    return F.smooth_l1_loss(state_action_values, expected_state_action_values)
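
The role of the non-final mask is easier to see on a tiny batch. The sketch below computes TD targets in plain Python for two hypothetical transitions, one of which ends the episode. The function name `td_targets` and the numbers are assumptions; the tensor version above does the same thing by leaving terminal entries of `next_state_values` at zero.

```python
def td_targets(rewards, next_q_max, dones, gamma):
    """Return r + gamma * max_a' Q(s', a'), with no bootstrap term for terminal transitions."""
    return [
        r + (0.0 if done else gamma * q)
        for r, q, done in zip(rewards, next_q_max, dones)
    ]


# Two hypothetical transitions with identical rewards and next-state values;
# only the second one is terminal.
targets = td_targets([1.0, 1.0], [2.0, 2.0], [False, True], gamma=0.5)
print(targets)
```

The non-terminal transition gets a target of 1 + 0.5 * 2 = 2.0, while the terminal one keeps only its immediate reward of 1.0.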

generate_sample(batch_size): Generates a batch sample from memory(Replay Buffer) for training.

In [None]:
def generate_sample(self, batch_size):
    """
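    Generates a batch sample from memory for training.

    Args:
        batch_size (int): Minimum number of stored experiences required before sampling.

    Returns:
        Tuple or None: None if memory holds fewer than batch_size experiences; otherwise a tuple containing: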
            - non_final_mask (Tensor): A boolean mask indicating which states are non-final.
            - non_final_next_states (Tensor): The next states that are not terminal.
            - state_batch (Tensor): The batch of current states.
            - action_batch (Tensor): The batch of actions taken.
            - reward_batch (Tensor): The batch of rewards received.
    """
    if len(self.memory) < batch_size:
        return None
    states, actions, rewards, next_states, dones = self.memory.sample()
    non_final_mask = ~dones
    non_final_next_states = next_states[non_final_mask]
    return non_final_mask, non_final_next_states, states, actions, rewards

update_policy(): Updates the policy using the loss computed by calculate_loss.

In [None]:
def update_policy(self):
    """
    Update the policy using the calculated loss.

    Returns:
        float or None: Loss value after the update, or None if the replay
        buffer does not yet hold a full batch.
    """
    # Generate a sample batch (None until the buffer holds a full batch)
    sample = self.generate_sample(self.batch_size)
    if sample is None:
        return None
    non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch = sample
    
    # Compute loss
    loss = self.calculate_loss(non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch)

    # Perform gradient descent step
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()

update_target_networks(): Soft update of target network weights using Polyak averaging.

In [None]:
def update_target_networks(self):
    """
    Soft update of target network weights using Polyak averaging.
    """
    # Retrieve the state dictionaries (weights) of both networks
    target_net_state_dict = self.target_net.state_dict()
    policy_net_state_dict = self.policy_net.state_dict()
    
    # Apply the soft update rule to each parameter in the target network
    for key in target_net_state_dict:
        target_net_state_dict[key] = self.tau * policy_net_state_dict[key] + (1.0 - self.tau) * target_net_state_dict[key]
    
    # Load the updated weights into the target network
    self.target_net.load_state_dict(target_net_state_dict)

learn(env): This is the main function. It trains the agent for 1 episode: it starts by resetting the environment, then loops over playing actions and sampling experiences to update policy_net and target_net.

In [None]:
def learn(self, env):
    """
    Train the agent for 1 episode. (Also works with multiple environments.)

    Args:
        env: The environment to train in.

    Returns:
        Tuple[List[float], List[int], Optional[float]]:
            - List[float]: Episode return for each environment.
            - List[int]: Alive time steps for each environment.
            - Optional[float]: Loss from the last policy update (None if no update ran on the last step).
    """
    obs_list, _ = env.reset()
    state_list = self.extract_policy_state(obs_list)
    num_envs = len(state_list)
    dones = [False] * num_envs
    cumulative_rewards = [0.0] * num_envs
    steps = [0] * num_envs
    loss = [0.0] * num_envs

    while not all(dones):
        # Predict action from the policy network
        actions_idx = []
        actions = []

        for i, state in enumerate(state_list):
            if dones[i]:
                actions_idx.append(0)
                actions.append(torch.tensor([[0.0]], dtype=torch.float32))
            else:
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
                a_idx, a_cont = self.select_action(state_tensor)
                actions_idx.append(a_idx)
                actions.append(a_cont)
        actions = torch.cat(actions, dim=0)

        # Execute action in the environment and observe next state and reward
        next_obs_list, rewards, terminations, truncations, _ = env.step(actions)
        next_state_list = self.extract_policy_state(next_obs_list)

        # Store the transition in memory
        for i in range(num_envs):
            if not dones[i]:
                done = bool(terminations[i].item()) or bool(truncations[i].item())
                self.memory.add(
                    torch.tensor(state_list[i], dtype=torch.float32),
                    actions_idx[i],
                    rewards[i].item(),
                    torch.tensor(next_state_list[i], dtype=torch.float32),
                    done
                )
                cumulative_rewards[i] += rewards[i].item()
                steps[i] += 1
                dones[i] = done
                state_list[i] = next_state_list[i]

        # Perform one step of the optimization (on the policy network)
        loss = self.update_policy()
        # Soft update of the target network's weights
        self.update_target_networks()

    self.decay_epsilon()

    return cumulative_rewards, steps, loss

save_model(path, filename), load_model(path, filename): save and load the network model at the given path and filename; the model is stored as a PyTorch checkpoint.

In [None]:
def save_model(self, path, filename):
    """
    Save model network.

    Args:
        path (str): Directory to save model.
        filename (str): Name of the file.
    """
    os.makedirs(path, exist_ok=True)
    full_path = os.path.join(path, filename)
    torch.save({
        'policy_net': self.policy_net.state_dict(),
        'target_net': self.target_net.state_dict(),
        'optimizer': self.optimizer.state_dict(),
    }, full_path)

def load_model(self, path, filename):
    """
    Load model network.

    Args:
        path (str): Directory to load the model from.
        filename (str): Name of the file.
    """
    full_path = os.path.join(path, filename)
    checkpoint = torch.load(full_path, map_location=self.device)
    self.policy_net.load_state_dict(checkpoint['policy_net'])
    self.target_net.load_state_dict(checkpoint['target_net'])
    self.optimizer.load_state_dict(checkpoint['optimizer'])

#### MC REINFORCE

The MC REINFORCE code follows the pseudocode in Part 1 and is split into 8 functions. There is also an MC_REINFORCE_network class, which defines the policy_net.

MC_REINFORCE_network: Neural network model for the MC REINFORCE algorithm. It calculates the probability of selecting each action (the policy) for a given state.
- The network is built as Linear -> ReLU -> Dropout -> Linear -> Softmax
- The forward function passes data through the network

In [None]:
class MC_REINFORCE_network(nn.Module):
    """
    Neural network for the MC_REINFORCE algorithm.
    
    Args:
        n_observations (int): Number of input features.
        hidden_size (int): Number of hidden neurons.
        n_actions (int): Number of possible actions.
        dropout (float): Dropout rate for regularization.
    """

    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(MC_REINFORCE_network, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_observations, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, n_actions),
            nn.Softmax(dim=-1)  # Output probabilities
        )

    def forward(self, x):
        """
        Forward pass through the network.
        
        Args:
            x (Tensor): Input tensor.
        
        Returns:
            Tensor: Output tensor representing action probabilities.
        """
        return self.net(x)

select_action(state): Samples an action from the probability distribution (policy) produced by policy_net.

In [None]:
def select_action(self, state):
    """
    Selects an action based on the current policy.
    
    Args:
        state (Tensor): The current state of the environment.

    Returns:
        Tuple[int, Tensor, distributions.Categorical]:
            - int: Index of the selected action.
            - Tensor: Scaled continuous action.
            - Categorical: Torch distribution object used for sampling/log_probs.
    """
    probs = self.policy_net(state).to(self.device)
    dist = distributions.Categorical(probs)
    action_idx = dist.sample()
    action = self.scale_action(action_idx.item())
    return action_idx.item(), action, dist
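
The sampling step can be illustrated without PyTorch. The sketch below is a standard-library stand-in for `distributions.Categorical`: it draws an action index from a probability list and returns its log-probability. The helper name `sample_action` and the probability values are hypothetical, chosen only for illustration.

```python
import math
import random


def sample_action(probs, rng):
    """Draw an action index from `probs` and return it with its log-probability."""
    action_idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action_idx, math.log(probs[action_idx])


rng = random.Random(0)       # fixed seed so the run is repeatable
probs = [0.7, 0.2, 0.1]      # hypothetical policy output for one state
counts = [0, 0, 0]
for _ in range(10_000):
    action_idx, log_prob = sample_action(probs, rng)
    counts[action_idx] += 1

# Empirical frequencies should be close to probs, not exactly equal.
print([c / 10_000 for c in counts])
```

Because actions are sampled rather than chosen by argmax, the policy stays stochastic, and low-probability actions are still tried occasionally.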

calculate_stepwise_returns(rewards): Calculates the discounted return at each step of the episode from the reward list, then normalizes the returns.

In [None]:
def calculate_stepwise_returns(self, rewards):
    """
    Compute stepwise returns for the trajectory.

    Args:
        rewards (list(float)): List of rewards obtained in the episode.
    
    Returns:
        Tensor: Normalized stepwise returns.
    """
    R = 0
    returns = []
    for r in reversed(rewards):
        R = r + self.discount_factor * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32).to(self.device)
    if len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns
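
As a small worked example of the backward recursion, here is a standard-library sketch with hypothetical helper names (`discounted_returns`, `normalize`). It uses append followed by a single reverse instead of `insert(0, ...)`, and it uses the sample standard deviation to mirror the default behaviour of `torch.std`.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    running = 0.0
    returns = []
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    n = len(values)
    if n < 2:
        return list(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # Bessel's correction
    return [(v - mean) / (var ** 0.5 + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)             # [1.75, 1.5, 1.0]
print(normalize(returns))  # zero-mean version of the same returns
```

Normalizing makes some returns negative even when every reward is positive, so actions from below-average steps are pushed down instead of all actions being pushed up.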

generate_trajectory(env): Runs the agent with the current policy until the episode ends, storing the log probabilities and returns needed to calculate the loss function.

In [None]:
def generate_trajectory(self, env):
    """
    Generate a trajectory by interacting with the environment (also works with multiple environments).

    Args:
        env: The environment object.
    
    Returns:
        Tuple(List[float], List[int], List[Tensor], List[Tensor], List[List[Tuple]]):
        - List[float]: Total return for each environment.
        - List[int]: Episode length for each environment.
        - List[Tensor]: Discounted and normalized return for each step in each environment.
        - List[Tensor]: Log probabilities of the actions taken at each step per environment.
        - List[List[Tuple]]: Full trajectory (state, action, reward) per environment.
        
    """
    obs_list, _ = env.reset()
    state_list = self.extract_policy_state(obs_list)
    num_envs = len(state_list)
    dones = [False] * num_envs
    cumulative_rewards = [0.0] * num_envs
    steps = [0] * num_envs
    log_probs_list = [[] for _ in range(num_envs)]
    rewards_list = [[] for _ in range(num_envs)]
    trajectory_list = [[] for _ in range(num_envs)]
    timestep = 0
    while not all(dones):
        actions_idx = []
        actions = []
        dists = []

        for i, state in enumerate(state_list):
            if dones[i]:
                actions_idx.append(0)
                actions.append(torch.tensor([[0.0]], dtype=torch.float32))
                dists.append(None)
            else:
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
                action_idx, action, dist = self.select_action(state_tensor)
                actions.append(action)
                actions_idx.append(action_idx)
                dists.append(dist)
        actions = torch.cat(actions, dim=0).to(self.device)

        next_obs_list, rewards, terminations, truncations, _ = env.step(actions)
        next_state_list = self.extract_policy_state(next_obs_list)

        for i in range(num_envs):
            if not dones[i]:
                done = bool(terminations[i].item()) or bool(truncations[i].item())
                log_probs_list[i].append(dists[i].log_prob(torch.tensor(actions_idx[i]).to(self.device)))
                rewards_list[i].append(rewards[i].item())
                trajectory_list[i].append((state_list[i], actions_idx[i], rewards[i].item()))
                cumulative_rewards[i] += rewards[i].item()
                steps[i] += 1
                dones[i] = done
                state_list[i] = next_state_list[i]

        timestep += 1

    all_returns = []
    all_log_probs = []
    for i in range(num_envs):
        stepwise_returns = self.calculate_stepwise_returns(rewards_list[i])
        all_returns.append(stepwise_returns)
        all_log_probs.append(torch.stack(log_probs_list[i]).squeeze(-1))

    return cumulative_rewards, steps, all_returns, all_log_probs, trajectory_list

calculate_loss(returns_batch, log_probs_batch): Uses the per-step log probabilities and per-step returns to calculate the loss function of the MC_REINFORCE network.

In [None]:
def calculate_loss(self, returns_batch, log_probs_batch):
    """
    Compute the loss for policy optimization.

    Args:
        returns_batch (List[Tensor]): List of return tensors for each trajectory.
        log_probs_batch (List[Tensor]): List of log-probability tensors for each trajectory.
    
    Returns:
        Tensor: Computed loss.
    """
    loss = torch.tensor(0.0, device=self.device)
    for R, log_probs in zip(returns_batch, log_probs_batch):
        loss += -(log_probs * R).sum()
    loss /= sum(len(R) for R in returns_batch)
    return loss
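
To make the averaging explicit, the sketch below recomputes the same quantity with plain floats: the sum of `-log_prob * return` over all steps of all trajectories, divided by the total number of steps. The function name `reinforce_loss` and the numbers are hypothetical.

```python
import math


def reinforce_loss(returns_batch, log_probs_batch):
    """Mean over all steps of -log pi(a_t | s_t) * G_t."""
    total = 0.0
    n_steps = 0
    for returns, log_probs in zip(returns_batch, log_probs_batch):
        total += -sum(lp * g for lp, g in zip(log_probs, returns))
        n_steps += len(returns)
    return total / n_steps


# Two hypothetical trajectories (2 steps and 1 step) from two environments.
log_probs_batch = [[math.log(0.5), math.log(0.25)], [math.log(0.5)]]
returns_batch = [[1.0, -1.0], [2.0]]
loss = reinforce_loss(returns_batch, log_probs_batch)
print(loss)  # equals ln(2) / 3 for these numbers
```

Dividing by the total step count rather than the number of trajectories keeps the loss scale comparable when episodes have very different lengths.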

update_policy(returns_batch, log_probs_batch): Calculates the loss function and updates the MC_REINFORCE network.

In [None]:
def update_policy(self, returns_batch, log_probs_batch):
    """
    Update the policy using the calculated loss.

    Args:
        returns_batch (List[Tensor]): List of return tensors for each trajectory.
        log_probs_batch (List[Tensor]): List of log-probability tensors for each trajectory.
    
    Returns:
        float: Loss value after the update.
    """
    loss = self.calculate_loss(returns_batch, log_probs_batch)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()

learn(env): This is the main function used in the training code. It trains the agent for 1 episode: generate_trajectory resets the environment, loops over actions, and stores the data, which is then used to update policy_net.

In [None]:
def learn(self, env):
    """
    Train the agent on a single episode (also works with multiple environments).

    Args:
        env: The environment to train in.
    
    Returns:
        Tuple(List[float], List[int], float, List[List[Tuple]]):
            - List[float]: Total return per environment.
            - List[int]: Episode length per environment.
            - float: Policy loss after the update.
            - List[List[Tuple]]: Trajectory of (state, action, reward) per env.
    """
    self.policy_net.train()
    episode_return, step, stepwise_returns, log_prob_actions, trajectory = self.generate_trajectory(env)
    loss = self.update_policy(stepwise_returns, log_prob_actions)
    return episode_return, step, loss, trajectory

save_model(path, filename), load_model(path, filename): Save and load the network model at the given path and filename. The model is stored as tensors (state dicts).

In [None]:
def save_model(self, path, filename):
    """
    Save model network.

    Args:
        path (str): Directory to save model.
        filename (str): Name of the file.
    """
    os.makedirs(path, exist_ok=True)
    full_path = os.path.join(path, filename)
    torch.save({
        'policy_net': self.policy_net.state_dict(),
        'optimizer': self.optimizer.state_dict(),
    }, full_path)

def load_model(self, path, filename):
    """
    Load model network.

    Args:
        path (str): Directory to load the model from.
        filename (str): Name of the file.
    """
    full_path = os.path.join(path, filename)
    checkpoint = torch.load(full_path, map_location=self.device)
    self.policy_net.load_state_dict(checkpoint['policy_net'])
    self.optimizer.load_state_dict(checkpoint['optimizer'])

#### A2C

The A2C code follows the pseudocode in Part 1 and is split into 7 functions. It also includes the A2C_Actor class, which defines the actor network, and the A2C_Critic class, which defines the critic network.

A2C_Actor: Neural network model for the A2C algorithm. It outputs the policy for a given state: action probabilities in the discrete case, or the mean and standard deviation of a Gaussian in the continuous case.
- The network is built as
    - for discrete: Linear -> ReLU -> Linear -> ReLU -> Linear -> Softmax
    - for continuous: a shared Linear -> ReLU -> Linear -> ReLU trunk followed by two heads
        - for mu: Linear
        - for std: Linear, whose output is treated as log std, clamped to [-20, 2], and exponentiated
- The forward function passes data through the network

In [None]:
class A2C_Actor(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=1, is_discrete=True):
        
        super(A2C_Actor, self).__init__()
        self.is_discrete = is_discrete
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

        if self.is_discrete:
            self.fc1 = nn.Linear(hidden_dim, output_dim)  # for logits → Categorical
        else:
            self.mu_net = nn.Linear(hidden_dim, 1)
            self.std_net = nn.Linear(hidden_dim, 1)

        self.init_weights()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)  # Xavier initialization
                nn.init.zeros_(m.bias)  # Initialize bias to 0

    def forward(self, state):
        x = self.net(state)
        if self.is_discrete:
            logits = self.fc1(x)
            probs = F.softmax(logits, dim=-1)
            return probs
        else:
            mu = self.mu_net(x)
            log_std = self.std_net(x)
            std = torch.exp(log_std.clamp(-20, 2))  # stability
            return mu, std

A2C_Critic: Neural network model for the A2C algorithm. It estimates the expected value of a state.
- The network is built as Linear -> ReLU -> Linear -> ReLU -> Linear
- The forward function passes data through the network

In [None]:
class A2C_Critic(nn.Module):
    def __init__(self, state_dim, hidden_dim):
        
        super(A2C_Critic, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.init_weights()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')  # Kaiming initialization
                nn.init.zeros_(m.bias)  # Initialize bias to 0

    def forward(self, state):
        return self.net(state)

select_action(state): Selects an action based on the current policy from the actor network.

In [None]:
def select_action(self, state):
    with torch.no_grad():
        if self.is_discrete:
            probs = self.actor(state)
            dist = distributions.Categorical(probs)
            action_idx = dist.sample()
            action = self.scale_action(action_idx.item())
            return action_idx.item(), action
        else:
            mu, std = self.actor(state)
            base_dist = distributions.Normal(mu, std)
            dist = distributions.TransformedDistribution(base_dist, [distributions.TanhTransform(cache_size=1)])
            action = dist.sample()
            scaled_action = action * self.action_range[1]
            return action, scaled_action
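
For the continuous branch, the density of a tanh-squashed Gaussian follows from the change-of-variables rule: if u ~ Normal(mu, std) and a = tanh(u), then log p(a) = log N(u; mu, std) - log(1 - a^2). The sketch below works this out with plain floats. The helper names and `max_action` are hypothetical, and it is meant only to illustrate what `TransformedDistribution` with `TanhTransform` computes, not to replace it.

```python
import math


def normal_log_prob(u, mu, std):
    """Log-density of Normal(mu, std) evaluated at u."""
    return -0.5 * ((u - mu) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def squashed_log_prob(a, mu, std):
    """Log-density of a = tanh(u) with u ~ Normal(mu, std), for a in (-1, 1)."""
    u = math.atanh(a)  # undefined at exactly -1 or 1, hence the clamp in training
    return normal_log_prob(u, mu, std) - math.log(1.0 - a * a)


max_action = 3.0             # hypothetical stand-in for action_range[1]
u = 0.5                      # a pre-squash sample
a = math.tanh(u)             # squashed action in (-1, 1)
scaled_action = a * max_action
print(a, scaled_action, squashed_log_prob(a, mu=0.0, std=1.0))
```

The `atanh` call explains why stored actions are clamped slightly inside (-1, 1) before their log-probability is evaluated: at the boundary the correction term diverges.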

generate_sample(batch_size): Generates a batch of samples from memory (replay buffer) for training.

In [None]:
def generate_sample(self, batch_size):
    if len(self.memory) < batch_size:
        return None
    states, actions, rewards, next_states, dones = self.memory.sample()
    non_final_mask = ~dones
    return non_final_mask, next_states, states, actions, rewards

calculate_loss(non_final_mask, next_states, state_batch, action_batch, reward_batch): Computes the losses used to optimize the actor and critic networks, following the equations in Part 1.

In [None]:
def calculate_loss(self, non_final_mask, next_states, state_batch, action_batch, reward_batch):
    value = self.critic(state_batch).squeeze(-1)  # Estimate the value of the current state
    next_values = self.critic(next_states).squeeze(-1).detach()
    non_final_mask = non_final_mask.float()

    # Target with full batch
    target = reward_batch + non_final_mask * self.discount_factor * next_values
    critic_loss = F.mse_loss(value, target)  # Calculate the critic loss

    advantage = (target - value).detach()

    if self.is_discrete:
        probs = self.actor(state_batch)
        dist = distributions.Categorical(probs)
        log_probs = dist.log_prob(action_batch)
        entropy = dist.entropy().sum(dim=-1)
    else:
        mu, std = self.actor(state_batch)
        base_dist = distributions.Normal(mu, std)
        dist = distributions.TransformedDistribution(base_dist, [distributions.TanhTransform(cache_size=1)])
        eps = 1e-6
        action_batch = action_batch / self.action_range[1]  # scale to [-1, 1]
        action_batch = action_batch.clamp(-1 + eps, 1 - eps)

        log_probs = dist.log_prob(action_batch).sum(dim=-1)
        entropy = base_dist.entropy().sum(dim=-1)

    actor_loss = (-log_probs * advantage - self.entropy_coef * entropy).mean()

    return actor_loss, critic_loss
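
The target and advantage can be checked by hand on a tiny batch. The sketch below uses plain Python lists with hypothetical values and helper names. A terminal transition contributes no bootstrapped value, which is the role of `non_final_mask` in the cell above.

```python
def td_targets(rewards, next_values, dones, gamma):
    """One-step TD target: r + gamma * V(s') for non-terminal steps, r otherwise."""
    return [r + (0.0 if d else gamma * v) for r, v, d in zip(rewards, next_values, dones)]


def advantages(targets, values):
    """Advantage estimate: TD target minus the critic's current value."""
    return [t - v for t, v in zip(targets, values)]


def mse(targets, values):
    """Mean squared error used as the critic loss."""
    return sum((t - v) ** 2 for t, v in zip(targets, values)) / len(values)


rewards = [1.0, 1.0]
values = [2.0, 0.5]        # hypothetical V(s) from the critic
next_values = [3.0, 4.0]   # hypothetical V(s')
dones = [False, True]      # the second transition ends the episode

targets = td_targets(rewards, next_values, dones, gamma=0.5)
print(targets)                      # [2.5, 1.0]
print(advantages(targets, values))  # [0.5, 0.5]
print(mse(targets, values))         # 0.25
```

A positive advantage means the action did better than the critic expected, so the actor loss raises its log-probability. A negative advantage lowers it.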

update_policy(): Samples experience, calculates the loss functions, and updates the actor and critic networks.

In [None]:
def update_policy(self):
    sample = self.generate_sample(self.batch_size)
    if sample is None:
        return 0.0, 0.0
    non_final_mask, next_states, state_batch, action_batch, reward_batch = sample

    reward_batch = (reward_batch - reward_batch.mean()) / (reward_batch.std() + 1e-7)

    actor_loss, critic_loss = self.calculate_loss(non_final_mask, next_states, state_batch, action_batch, reward_batch)
    
    self.optimizer_critic.zero_grad()
    critic_loss.backward()
    self.optimizer_critic.step()

    self.optimizer_actor.zero_grad()
    actor_loss.backward()
    self.optimizer_actor.step()


    return actor_loss.item(), critic_loss.item()

learn(env): This is the main function used in the training code. It trains the agent for 1 episode: reset the environment, then loop (add experience to memory -> sample -> calculate loss -> update the actor and critic networks) until the episode ends.

In [None]:
def learn(self, env):
    obs_list, _ = env.reset()
    state_list = self.extract_policy_state(obs_list)
    num_envs = len(state_list)
    dones = [False] * num_envs
    cumulative_rewards = [0.0] * num_envs
    steps = [0] * num_envs
    actor_losses = []
    critic_losses = []

    while not all(dones):
        actions_idx = []
        actions = []

        for i, state in enumerate(state_list):
            if dones[i]:
                actions_idx.append(0)
                actions.append(torch.tensor([[0.0]], dtype=torch.float32))
            else:
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
                a_idx, a_cont = self.select_action(state_tensor)
                actions_idx.append(a_idx)
                actions.append(a_cont)
        actions = torch.cat(actions, dim=0)

        next_obs_list, rewards, terminations, truncations, _ = env.step(actions)
        next_state_list = self.extract_policy_state(next_obs_list)

        for i in range(num_envs):
            if not dones[i]:
                done = bool(terminations[i].item()) or bool(truncations[i].item())
                self.memory.add(
                    torch.tensor(state_list[i], dtype=torch.float32),
                    actions_idx[i],
                    rewards[i].item(),
                    torch.tensor(next_state_list[i], dtype=torch.float32),
                    done
                )
                cumulative_rewards[i] += rewards[i].item()
                steps[i] += 1
                dones[i] = done
                state_list[i] = next_state_list[i]

        actor_loss, critic_loss = self.update_policy()
        actor_losses.append(actor_loss)
        critic_losses.append(critic_loss)

    return cumulative_rewards, steps, np.mean(actor_losses), np.mean(critic_losses)

save_model(path, filename), load_model(path, filename): Save and load the network model at the given path and filename. The model is stored as tensors (state dicts).

In [None]:
def save_model(self, path, filename):
    """
    Save model network.

    Args:
        path (str): Directory to save model.
        filename (str): Name of the file.
    """
    os.makedirs(path, exist_ok=True)
    full_path = os.path.join(path, filename)
    torch.save({
        'actor': self.actor.state_dict(),
        'critic': self.critic.state_dict(),
        'optimizer_actor': self.optimizer_actor.state_dict(),
        'optimizer_critic': self.optimizer_critic.state_dict(),
    }, full_path)

def load_model(self, path, filename):
    """
    Load model network.

    Args:
        path (str): Directory to load the model from.
        filename (str): Name of the file.
    """
    full_path = os.path.join(path, filename)
    checkpoint = torch.load(full_path, map_location=self.device)
    self.actor.load_state_dict(checkpoint['actor'])
    self.critic.load_state_dict(checkpoint['critic'])
    self.optimizer_actor.load_state_dict(checkpoint['optimizer_actor'])
    self.optimizer_critic.load_state_dict(checkpoint['optimizer_critic'])

## Part 3: Training & Playing to stabilize Cart-Pole Agent.

Implement the training loop collect data, analyze results, and save models for evaluating agent performance.

In each algorithm, the per-step training logic lives inside the learn(env) function, so the training loop in the training script simply calls learn once per episode and logs the data to wandb (the logged data differs between models).

In [None]:
# reset environment
timestep = 0
sum_reward = 0
# simulate environment
while simulation_app.is_running():
    # run everything in inference mode
    # with torch.inference_mode():
    
    for episode in tqdm(range(n_episodes)):
        cumulative_rewards, steps, losses = agent.learn(env)

        cumulative_reward = sum(cumulative_rewards) / len(cumulative_rewards)
        step = sum(steps) / len(steps)
        loss = sum(losses) / len(losses)

        moving_avg_window.append(cumulative_reward)
        moving_avg_reward = sum(moving_avg_window) / len(moving_avg_window)

        moving_avg_window2.append(step)
        moving_avg_step = sum(moving_avg_window2) / len(moving_avg_window2)

        moving_avg_window3.append(loss)
        moving_avg_loss = sum(moving_avg_window3) / len(moving_avg_window3)
        
        wandb.log({
            "avg_reward" : moving_avg_reward,
            "reward" : cumulative_reward,
            "avg_step" : moving_avg_step,
            "step" : step,
            "avg_loss" : moving_avg_loss,
            "loss" : loss,
            "epsilon" : agent.epsilon
        })

        sum_reward += cumulative_reward
        if episode % 100 == 0:
            print("avg_score: ", sum_reward / 100.0)
            sum_reward = 0
            print(agent.epsilon)

            # Save Q-Learning agent
            w_file = f"{Algorithm_name}_{episode}_{num_of_action}_{action_range[1]}.json"
            full_path = os.path.join(f"model/{task_name}", f"{Algorithm_name}/{exp_name}")
            agent.save_model(full_path, w_file)
    
    print('Complete')
    # agent.plot_durations(show_result=True)
    plt.ioff()
    plt.show()
        
    if args_cli.video:
        timestep += 1
        # Exit the play loop after recording one video
        if timestep == args_cli.video_length:
            break

    break
# ==================================================================== #

# close the simulator
wandb.finish()
env.close()
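
The loop above appends to `moving_avg_window` without showing how it is created. One common choice, assumed here and not confirmed by the excerpt, is a `collections.deque` with a fixed `maxlen`, which drops the oldest value automatically. A minimal sketch with a hypothetical window size of 3:

```python
from collections import deque

window = deque(maxlen=3)  # hypothetical window size; old entries fall off the left
smoothed = []
for reward in [10.0, 20.0, 30.0, 40.0]:
    window.append(reward)
    smoothed.append(sum(window) / len(window))

print(smoothed)  # [10.0, 15.0, 20.0, 30.0]
```

Dividing by `len(window)` rather than the window size keeps the early averages correct while the window is still filling up.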

#### Linear Q-learning

log data
- reward
- step (episode length)
- loss
- epsilon

Hyper parameter
- num_of_action
- action_range
- learning_rate
- n_episodes
- initial_epsilon
- epsilon_decay
- final_epsilon
- discount_factor

In the Linear Q-learning experiment, I chose the most important parameters for tuning: learning_rate, epsilon_decay, and discount_factor.

##### experiment 1: changing learning rate
Learning rate controls how much each update changes the existing weights.

In this experiment, the learning rate was set to 2 values: [0.0001, 0.001]

Hypothesis: A higher learning rate makes learning faster but increases variance.

Reward, step, loss, epsilon graph:
![Linear_lr.png](img/Linear_lr.png)

conclude: 

A higher learning rate (0.001) was expected to speed up learning, but in practice it caused unstable updates that pushed the weights away from optimal values. As a result, the agent failed to solve the task.

In contrast, a lower learning rate (0.0001) led to more stable learning and higher rewards, showing that gradual updates were better suited to the Linear Q-Learning model.
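
To see how the learning rate scales a single update, here is a minimal sketch of one Linear Q-learning step, w_a <- w_a + lr * delta * s, using plain lists. For readability it stores one weight vector per action instead of the [state_size x num_actions] matrix described in Part 1. The state, reward, and function names are hypothetical.

```python
def q_value(weights, state, action):
    """Q(s, a) as the dot product of the action's weight vector and the state."""
    return sum(w * s for w, s in zip(weights[action], state))


def td_update(weights, state, action, reward, next_state, gamma, lr):
    """Apply one gradient step on 0.5 * delta**2 and return the TD error."""
    best_next = max(q_value(weights, next_state, a) for a in range(len(weights)))
    delta = reward + gamma * best_next - q_value(weights, state, action)
    weights[action] = [w + lr * delta * s for w, s in zip(weights[action], state)]
    return delta


state, next_state = [1.0, 2.0], [0.5, 0.5]
for lr in (0.0001, 0.001):
    weights = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two state features
    delta = td_update(weights, state, 0, 1.0, next_state, gamma=0.99, lr=lr)
    print(lr, delta, weights[0])
```

With the same TD error, the step taken at 0.001 is ten times larger than at 0.0001, which is consistent with the less stable updates observed for the higher value.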

##### experiment 2: changing discount factor
Discount factor determines how much the agent values future rewards compared to immediate ones.

In this experiment, the discount factor was set to 2 values: [0.95, 0.99]

Hypothesis: A higher discount factor (0.99) encourages long-term reward planning, leading to potentially higher returns, but may cause greater variance in learning

Reward, step, loss, epsilon graph:
![Linear_d.png](img/Linear_d.png)

conclude: 
A higher discount factor (0.99) leads to slightly more variance, but similar peak performance compared to 0.95. It promotes longer-term planning but does not significantly outperform in this setup.
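
A rough way to compare the two settings is the effective horizon 1 / (1 - gamma), a common rule of thumb rather than something measured in these experiments. The sketch below prints it next to the weight given to a reward 50 steps in the future.

```python
def effective_horizon(gamma):
    """Rough number of future steps that matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma, steps_ahead):
    """Weight applied to a reward received `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


for gamma in (0.95, 0.99):
    print(gamma, round(effective_horizon(gamma)), round(discount_weight(gamma, 50), 3))
```

The horizon is roughly 20 steps for 0.95 and roughly 100 steps for 0.99, so the higher value sums many more noisy future rewards into each target, which fits the slightly higher variance seen in the graph.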

##### experiment 3: changing epsilon decay
Epsilon decay controls how fast the agent shifts from exploration to exploitation.

In this experiment, the epsilon decay was set to 2 values: [0.9995, 0.9993]

Hypothesis: A higher epsilon decay factor (0.9995) shrinks epsilon more slowly, so the agent explores for longer, leading to slower learning but more accurate and stable performance in the long term.

Reward, step, loss, epsilon graph:
![Linear_epsilon.png](img/Linear_epsilon.png)

conclude: 

A lower epsilon decay factor (0.9993) shrank epsilon faster, so the agent started exploiting too early, resulting in limited exploration and poor performance.

In contrast, the higher decay factor (0.9995) allowed for more exploration, helping the agent reach higher rewards and better long-term performance.
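
The gap between the two decay values is easier to judge as a number of updates. Assuming epsilon is multiplied by the decay factor once per update (the usual scheme, assumed here), the sketch below computes how many updates it takes to fall from an initial value to a final value. The values 1.0 and 0.05 are hypothetical, not the ones used in the experiments.

```python
import math


def updates_until(initial_epsilon, final_epsilon, decay):
    """Smallest n with initial_epsilon * decay**n <= final_epsilon."""
    return math.ceil(math.log(final_epsilon / initial_epsilon) / math.log(decay))


for decay in (0.9995, 0.9993):
    print(decay, updates_until(1.0, 0.05, decay))
```

Under these assumptions, 0.9995 takes roughly 6000 updates to reach 0.05 while 0.9993 takes roughly 4300, so a change in the fourth decimal place shifts the exploration phase by well over a thousand updates.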

#### DQN

log data
- reward
- step (episode length)
- loss
- epsilon

Hyper parameter
- num_of_action
- action_range
- learning_rate
- n_episodes
- initial_epsilon
- epsilon_decay
- final_epsilon
- discount_factor
- hidden_dim
- buffer_size
- batch_size
- dropout
- tau

In the DQN experiment, the following hyperparameters were selected for tuning: hidden_dim, buffer_size, batch_size, dropout, tau, and learning_rate. These parameters directly affect the capacity, stability, and learning dynamics of the neural network and replay mechanism in DQN.

We did not vary discount_factor or epsilon decay in this experiment, as their impact has already been observed and analyzed in the Linear Q-Learning experiments. Since DQN also uses an epsilon-greedy policy and discounted returns in a similar way, we expect them to behave similarly in DQN.

##### experiment 1: changing hidden_dim
Hidden dim controls the capacity of the neural network. A larger hidden dimension allows the model to represent more complex Q-functions.

In this experiment, hidden_dim was set to 3 values: [64, 128, 256]

**Hypothesis:** Increasing hidden_dim improves the model's ability to approximate Q-values, leading to better learning performance.

**Reward, step, loss, epsilon graph:**
![DQN_hidden_dim.png](img/DQN_hidden_dim.png)

**conclude:** 

As hypothesized, higher hidden dimensions (128 and 256) performed better than 64, showing higher rewards and longer episodes.
However, the difference between 128 and 256 is minor, suggesting that 128 may be enough for this task, and that increasing the model size further may cause overfitting and higher variance in action selection due to more complex Q-value estimation.

##### experiment 2: changing buffer_size
Buffer size controls the size of the memory that stores past experiences.

In this experiment, buffer_size was set to 2 values: [1000, 5000]

**Hypothesis:** Increasing buffer_size gives the agent more variety when sampling and keeps it from overfitting to recent samples.

**Reward, step, loss, epsilon graph:**
![DQN_buffer.png](img/DQN_buffer.png)

**conclude:** 

From the graph, we observe that buffer size 5000 introduces more variance in training performance.
This may be caused by the agent learning from older, less relevant experiences, which increases the variance in sampled batches.

Although larger buffer sizes improve sample diversity, they can also reduce training stability if too many outdated transitions are used.
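
The trade-off can be made concrete with a toy buffer. The sketch below is a hypothetical `ReplayBuffer` built on `collections.deque`, not the class used in this homework. It stores only a transition index so that the age of the oldest retained sample is easy to inspect.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO memory with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest items are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


for capacity in (1000, 5000):
    buffer = ReplayBuffer(capacity, seed=0)
    for step in range(6000):
        buffer.add(step)  # the transition is represented by its step index
    print(capacity, len(buffer), min(buffer.memory))
```

After 6000 steps the smaller buffer only holds transitions from step 5000 onward, while the larger one still holds transitions from step 1000, which were collected under a much older policy.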

##### experiment 3: changing batch_size
Batch size determines how many samples are used in each training update.

In this experiment, batch_size was set to 3 values: [64, 128, 256]

**Hypothesis:** Larger batch sizes provide more stable gradient estimates, which should reduce update variance and lead to smoother learning.

**Reward, step, loss, epsilon graph:**
![DQN_batch.png](img/DQN_batch.png)

**conclude:**
From the graph, we observe that batch size does not significantly impact performance in this setup.
All three values (64, 128, 256) lead to similar rewards, steps, and loss trends.
This suggests that within this range, the DQN agent is robust to changes in batch size, and no clear advantage is observed for larger batches.

##### experiment 4: changing dropout
Dropout is a regularization technique used to prevent overfitting by randomly deactivating neurons during training.

In this experiment, dropout was set to 2 values: [0.4, 0.5]

**Hypothesis:** A lower dropout rate keeps more neurons active, which may speed up learning but risks overfitting.

**Reward, step, loss, epsilon graph:**
![DQN_dropout.png](img/DQN_dropout.png)

**conclude:**
From the graph, the lower dropout (0.4) performs slightly better in terms of reward and step count.
This is likely because more neurons remain active, allowing the network to learn more effectively.

However, very low dropout values could lead to overfitting, especially in more complex tasks.

##### experiment 5: changing tau
Tau controls how fast the target network follows the policy network.

In this experiment, tau was set to 2 values: [0.01, 0.005]

**Hypothesis:** A higher tau updates the target network more aggressively, which may introduce bias and make rewards unstable.

**Reward, step, loss, epsilon graph:**
![DQN_tau.png](img/DQN_tau.png)

**conclude:**
The graph shows that higher tau (0.01) results in more unstable rewards in later episodes.
This supports the hypothesis that faster target updates introduce more bias, making learning less stable.
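
Assuming the target network is updated with the usual soft (Polyak) rule, target <- tau * policy + (1 - tau) * target, the gap between the two networks shrinks by a factor of (1 - tau) per update. The sketch below applies the rule to a plain list of weights and computes how many updates it takes to close half of the gap. The helper names are hypothetical.

```python
import math


def soft_update(target, online, tau):
    """Blend the online weights into the target weights by a fraction tau."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def updates_to_halve_gap(tau):
    """Number of soft updates after which half of the initial gap remains."""
    return math.log(0.5) / math.log(1.0 - tau)


print(soft_update([0.0], [1.0], tau=0.01))  # [0.01]
for tau in (0.01, 0.005):
    print(tau, round(updates_to_halve_gap(tau), 1))
```

Under this assumption the target follows the policy network about twice as fast with tau = 0.01 (about 69 updates to halve the gap) as with 0.005 (about 138), so the TD targets move more between updates.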

##### experiment 6: changing learning rate
Learning rate controls how much each optimization step changes the existing weights.

In this experiment, the learning rate was set to 2 values: [0.0001, 0.001]

**Hypothesis:** A higher learning rate makes learning faster but increases variance.

**Reward, step, loss, epsilon graph:**
![DQN_lr.png](img/DQN_lr.png)

**conclude:**
A higher learning rate (0.001) leads to faster learning early on, but causes more aggressive and unstable updates later.
This results in the agent being less consistent and unable to reach higher reward levels, compared to the more stable and reliable performance of the lower learning rate (0.0001).

#### MC REINFORCE

log data
- reward
- step (episode length)
- loss

Hyper parameter
- num_of_action
- action_range
- learning_rate
- n_episodes
- hidden_dim
- discount_factor
- dropout

In the MC REINFORCE experiment, the following hyperparameters were selected for tuning: learning rate, discount factor, dropout.

We did not vary hidden_dim in this experiment, as its impact has already been observed and analyzed in the DQN experiments. Since MC REINFORCE uses a similar neural network structure to DQN, we assume the effect of hidden_dim would be consistent.

##### experiment 1: changing learning rate
Learning rate controls how much each optimization step changes the existing weights.

In this experiment, the learning rate was set to 2 values: [0.0001, 0.001]

**Hypothesis:** A higher learning rate makes learning faster but increases variance.

**Reward, step, loss, epsilon graph:**
![MC_lr.png](img/MC_lr.png)

**conclude:**

A higher learning rate (0.001) led to faster and stronger learning, with the agent reaching significantly higher average rewards compared to 0.0001.
However, it also introduced more variance in training.

Compared to DQN, MC REINFORCE appears more sensitive to learning rate—the lower value (0.0001) resulted in very slow learning and poor performance.

##### experiment 2: changing discount factor
Discount factor determines how much the agent values future rewards compared to immediate ones.

In this experiment, the discount factor was set to 2 values: [0.95, 0.99]

Hypothesis: A higher discount factor (0.99) encourages long-term reward planning, leading to potentially higher returns, but may cause greater variance in learning

Reward, step, loss, epsilon graph:
![MC_d.png](img/MC_d.png)

conclude: 

Initially, both discount factors performed similarly. Over time, the higher discount factor (0.99) achieved better long-term rewards, but it also made the training less stable, with ups and downs and some big drops in reward.

This matches the hypothesis that a higher discount factor helps the agent plan for the future, but can also make learning more unstable.

##### experiment 3: changing dropout
Dropout is a regularization technique used to prevent overfitting by randomly deactivating neurons during training.

In this experiment, dropout was set to 2 values: [0.4, 0.5]

**Hypothesis:** A lower dropout rate keeps more neurons active, which may speed up learning but risks overfitting.

**Reward, step, loss, epsilon graph:**
![MC_dropout.png](img/MC_dropout.png)

**conclude:**
The graph shows that changing dropout between 0.4 and 0.5 had minimal effect on learning performance.
This suggests that MC REINFORCE is robust to minor dropout changes, or that the dropout-affected neurons may not have played a significant role in decision-making.

#### A2C

log data
- reward
- step (episode length)
- actor loss
- critic loss

Hyper parameter
- num_of_action
- action_range
- learning_rate
- hidden_dim
- n_episodes
- discount_factor
- batch_size
- buffer_size
- entropy_coef

In the A2C experiment, the following hyperparameters were selected for tuning: learning rate, discount factor, entropy_coef, hidden_dim.

##### experiment 1: changing learning rate
Learning rate controls how much each optimization step changes the existing weights.

In this experiment, the learning rate was set to 2 values: [0.00001, 0.00005]

**Hypothesis:** A higher learning rate makes learning faster but increases variance.

**Reward, step, loss, epsilon graph:**
![A2C_lr.png](img/A2C_lr.png)

**conclude:**
A higher learning rate (0.00005) helped the agent learn faster.
However, there was no big difference in stability between the two values.
This shows that A2C can learn well with both learning rates in this range.

##### experiment 2: changing discount factor
Discount factor determines how much the agent values future rewards compared to immediate ones.

In this experiment, the discount factor was set to 2 values: [0.95, 0.99]

Hypothesis: A higher discount factor (0.99) encourages long-term reward planning, leading to potentially higher returns, but may cause greater variance in learning

Reward, step, loss, epsilon graph:
![A2C_d.png](img/A2C_d.png)

conclude: 
The graph shows almost no difference between 0.95 and 0.99.
This means A2C is robust to small changes in the discount factor.

##### experiment 3: changing entropy_coef
The entropy coefficient encourages exploration by penalizing overconfidence in the policy.

In this experiment, entropy_coef was set to 2 values: [0.01, 0.05]

**Hypothesis:** A higher entropy coefficient means more exploration, which may slow down learning but help avoid bad policies.

**Reward, step, loss, epsilon graph:**
![A2C_en.png](img/A2C_en.png)

**conclude:** 
A higher entropy coefficient (0.05) made the agent explore more, but slowed down learning.
A lower entropy coefficient (0.01) helped the agent learn faster and get higher rewards.
So, too much exploration may hurt performance once the agent is already doing well.
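
To show what the entropy term rewards, the sketch below computes the entropy of a discrete policy with plain Python and the resulting bonus term `-entropy_coef * entropy` that enters the actor loss. The probability vectors are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.5, 0.5]    # maximally uncertain policy over two actions
peaked = [0.99, 0.01]   # nearly deterministic policy

for coef in (0.01, 0.05):
    for name, probs in (("uniform", uniform), ("peaked", peaked)):
        h = entropy(probs)
        # The actor loss contains -coef * h, so higher entropy lowers the loss.
        print(coef, name, round(h, 4), round(-coef * h, 5))
```

The uniform policy has entropy ln 2 (about 0.693) while the peaked one is close to zero, so a larger coefficient gives the optimizer a stronger incentive to keep the policy spread out, even late in training when a confident policy would score better.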

##### experiment 4: changing hidden_dim
Hidden dim controls the capacity of the neural network. A larger hidden dimension allows the model to represent more complex value functions and policies.

In this experiment, hidden_dim was set to 2 values: [64, 128]

**Hypothesis:** Increasing hidden_dim improves the model's ability to approximate the state value and the policy, leading to better learning performance.

**Reward, step, loss, epsilon graph:**
![A2C_hidden_dim.png](img/A2C_hidden_dim.png)

**conclude:** 
As hypothesized, the larger hidden size (128) helped the agent learn faster and reach slightly better rewards.
Bigger networks can help, but the difference wasn't huge in this case.

## Part 4: Evaluate Cart-Pole Agent performance.