# **Part 2: Setting up `Cart-Pole` Agent.**


- **`Name`** : **Pavaris Asawakijtananont**

- **`Number`** : **65340500037**

## **Configuration**
#### **Reward Function**
- The reward consists of 5 terms. Because the constant alive term dominates the small shaping penalties, the duration of an episode approximates its total reward

```python
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )
```
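
To make the relative scale of the five terms concrete, the sketch below evaluates the weighted sum for a single step. It is only an illustration: `step_reward` and its arguments are hypothetical names, the squared-error and absolute-value forms are assumed from the `_l2` and `_l1` suffixes of the term functions, and any per-step time scaling applied by the framework is ignored.

```python
def step_reward(pole_angle, cart_vel, pole_vel, terminated=False):
    """Weighted sum of the five reward terms for one step (illustrative only)."""
    alive = 0.0 if terminated else 1.0      # (1) weight +1.0
    failure = 1.0 if terminated else 0.0    # (2) weight -2.0
    pole_err = (pole_angle - 0.0) ** 2      # (3) squared distance to target 0.0
    return (
        1.0 * alive
        - 2.0 * failure
        - 1.0 * pole_err
        - 0.01 * abs(cart_vel)              # (4) cart velocity penalty
        - 0.005 * abs(pole_vel)             # (5) pole angular velocity penalty
    )

print(step_reward(0.0, 0.0, 0.0))   # perfectly balanced and motionless
print(step_reward(0.1, 0.5, 0.2))   # slightly off-balance and moving
```

Under these assumptions the shaping penalties stay well below the alive bonus for small angles and velocities, which is why the per-step reward remains close to 1.
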

## **Base Class**

##### **q**
- Return the action value estimated by the linear approximator used in Linear Q Learning; when no action is given, the values of all actions are returned

```python
    def q(self, obs, a=None):
        """Returns the linearly-estimated Q-value for a given state and action."""
        obs_val = obs['policy'][0].detach().cpu().numpy()
        if a is None:
            # Get q values from all action in state
            return np.dot(obs_val, self.w)
        else:
            # Get q values given action & state
            return np.dot(obs_val, self.w[:, a])
        # ====================================== #
```
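
A minimal, dependency-free sketch of the same computation may help to see the shapes involved. The observation and the weight matrix below are made-up numbers, and `q_values` is a hypothetical helper, not part of the class above; it assumes the weights are stored with one row per observation feature and one column per action.

```python
def q_values(obs, w):
    """Return one Q-value per action: Q(s, a) = sum_i obs[i] * w[i][a]."""
    n_actions = len(w[0])
    return [sum(x * row[a] for x, row in zip(obs, w)) for a in range(n_actions)]

# Hypothetical 4-dimensional observation and a weight matrix of shape (4, 3)
obs = [0.5, -1.0, 0.0, 2.0]
w = [
    [0.1, 0.0, -0.1],
    [0.0, 0.2, 0.0],
    [0.3, 0.3, 0.3],
    [0.0, 0.1, 0.2],
]

all_q = q_values(obs, w)   # plays the role of np.dot(obs_val, self.w)
single_q = all_q[2]        # plays the role of np.dot(obs_val, self.w[:, 2])
print(all_q, single_q)
```
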

##### **Scale Action**
```python
    def scale_action(self, action):
        return torch.tensor([[action * ((self.action_range[1] - self.action_range[0]) / (self.num_of_action-1 )) + self.action_range[0]]])
```
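
The mapping from a discrete action index to a continuous command can be checked with a standalone version of the same formula. This is a sketch with a hypothetical free function; it assumes at least two actions, since the step size divides by `num_of_action - 1`.

```python
def scale_action(action, num_of_action, action_range):
    """Map index 0..num_of_action-1 linearly onto [action_range[0], action_range[1]]."""
    low, high = action_range
    step = (high - low) / (num_of_action - 1)
    return action * step + low

scaled = [scale_action(a, 5, (-2.5, 2.5)) for a in range(5)]
print(scaled)
```

Index 0 maps to the lower bound and the last index maps to the upper bound, with evenly spaced values in between.
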

##### **Select Action**
- Select the action with an **$\epsilon$-greedy** policy: with probability **$\epsilon$** explore by choosing a random action, otherwise exploit by taking the action with the maximum action value

```python
    def select_action(self, state):
        """ Select an action based on an epsilon-greedy policy. """
        if np.random.rand() < self.epsilon:
            return np.random.randint(0, self.num_of_action)
        else:
            # Exploitation: choose the action with the highest estimated Q-value
            return np.argmax(self.q(state))
```
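
The two branches can be exercised in isolation with the sketch below. It uses only the standard library, and the function name, the `rng` argument, and the Q-values are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit

rng = random.Random(0)
q = [0.1, 0.9, 0.3]
greedy = [epsilon_greedy(q, 0.0, rng) for _ in range(20)]    # epsilon = 0: never explores
explore = [epsilon_greedy(q, 1.0, rng) for _ in range(200)]  # epsilon = 1: always explores
print(set(greedy), set(explore))
```
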

##### **Decay Epsilon**
- Decay epsilon linearly to shift gradually from exploration to exploitation, never going below `final_epsilon`

```python
    def decay_epsilon(self):
        """ Decay epsilon value to reduce exploration over time. """
        self.epsilon = max(self.final_epsilon, self.epsilon-self.epsilon_decay)
```
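
To see how long exploration lasts, the schedule can be simulated on its own. The sketch below assumes one decay per episode and reuses the default values that appear in the constructors of this document (`1.0`, `1e-3`, `0.001`); `decay_schedule` is a hypothetical helper.

```python
def decay_schedule(initial_epsilon, epsilon_decay, final_epsilon, episodes):
    """Return the epsilon used in each episode under linear decay with a floor."""
    eps = initial_epsilon
    history = []
    for _ in range(episodes):
        history.append(eps)
        eps = max(final_epsilon, eps - epsilon_decay)
    return history

hist = decay_schedule(1.0, 1e-3, 0.001, 1200)
print(hist[0], hist[500], hist[-1])
```

With these values epsilon falls by 0.001 per episode and reaches the floor after roughly 1000 episodes, so a much shorter training run would still be acting mostly at random when it ends.
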

## **Linear Q Learning**


##### **Constructor**

- Initialize the Linear Q Learning class with the following hyperparameters:
    - Learning rate
    - Initial Epsilon
    - Epsilon Decay
    - Final Epsilon
    - Discount Factor
    
```python
class Linear_QN(BaseAlgorithm):
    def __init__(
            self,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            learning_rate: float = 0.01,
            initial_epsilon: float = 1.0,
            epsilon_decay: float = 1e-3,
            final_epsilon: float = 0.001,
            discount_factor: float = 0.95,
    ) -> None:
```

##### **Updating**
- Update the weights by gradient descent. For a linear approximator, the gradient of the action value with respect to the weights of the chosen action is the state (observation) vector
- The error term uses the maximum action value of the next state as the target, as in Q-learning


```python
    def update(self,obs,action: int,reward: float,next_obs,next_action: int,terminated: bool
    ):
        """
        Updates the weight vector using the Temporal Difference (TD) error 
        in Q-learning with linear function approximation.
        """
        q_curr = self.q(obs=obs, a=action)
        if terminated:
            # No bootstrapping from a terminal state
            target = reward
        else:
            # Q-learning target: bootstrap from the greedy action in the next state
            target = reward + self.discount_factor * np.max(self.q(next_obs))
    
        error = target - q_curr
        self.training_error.append(error)
        # Gradient descent update
        self.w[:, action] += self.lr * error * obs['policy'][0].detach().cpu().numpy()
```
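
A worked single step makes the update rule easier to verify by hand. The following is a sketch with plain Python lists instead of NumPy arrays; `td_update`, the zero-initialized weights, and the numbers are assumptions chosen so the result can be checked mentally.

```python
def td_update(w, obs, action, reward, next_obs, terminated, lr=0.01, gamma=0.95):
    """One Q-learning step for a linear approximator; w has shape (n_obs, n_actions)."""
    n_actions = len(w[0])

    def q(o, a):
        return sum(x * row[a] for x, row in zip(o, w))

    if terminated:
        target = reward
    else:
        target = reward + gamma * max(q(next_obs, a) for a in range(n_actions))
    error = target - q(obs, action)
    for i, x in enumerate(obs):
        w[i][action] += lr * error * x   # gradient of Q(s, a) w.r.t. w[:, a] is obs
    return error

w = [[0.0, 0.0], [0.0, 0.0]]
err = td_update(w, obs=[1.0, 2.0], action=1, reward=1.0, next_obs=[0.5, 0.5], terminated=False)
print(err, w)
```

With all weights at zero the target equals the reward, so the error is 1 and only the column of the chosen action moves, by `lr * obs`.
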

##### **Learn**
- Run one episode of interaction with the environment and update the weights at every timestep, with the observation acting as the gradient

```python
    def learn(self, env):
        """
        Train the agent for a single episode.
        """
        obs, _ = env.reset()
        cumulative_reward = 0.0
        done = False
        step = 0
        while not done:
            action = self.select_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(self.scale_action(action))
            reward_value = reward.item()
            terminated_value = terminated.item() 
            cumulative_reward += reward_value
            done = terminated or truncated
            self.update(
                obs=obs,
                action=action,
                reward=reward_value,
                next_obs=next_obs,
                next_action=action,
                terminated=terminated_value
            )
            obs = next_obs
            step += 1
        self.decay_epsilon()
        return cumulative_reward , step
```

## **Deep Q Network**

##### **Neural Network**
- Set up a neural network that approximates the action values from the observation
- The network is fully connected with one hidden layer
- The `forward` function computes the estimated value of every action

```python

class DQN_network(nn.Module):
    """ Neural network model for the Deep Q-Network algorithm. """
    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(DQN_network, self).__init__()
        self.fc1 = nn.Linear(n_observations, hidden_size) # input -> hidden layer
        self.fc2 = nn.Linear(hidden_size, n_actions) # hidden -> output layer (one Q-value per action)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """ Forward pass through the network."""
        val = F.relu(self.fc1(x))
        val = self.dropout(val)
        # No ReLU or dropout on the output: Q-values may be negative and must not be zeroed out
        return self.fc2(val)
```

##### **Constructor**

- Initialize the variables for the Deep Q Network
    - `tau` : interpolation constant for the soft update of the target network
    - `hidden_dim` : number of neurons in the hidden layer
    - `learning_rate` : step size used by the optimizer
    - `dropout` : probability of zeroing out a neuron during training
    - `buffer_size` : maximum number of transitions stored in the replay buffer
    - `batch_size` : number of transitions sampled for each network update
    
```python
class DQN(BaseAlgorithm):
    def __init__(
            self,
            device = None,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            n_observations: int = 4,
            hidden_dim: int = 64,
            dropout: float = 0.5,
            learning_rate: float = 0.005,
            tau: float = 0.005,
            initial_epsilon: float = 1.0,
            epsilon_decay: float = 1e-3,
            final_epsilon: float = 0.001,
            discount_factor: float = 0.95,
            buffer_size: int = 1000,
            batch_size: int = 1,
    ) -> None:
```



##### **Calculate Loss**

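- Calculate the DQN loss as the squared error between the TD target $y_j$ and the policy network estimate, where the target is computed with the target network (parameters $\theta^-$)

$$
L = \left(y_j - Q(\phi_j , a_j ; \theta)\right)^2, \qquad
y_j = \begin{cases} r_j & \text{if } \phi_{j+1} \text{ is terminal} \\ r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1} , a' ; \theta^-) & \text{otherwise} \end{cases}
$$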

```python
    def calculate_loss(self, non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch):
        """ Computes the loss for policy optimization. """
        q = self.policy_net(state_batch).gather(1, action_batch) # [batch_size, 1]
        q_next = torch.zeros(size=(self.batch_size , self.num_of_action), device=self.device)
        if non_final_next_states.size(0) > 0:
            with torch.no_grad():
                q_next_values = self.target_net(non_final_next_states).detach()
                q_next[non_final_mask.squeeze()] = q_next_values # Define Next Q value from next state , squeeze make dimension [batch_size , 1] to [batch_size]
        q_expected = (torch.max(q_next , dim=1)[0].unsqueeze(1) * self.discount_factor) + reward_batch # Find Maximum Q Value over action : Dimension
        loss = F.mse_loss(target=q_expected,input=q) # tensor(0.6990, device='cuda:0', grad_fn=<MseLossBackward0>)
        return loss
```
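
The target computation, including the zero bootstrap for terminal transitions, can be reproduced on a tiny hand-made batch. This is a sketch in plain Python with hypothetical helper names and invented numbers; it mirrors the masking idea of the method above rather than its tensor code.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.95):
    """y_j = r_j if terminal, else r_j + gamma * max_a' Q_target(s_{j+1}, a')."""
    return [r if d else r + gamma * max(nq)
            for r, nq, d in zip(rewards, next_q_values, dones)]

def mse(pred, target):
    """Mean squared error over the batch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rewards = [1.0, 1.0]
next_q = [[0.5, 2.0], [3.0, 1.0]]   # target-network outputs for the next states
dones = [False, True]               # the second transition ends the episode
q_taken = [2.0, 1.5]                # policy-network Q(s, a) for the actions taken

targets = dqn_targets(rewards, next_q, dones)
loss = mse(q_taken, targets)
print(targets, loss)
```

The second target ignores its next-state values entirely because the transition is terminal, which is the role of `non_final_mask` in the tensor version.
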

##### **Generate Sample**
- Draw a random sample of `batch_size` state transitions from the replay memory and convert it into tensors for the update

```python
    def generate_sample(self, batch_size):
        """
        Generates a batch sample from memory for training.

        Returns:
            Tuple: A tuple containing:
                - non_final_mask (Tensor): A boolean mask indicating which states are non-final.
                - non_final_next_states (Tensor): The next states that are not terminal.
                - state_batch (Tensor): The batch of current states.
                - action_batch (Tensor): The batch of actions taken.
                - reward_batch (Tensor): The batch of rewards received.
        """
        # Ensure there are enough samples in memory before proceeding
        # sample for training with batch size
        if len(self.memory) < batch_size:
            return None
        batch = self.memory.sample()         
        # ========= put your code here ========= #)
        state_batch = torch.stack([torch.tensor(batch[i].state, dtype=torch.float) for i in range(self.batch_size)]).to(self.device)
        next_states_batch = torch.stack([torch.tensor(batch[i].next_state, dtype=torch.float) for i in range(self.batch_size)]).to(self.device)
        action_batch = torch.stack([torch.tensor(batch[i].action, dtype=torch.int64) for i in range(self.batch_size)]).to(self.device)
        reward_batch = torch.stack([torch.tensor(batch[i].reward, dtype=torch.float) for i in range(self.batch_size)]).to(self.device)
        non_final_mask = torch.stack([torch.tensor(not batch[i].done, dtype=torch.bool) for i in range(self.batch_size)]).to(self.device)
        non_final_next_states = next_states_batch[non_final_mask]
        # Return All dimension : [batch_size , 1]
        return (non_final_mask.unsqueeze(1), non_final_next_states.squeeze(1), state_batch.squeeze(1), action_batch, reward_batch.unsqueeze(1))
```
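
The sampling code reads the fields `state`, `action`, `reward`, `next_state` and `done` from each sampled item. A replay memory with that interface could look like the sketch below; the class, its constructor arguments, and the `Transition` record are assumptions for illustration and are not taken from the project code.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition record with the fields the sampling code reads
Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Fixed-size FIFO memory; the oldest transitions are dropped first."""

    def __init__(self, buffer_size, batch_size):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement
        return random.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for t in range(5):
    buffer.add([float(t)], 0, 1.0, [float(t + 1)], done=(t == 4))

batch = buffer.sample()
non_final_mask = [not tr.done for tr in batch]
print(len(buffer), non_final_mask)
```
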

##### **Update Policy Network**
- Update the policy network by gradient descent: compute the loss on a sampled batch, backpropagate, and take one optimizer step

```python
    def update_policy(self):
        if len(self.memory) < self.batch_size:
            return
        sample = self.generate_sample(self.batch_size)
        if sample is None:
            return
        non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch = sample
        loss = self.calculate_loss(non_final_mask, non_final_next_states, state_batch, action_batch, reward_batch) # tensor(0.7219, device='cuda:0', grad_fn=<MseLossBackward0>)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```

##### **Update Target Network**
- Soft-update the target network so that it follows the policy network slowly instead of moving in lockstep with it; `tau` controls the ratio between policy-network and target-network weights

```python
    def update_target_networks(self):
        target_net_state_dict = self.target_net.state_dict() # get target network weights
        policy_net_state_dict = self.policy_net.state_dict()
        for key in target_net_state_dict:
            target_net_state_dict[key] = self.tau * policy_net_state_dict[key] + (1.0 - self.tau) * target_net_state_dict[key]
        self.target_net.load_state_dict(target_net_state_dict)
```
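
How slowly the target follows the policy for a given `tau` can be checked with a scalar stand-in for the weights. This is a sketch: the dictionaries below hold a single hypothetical parameter, and the policy weights are held fixed to isolate the effect of the update.

```python
def soft_update(target, policy, tau):
    """Return tau * policy + (1 - tau) * target for every parameter."""
    return {k: tau * policy[k] + (1.0 - tau) * target[k] for k in target}

target = {"w": 0.0}
policy = {"w": 1.0}
for _ in range(100):
    target = soft_update(target, policy, tau=0.005)

# With fixed policy weights the remaining gap shrinks by (1 - tau) per update
expected = 1.0 - (1.0 - 0.005) ** 100
print(target["w"], expected)
```

After 100 updates with `tau = 0.005` the target has covered only about 39% of the distance to the policy weights.
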


## **MC REINFORCE**

##### **Neural Network**

```python
class MC_REINFORCE_network(nn.Module):
    """ Neural network for the MC_REINFORCE algorithm. """

    def __init__(self, n_observations, hidden_size, n_actions, dropout):
        super(MC_REINFORCE_network, self).__init__()
        self.fc1 = nn.Linear(n_observations, hidden_size) # input -> hidden layer
        self.fc2 = nn.Linear(hidden_size, n_actions) # hidden -> output layer (one score per action)
        self.softmax = nn.Softmax(dim=1)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        """ Forward pass through the network. """
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.softmax(x)
        return x

```

##### **Constructor**
- Initialize the `MC_REINFORCE` class. Most arguments are the same as in Linear Q Learning, but there are no epsilon parameters because actions are sampled from the policy network

```python
class MC_REINFORCE(BaseAlgorithm):
    def __init__(
            self,
            device = None,
            num_of_action: int = 2,
            action_range: list = [-2.5, 2.5],
            n_observations: int = 4,
            hidden_dim: int = 64,
            dropout: float = 0.5,
            learning_rate: float = 0.01,
            discount_factor: float = 0.95,
    ) -> None:
        """
        Initialize the CartPole Agent.
        """     
        self.LR = learning_rate

        self.policy_net = MC_REINFORCE_network(n_observations, hidden_dim, num_of_action, dropout).to(device)
        self.optimizer = optim.AdamW(self.policy_net.parameters(), lr=learning_rate)
        self.device = device
        self.steps_done = 0
        self.episode_durations = []
        super(MC_REINFORCE, self).__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            discount_factor=discount_factor,
        )
```

##### **Calculate Return**
- Calculate the discounted return at every timestep from the rewards and the discount factor, then normalize the result
```python
    def calculate_stepwise_returns(self, rewards):
        """
        Compute stepwise returns for the trajectory.

        Args:
            rewards (list): List of rewards obtained in the episode.
        
        Returns:
            list[float]: L2-normalized stepwise returns, one entry per timestep.
        """
        stepwise_return = 0
        stepwise_return_arr = []
        # Accumulate the discounted return backwards: G_t = r_t + gamma * G_{t+1}
        for r in reversed(rewards):
            stepwise_return = stepwise_return*self.discount_factor + r
            stepwise_return_arr.append(stepwise_return)
        # F.normalize rescales the vector to unit L2 norm (it does not subtract the mean)
        tensor_norm = F.normalize(input=torch.tensor(list(reversed(stepwise_return_arr))),dim=0)
        return tensor_norm.tolist() # e.g. [-0.1740, -0.1021, 0.3525, 0.4109, 0.4675, 0.5201]

```
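
The backward recursion and the normalization can be verified on a three-step episode. The sketch below uses only the standard library; `l2_normalize` imitates the default behaviour of `F.normalize` (division by the L2 norm with a small floor), and the rewards and discount are invented for the example.

```python
import math

def stepwise_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out

def l2_normalize(values, eps=1e-12):
    """Divide by the L2 norm, guarding against division by zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / max(norm, eps) for v in values]

returns = stepwise_returns([1.0, 1.0, 1.0], gamma=0.5)
normalized = l2_normalize(returns)
print(returns, normalized)
```

Note that this rescales the returns to unit length but keeps their signs and relative sizes; it is not the same as subtracting the mean and dividing by the standard deviation.
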

##### **Generate Trajectory**
- Generate a trajectory (one episode) to create the samples for the update
```python
    def generate_trajectory(self, env):
        """
        Generate a trajectory by interacting with the environment.

        Args:
            env: The environment object.
        
        Returns:
            Tuple: (episode_return, stepwise_returns, log_prob_actions, trajectory)
        """
        # ===== Initialize trajectory collection variables ===== #
        # Reset environment to get initial state (tensor)
        # Store state-action-reward history (list)
        # Store log probabilities of actions (list)
        # Store rewards at each step (list)
        # Track total episode return (float)
        # Flag to indicate episode termination (boolean)
        # Step counter (int)
        # ========= put your code here ========= #
        obs , _  = env.reset()
        state_hist = []
        reward_hist = []
        action_hist = []
        log_prob_action_hist = []
        episode_return_hist = 0
        timestep = 0
        cumulative_reward = 0
        done = False
        # ====================================== #
        
        # ===== Collect trajectory through agent-environment interaction ===== #
        # In Episode
        while not done:
            
            # Predict action from the policy network
            # State into policy to return probability of each action
            prob_each_action = self.policy_net(obs['policy']) # > tensor([[0.1380, 0.1534, 0.1328, 0.1328, 0.1656, 0.1328, 0.1446]],device='cuda:0', grad_fn=<SoftmaxBackward0>)
            # Change to Probability Distribution
            prob_cat = torch.distributions.Categorical(prob_each_action) # > Categorical(probs: torch.Size([1, 7]))
            action_idx = prob_cat.sample() # > tensor([1], device='cuda:0')

            # Execute action in the environment and observe next state and reward
            next_obs, reward, terminated, truncated, _ = env.step(self.scale_action(action_idx))  # Step Environment
            reward_value = reward.item() # > int : 1
            terminated_value = terminated.item() 
            cumulative_reward += reward_value
            done = terminated or truncated

            # Store action log probability reward and trajectory history
            reward_hist.append(reward_value)
            state_hist.append(obs)
            log_prob_action_hist.append(prob_cat.log_prob(action_idx)) # Collect in list and reduce dimension and change to list
            
            # Update state
            obs = next_obs
            timestep += 1
            if done:
                self.plot_durations(timestep)
                break

        # ===== Stack log_prob_actions &  stepwise_returns ===== #
        stepwise_returns = self.calculate_stepwise_returns(rewards=reward_hist)
 
        self.episode_durations.append(timestep)
        self.rewards.append(cumulative_reward)
        return (cumulative_reward , stepwise_returns , log_prob_action_hist , state_hist)
```

##### **Calculating Loss**
```python
    def calculate_loss(self, stepwise_returns, log_prob_actions):
        """
        Compute the loss for policy optimization.
        Args:
            stepwise_returns (List): Stepwise returns for the trajectory. : Dim list = [n]
            log_prob_actions (tensor): Log probabilities of actions taken. : Dim list = [n] : n is tensor contain with prob
        
        Returns:
            Tensor: Computed loss.
        """
        log_probs = torch.stack(log_prob_actions).flatten()
        # The returns may arrive as a Python list; match the device and dtype of the log-probs
        returns = torch.as_tensor(stepwise_returns, dtype=log_probs.dtype, device=log_probs.device)
        loss = -(log_probs * returns).mean()
        return loss # > tensor(2.5966) : Scalar
```
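
The loss is the negative mean of log-probability times return, so minimizing it raises the probability of actions that were followed by high returns. A two-step numeric check, written as a sketch in plain Python with invented probabilities and returns:

```python
import math

def reinforce_loss(log_probs, returns):
    """Negative mean of log pi(a_t | s_t) * G_t over the trajectory."""
    return -sum(lp * g for lp, g in zip(log_probs, returns)) / len(log_probs)

log_probs = [math.log(0.5), math.log(0.25)]   # log-probabilities of the actions taken
returns = [1.0, 0.5]                          # stepwise returns for those actions
loss = reinforce_loss(log_probs, returns)
print(loss)
```

Here both products equal `-ln 2`, so the loss is `ln 2`.
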

##### **Updating Policy**

```python
    def update_policy(self, stepwise_returns, log_prob_actions):
        """
        Update the policy using the calculated loss.

        Args:
            stepwise_returns (Tensor): Stepwise returns.
            log_prob_actions (Tensor): Log probabilities of actions taken.
        
        Returns:
            float: Loss value after the update.
        """
        loss = self.calculate_loss(stepwise_returns=stepwise_returns , log_prob_actions=log_prob_actions) # get tensor loss value
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```

## **PPO**
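
Two quantities drive the listing in this section: the clipped surrogate used in `update_policy` and the generalized advantage estimate computed in `calculate_advantage`. The sketch below restates both for scalars and plain lists so they can be checked by hand. The function names and the example numbers are assumptions; the `done` handling follows the listing, where the flag stored at step `t` masks the bootstrap for step `t`.

```python
import math

def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps_clip=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped * advantage)

def gae(rewards, values, dones, last_value, gamma=0.95, lam=1.0):
    """Generalized advantage estimation for a single environment."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    return advantages

gain_pos = clipped_surrogate(math.log(1.5), math.log(1.0), advantage=1.0)
gain_neg = clipped_surrogate(math.log(1.5), math.log(1.0), advantage=-1.0)
adv = gae([1.0, 1.0], [0.0, 0.0], [False, True], last_value=5.0, gamma=0.5, lam=1.0)
print(gain_pos, gain_neg, adv)
```

With a probability ratio of 1.5, a positive advantage is capped at the clipped value 1.2, while a negative advantage keeps the unclipped penalty. In the GAE example the final step is terminal, so `last_value` does not contribute.
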
```python
import random
import os
from collections import deque, namedtuple
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from torch.distributions.normal import Normal
from torch.nn.functional import mse_loss
from RL_Algorithm.RL_base_function import BaseAlgorithm

class RolloutBuffer():
    def __init__(self , buffer_size , n_envs):
        self.n_envs = n_envs
        self.buffer_size = buffer_size
        self.memory = deque(maxlen=buffer_size)
        self.advantages = torch.zeros((self.buffer_size, self.n_envs), dtype=torch.float32)
    def add(self, state, action, reward, log_prob, values, done):
        # Detach to avoid carrying the computation graph
        self.memory.append((
            state.detach(), 
            action.detach(), 
            reward.detach(), 
            log_prob.detach(),
            values.detach(), 
            done.detach() if isinstance(done, torch.Tensor) else done
        ))
        
    def __len__(self):
        return len(self.memory)
    
    def sample_all_env(self , batch_size:int):
        '''
        Return random Transition with number of batch size from all environment 
        '''
        return random.sample(self.memory, batch_size)
    
    def sample_batch(self , batch_size:int):

        states, actions, rewards, log_probs_old, values, dones = zip(*self.memory)
        
        states        = torch.cat(states, dim=0) # > change to tensor
        actions       = torch.cat(actions, dim=0)
        rewards       = torch.cat(rewards, dim=0)
        log_probs_old = torch.cat(log_probs_old , dim=0)
        values        = torch.cat(values, dim=0)
        dones         = torch.cat(dones, dim=0)
        advantages    = self.advantages.flatten()

        random_indices = torch.randperm(len(states))[:batch_size]
        return states[random_indices] , actions[random_indices] , rewards[random_indices], log_probs_old[random_indices] , values[random_indices] , dones[random_indices] , advantages[random_indices]
    
class Actor(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, learning_rate=1e-4):
        """
        Actor network for policy approximation.

        Args:
            input_dim (int): Dimension of the state space.
            hidden_dim (int): Number of hidden units in layers.
            output_dim (int): Dimension of the action space.
            learning_rate (float, optional): Learning rate for optimization. Defaults to 1e-4.
        """
        super(Actor, self).__init__()

        self.fc1 = nn.Linear(input_dim, hidden_dim) # Input to hidden layer
        self.fc2 = nn.Linear(hidden_dim, hidden_dim) # hidden to hidden layer
        
        self.actor_head = nn.Linear(hidden_dim, output_dim) # hidden -> output layer (one score per action)
        
        self.softmax = nn.Softmax(dim=1)

        self.init_weights()

    def init_weights(self):
        """
        Initialize network weights using Xavier initialization for better convergence.
        """
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)  # Xavier initialization
                nn.init.zeros_(m.bias)  # Initialize bias to 0
    def forward(self, state):
        """
        Forward pass for action selection.

        Args:
            state (Tensor): Current state of the environment.

        Returns:
            Tensor: Probability of each discrete action.
        """
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))

        actor_out = F.relu(self.actor_head(x))
        actor_prob = self.softmax(actor_out)

        return actor_prob
    
class Critic(nn.Module):
    def __init__(self, input_dim, hidden_dim, learning_rate=1e-4):
        """
        Critic network for state-value approximation.

        Args:
            input_dim (int): Dimension of the state space.
            hidden_dim (int): Number of hidden units in layers.
            learning_rate (float, optional): Learning rate for optimization. Defaults to 1e-4.
        """
        super(Critic, self).__init__()

        self.fc1 = nn.Linear(input_dim, hidden_dim) # Input to hidden layer
        self.fc2 = nn.Linear(hidden_dim, hidden_dim) # hidden to hidden layer
        
        self.critic_head = nn.Linear(hidden_dim , 1)
        
        self.init_weights()

    def init_weights(self):
        """
        Initialize network weights using Xavier initialization for better convergence.
        """
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)  # Xavier initialization
                nn.init.zeros_(m.bias)  # Initialize bias to 0

    def forward(self, state):
        """
        Forward pass for state-value estimation.

        Args:
            state (Tensor): Current state of the environment.

        Returns:
            Tensor: Estimated state value, one scalar per input state.
        """
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))

        critic_out = self.critic_head(x)

        return critic_out
class PPO(BaseAlgorithm):
    def __init__(self, 
                device = None, 
                num_of_action: int = 2,
                action_range: list = [-2.5, 2.5],
                n_observations: int = 4,
                hidden_dim = 256,
                dropout = 0.05, 
                learning_rate: float = 0.01,
                buffer_size: int = 256,
                batch_size: int = 1,
                discount_factor: float = 0.95,
                lamda : float = 1,
                nun_envs : int = 1,
                eps_clip : float = 0.2,
                critic_loss_coeff : float = 0.5,
                entropy_loss_coeff : float = 0.1,
                epoch : int = 20
                ):
        """
        Proximal Policy Optimization (PPO) with separate actor and critic networks.

        Args:
            device (str): Device to run the model on ('cpu' or 'cuda').
            num_of_action (int, optional): Number of possible actions. Defaults to 2.
            action_range (list, optional): Range of action values. Defaults to [-2.5, 2.5].
            n_observations (int, optional): Number of observations in state. Defaults to 4.
            hidden_dim (int, optional): Hidden layer dimension. Defaults to 256.
            learning_rate (float, optional): Learning rate. Defaults to 0.01.
            lamda (float, optional): Lambda used in the advantage estimate. Defaults to 1.
            discount_factor (float, optional): Discount factor for Q-learning. Defaults to 0.95.
            batch_size (int, optional): Size of training batches. Defaults to 1.
            buffer_size (int, optional): Replay buffer size. Defaults to 256.
        """
        # Feel free to add or modify any of the initialized variables above.
        # ========= put your code here ========= #
        self.device = device
        self.actor = Actor(n_observations, hidden_dim, num_of_action, learning_rate).to(device)
        self.critic = Critic(n_observations, hidden_dim, learning_rate).to(device)
        self.batch_size = batch_size
        self.lamda = lamda
        self.rollout_buffer = RolloutBuffer(buffer_size=buffer_size , n_envs =nun_envs)
        self.discount_factor = discount_factor
        self.eps_clip = eps_clip
        self.num_envs = nun_envs
        self.critic_loss_coeff = critic_loss_coeff
        self.entropy_loss_coeff = entropy_loss_coeff
        self.epoch = epoch
        self.optimizer = optim.AdamW(list(self.actor.parameters()) + list(self.critic.parameters()), lr=learning_rate, amsgrad=True)

        # Experiment with different values and configurations to see how they affect the training process.
        # Remember to document any changes you make and analyze their impact on the agent's performance.

        pass
        # ====================================== #

        super(PPO, self).__init__(
            num_of_action=num_of_action,
            action_range=action_range,
            learning_rate=learning_rate,
            buffer_size=buffer_size,
            batch_size=batch_size,
        )
        # set up matplotlib
        self.is_ipython = 'inline' in matplotlib.get_backend()
        if self.is_ipython:
            from IPython import display

    def select_action(self, prob_each_action, noise=0.0):
        """
        Samples an action from the categorical distribution given by the actor.

        Args:
            prob_each_action (Tensor): Action probabilities from the actor, shape [num_env, num_action].
            noise (float, optional): Currently unused. Defaults to 0.0.

        Returns:
            Tuple[Tensor, Tensor, Tensor, Tensor]:
                - action_idx: Sampled action index, shape [num_env].
                - action_prob: Probability of the sampled action, shape [num_env].
                - log_prob: Log probability of the sampled action, shape [num_env].
                - entropy: Entropy of the action distribution, shape [num_env].
        """
        # Change to Probability Distribution
        dist = torch.distributions.Categorical(prob_each_action) # > Categorical(probs: torch.Size([1, 7]))
        action_idx = dist.sample() # > tensor([1], device='cuda:0')
        action_prob = dist.probs.gather(1, action_idx.unsqueeze(1)).squeeze(1)  # shape: [num_env]
        log_prob = dist.log_prob(action_idx) 
        entropy = dist.entropy()
        # All four outputs have shape [num_env]
        return action_idx , action_prob ,log_prob , entropy 

    def scale_action(self, action):
        """
        Maps a discrete action in range [0, n] to a continuous value in [action_min, action_max].

        Args:
            action (int): Discrete action in range [0, n].
            n (int): Number of discrete actions (inclusive range from 0 to n).
        
        Returns:
            torch.Tensor: Scaled action tensor.
        """
        # ========= put your code here ========= #
        # print("----------------")
        # print(action)
        scale_factor = (self.action_range[1] - self.action_range[0]) / (self.num_of_action-1 )
        scaled_action = action * scale_factor + self.action_range[0]
        return scaled_action.view(-1, 1) 
    
    def update_policy(self , memory):
        """
        Update the policy using the calculated loss.

        Returns:
            float: Loss value after the update.
        """
        states, actions, rewards, log_probs_old, values_old, dones , advantages = memory
        # Re-evaluate the critic on the sampled states; this keeps the graph for the critic loss
        values = self.critic(states).squeeze(-1) # > [batch_size]

        # Regression target built from the stored rollout values and detached, so the critic loss
        # has a gradient (advantages + current values would cancel out inside the MSE).
        # Note: advantages are normalized in calculate_advantage, so this target is approximate.
        returns = (advantages + values_old.reshape(-1)).detach()

        probs = self.actor(states)                  # Get new action probabilities.
        dist = torch.distributions.Categorical(probs)
        log_probs_new = dist.log_prob(actions.squeeze())

        # Actor Loss
        ratio = torch.exp(log_probs_new - log_probs_old)
        surr1 = ratio*advantages
        surr2 = torch.clamp(ratio , 1.0-self.eps_clip , 1.0+self.eps_clip)*advantages
        actor_loss = -torch.min(surr1,surr2).mean()

        # Critic Loss
        critic_loss = F.mse_loss(values, returns)

        # Entropy bonus
        entropy_bonus = dist.entropy().mean()

        # Final Loss: the entropy bonus is subtracted so that minimizing the loss encourages exploration
        loss = actor_loss + self.critic_loss_coeff*critic_loss - self.entropy_loss_coeff * entropy_bonus
        # Perform backpropagation and optimizer step.
        self.optimizer.zero_grad()
        loss.backward()
        # Optionally clip gradients here if needed.
        self.optimizer.step()
        return loss.item() , actor_loss.item() , critic_loss.item() , entropy_bonus.item()

    def calculate_advantage(self , rewards , dones , last_values):
        states, actions, rewards, log_probs_old, values , dones = zip(*self.rollout_buffer.memory)
        # Convert to tensors
        rewards = torch.stack(rewards).to(self.device)
        values = torch.stack(values).to(self.device).squeeze(2)
        dones = torch.stack(dones).to(self.device).float()
        last_values = last_values.flatten()
        
        # Dimension : [buffer_size , number_envs]
        # torch.Size([500, 64])
        # torch.Size([500, 64])
        # torch.Size([500, 64])
        # torch.Size([64])

        T_step, n_envs = rewards.shape
        advantages = torch.zeros((T_step, n_envs), dtype=torch.float32, device=self.device)
        gae = torch.zeros(n_envs, dtype=torch.float32, device=self.device)

        for t in reversed(range(T_step)):
            mask = 1.0 - dones[t]  # [n_envs]
            next_value = last_values if t == T_step - 1 else values[t + 1]  # [n_envs]
            delta = rewards[t] + self.discount_factor * next_value * mask - values[t]
            gae = delta + self.discount_factor * self.lamda * mask * gae
            advantages[t] = gae
        advantages = (advantages - advantages.mean())/(advantages.std() + 1e-8)
        self.rollout_buffer.advantages = advantages

    
    # def compute_returns_and_advantage(self , last_values : torch.Tensor , dones : np.ndarray) -> None:
    #     last_values = last_values.clone().cpu().numpy().flatten()
    #     dones = dones.cpu().numpy()
    #     last_gae_lam = 0
    #     for step in reversed(range(self.buffer_size)):
    #         if step == self.buffer_size - 1: # Use real last value
    #             next_non_terminal = 1.0 - dones.astype(np.float32)
    #             next_values = last_values
    #         else:
    #             next_non_terminal = 1.0 - self.episode_starts[step + 1]
    #             next_values = self.values[step + 1]
    #         delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
    #         last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
    #         self.advantages[step] = last_gae_lam
    #         # print("debugging.....")

    def train(self , env , max_steps = 1000):
        obs , _  = env.reset()
        num_envs = obs['policy'].shape[0]

        steps_per_env = torch.zeros(num_envs, dtype=torch.int, device=obs['policy'].device)
        cumulative_reward_per_env = torch.zeros(num_envs, dtype=torch.float, device=self.device)

        time_step_buffer = deque(maxlen=10)
        reward_buffer = deque(maxlen=10)

        reward_avg = 0
        time_avg = 0

        loss = 0

        cumulative_reward = 0
        done = False
        # ====================================== #

        for step in range(max_steps):
            # Predict action from the policy network
            prob_each_action = self.actor(obs['policy']) 
            action_idx , action_prob , log_prob , entropy  = self.select_action(prob_each_action=prob_each_action) # > tensor([4], device='cuda:0')
            values = self.critic(obs['policy'])
            # Execute action in the environment and observe next state and reward
            next_obs, reward, terminated, truncated, _ = env.step(self.scale_action(action_idx))  # Step the environment
            done = torch.logical_or(terminated, truncated)
            # Store the transition in memory
            self.rollout_buffer.add(state=obs['policy'],action=action_idx,reward=reward,log_prob=log_prob,values=values,done=done)
            

            # ====================================== #

            # Update state
            obs = next_obs
            active_envs = torch.logical_not(done)

            steps_per_env[active_envs] += 1
            done_idx = torch.where(done)[0]

            cumulative_reward_per_env += reward
            for index in done_idx:
                time_step_buffer.append(steps_per_env[index].item())
                reward_buffer.append(cumulative_reward_per_env[index].item())
                reward_avg = torch.mean(torch.tensor(reward_buffer, dtype=torch.float))
                time_avg = torch.mean(torch.tensor(time_step_buffer , dtype=torch.float))

            steps_per_env[done_idx] = 0
            cumulative_reward_per_env[done_idx] = 0

        last_val = self.critic(obs['policy'])
        # calculate_advantage() returns nothing; it stores its result in self.rollout_buffer.advantages
        self.calculate_advantage(reward, dones=done , last_values=last_val)
        memory = self.rollout_buffer.sample_batch(self.batch_size)
        loss , actor_loss , critic_loss , entropy_bonus = self.update_policy(memory=memory)    
        # print("UPDATING POLICY!! ก'w'ก")

        # reward = 0
        # time_avg = 0
        # loss = 0
        return reward_avg , time_avg , loss , actor_loss , critic_loss , entropy_bonus

    def learn(self, env, max_steps=1000):
        """
        Run one training iteration: collect a rollout, then update the policy once.

        Args:
            env: The environment in which the agent interacts.
            max_steps (int): Number of environment steps to collect before the update.

        Returns:
            Tuple of (reward_avg, timestep_avg, loss, actor_loss, critic_loss, entropy_bonus).
        """
        reward_avg , timestep_avg , loss , actor_loss , critic_loss , entropy_bonus = self.train(env=env , max_steps=max_steps)
        self.training_error.append(loss)

        return reward_avg , timestep_avg , loss , actor_loss , critic_loss , entropy_bonus
        # self.plot_durations(timestep_avg)

    def save_net_weights(self, path, filename):
        """
        Save weight parameters.
        """
        os.makedirs(path, exist_ok=True)
        filepath = os.path.join(path, filename)
        torch.save({
            'actor_state_dict': self.actor.state_dict(),
            'critic_state_dict': self.critic.state_dict(),
        }, filepath)
        
    def load_net_weights(self, path, filename):
        """
        Load weight parameters.
        """
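        # map_location lets a checkpoint saved on GPU be loaded on the current device
        checkpoint = torch.load(os.path.join(path, filename), map_location=self.device)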
        self.actor.load_state_dict(checkpoint['actor_state_dict'])
        self.critic.load_state_dict(checkpoint['critic_state_dict'])

    # ================================================================================== #
    def plot_durations(self, timestep=None, show_result=False):
        if timestep is not None:
            self.episode_durations.append(timestep)

        plt.figure(1)
        durations_t = torch.tensor(self.episode_durations, dtype=torch.float)
        if show_result:
            plt.title('Result')
        else:
            plt.clf()
            plt.title('Training...')
        plt.xlabel('Episode')
        plt.ylabel('Duration')
        plt.plot(durations_t.numpy())
        # Take 100 episode averages and plot them too
        if len(durations_t) >= 100:
            means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
            means = torch.cat((torch.zeros(99), means))
            plt.plot(means.numpy())

        plt.pause(0.001)  # pause a bit so that plots are updated
        if self.is_ipython:
            if not show_result:
                display.display(plt.gcf())
                display.clear_output(wait=True)
            else:
                display.display(plt.gcf())
    # ================================================================================== #
```
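
The backward loop in `calculate_advantage` is the Generalized Advantage Estimation (GAE) recursion. As a minimal sketch, the same recursion is written below with plain Python lists so that it can be traced by hand. The function name `compute_gae`, the default `gamma` and `lam` values, and the un-normalised `returns = advantages + values` output are assumptions made for illustration; they are not taken from the agent above.

```python
def compute_gae(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    """GAE over a rollout stored as nested lists of shape [T][n_envs].

    last_values holds the critic's estimate for the state after the final step.
    Returns (advantages, returns), both shaped [T][n_envs].
    """
    T, n_envs = len(rewards), len(last_values)
    advantages = [[0.0] * n_envs for _ in range(T)]
    returns = [[0.0] * n_envs for _ in range(T)]
    gae = [0.0] * n_envs  # running accumulator, one entry per environment

    for t in reversed(range(T)):
        next_values = last_values if t == T - 1 else values[t + 1]
        for e in range(n_envs):
            mask = 1.0 - float(dones[t][e])  # 0 stops bootstrapping at episode end
            delta = rewards[t][e] + gamma * next_values[e] * mask - values[t][e]
            gae[e] = delta + gamma * lam * mask * gae[e]
            advantages[t][e] = gae[e]
            returns[t][e] = gae[e] + values[t][e]  # assumed critic target
    return advantages, returns


# Two environments, three steps: env 0 terminates at t=2, env 1 at t=1.
adv, ret = compute_gae(
    rewards=[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
    values=[[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    dones=[[0, 0], [0, 1], [1, 0]],
    last_values=[0.5, 0.5],
    gamma=0.9,
    lam=0.8,
)
print([round(a, 4) for a in adv[0]])  # [1.8932, 1.31]
```

In env 1 the `done` flag at `t=1` zeroes the mask, so the advantage at that step depends only on its own reward and value (`1.0 - 0.5 = 0.5`), and nothing from the following episode leaks backwards.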

- PPO must add some rollout buffer for 
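
The calls that `train` and `calculate_advantage` make on `self.rollout_buffer` (`add(...)`, iterating `memory` as 6-tuples, and the `advantages` attribute) could be served by a container like the one sketched below. This is a hypothetical minimal version, not the buffer used in this project: the class layout, `unpack`, and `clear` are assumptions, and `sample_batch` is left out because its return format is not shown here.

```python
from collections import deque, namedtuple

# Field order matches the unpacking in calculate_advantage:
# states, actions, rewards, log_probs_old, values, dones
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "log_prob", "values", "done"]
)


class RolloutBuffer:
    """Hypothetical on-policy storage for one rollout."""

    def __init__(self, buffer_size):
        # Assumption: the oldest transition is dropped once the buffer is full.
        self.memory = deque(maxlen=buffer_size)
        self.advantages = None  # filled in after the rollout is complete

    def add(self, state, action, reward, log_prob, values, done):
        self.memory.append(Transition(state, action, reward, log_prob, values, done))

    def unpack(self):
        """Return one tuple per field, each ordered by time step."""
        return tuple(zip(*self.memory))

    def clear(self):
        self.memory.clear()
        self.advantages = None

    def __len__(self):
        return len(self.memory)


buf = RolloutBuffer(buffer_size=2)
for step in range(3):
    buf.add(state=[0.0], action=step, reward=1.0, log_prob=-0.5, values=0.2, done=False)
states, actions, rewards, log_probs, values, dones = buf.unpack()
print(len(buf), actions)  # 2 (1, 2)
```

Because PPO is on-policy, such a buffer would normally be cleared after each update so that transitions collected under an older policy are not reused.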