<a href="https://colab.research.google.com/github/ThomasWong-ST/Intro-to-RL/blob/main/Proximal_Policy_Optimisation_(Playground).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Proximal Policy Optimization Algorithm
We consider a stochastic policy $\pi_\theta(a \mid s)$ with parameters $\theta$
and a value function $V_w(s)$ with parameters $w$. Given a batch of transitions
$\{(s_t, a_t, r_{t+1}, s_{t+1})\}_{t=1}^N$ collected under an old policy
$\pi_{\theta_{\text{old}}}$, the PPO objective is

$$
J(\theta, w)
=
\mathbb{E}_t \Big[
  L_t^{\text{CLIP}}(\theta)
  - c_1 L_t^{\text{VF}}(w)
  + c_2 S[\pi_\theta](s_t)
\Big].
$$

where

1. **Probability ratio**

   $$
   r_t(\theta)
   = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
   $$

2. **Clipped policy (actor) loss**

   $$
   L_t^{\text{CLIP}}(\theta)
   =
   \min\Big(
     r_t(\theta)\,\hat A_t,\;
     \operatorname{clip}\big(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon\big)
     \, \hat A_t
   \Big).
   $$

   Here $\hat A_t$ is an estimator of the advantage function, e.g. from
   Generalized Advantage Estimation (GAE).

3. **Value (critic) loss**

   $$
   L_t^{\text{VF}}(w)
   =
   \big(
     V_w(s_t) - \hat V_t^{\text{targ}}
   \big)^2.
   $$

   Here $\hat V_t^{\text{targ}}$ is a target for $v^\pi(s_t)$, for example
   $$
   \hat V_t^{\text{targ}} = \hat A_t + V_{w_{\text{old}}}(s_t),
   $$
   where $V_{w_{\text{old}}}$ is the value function used to compute advantages.

4. **Entropy bonus (for discrete actions)**

   $$
   S[\pi_\theta](s_t)
   =
   - \sum_{a} \pi_\theta(a \mid s_t)
         \log \pi_\theta(a \mid s_t).
   $$

The full loss used for gradient *descent* is typically

$$
\mathcal{L}(\theta, w)
=
- \mathbb{E}_t\big[ L_t^{\text{CLIP}}(\theta) \big]
+ c_1 \mathbb{E}_t\big[ L_t^{\text{VF}}(w) \big]
- c_2 \mathbb{E}_t\big[ S[\pi_\theta](s_t) \big],
$$

where $c_1 \ge 0$ controls the strength of the value loss,
and $c_2 \ge 0$ controls the strength of the entropy regularization.
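
As a minimal sketch of how the loss above might be assembled in PyTorch: the helper name `ppo_loss`, its argument names, and the default coefficients (`clip_eps=0.2`, `c1=0.5`, `c2=0.01`) are assumptions made for illustration, not a fixed part of the algorithm.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages, values, value_targets,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Scalar loss L = -E[L_CLIP] + c1 * E[L_VF] - c2 * E[S] (hypothetical helper)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_term = torch.min(unclipped, clipped).mean()          # E[L_CLIP]
    value_term = F.mse_loss(values, value_targets)              # E[L_VF]
    return -policy_term + c1 * value_term - c2 * entropy.mean()

# Toy batch: both advantages are positive, ratios are exp(0.5) and exp(-0.5).
old_lp = torch.tensor([-1.0, -1.0])
new_lp = torch.tensor([-0.5, -1.5])
adv = torch.tensor([1.0, 1.0])
values = torch.zeros(2)
targets = torch.zeros(2)
entropy = torch.zeros(2)

loss = ppo_loss(new_lp, old_lp, adv, values, targets, entropy)
print(loss.item())
```

With a positive advantage, the first ratio (about 1.65) is capped at $1 + \varepsilon = 1.2$, while the second (about 0.61) is left unclipped because the `min` keeps the smaller, more pessimistic term.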


### PPO – Training Pipeline (Summary)

We use two neural networks:

- Actor (policy): $\pi_\theta(a \mid s)$ with parameters $\theta$  
- Critic (value): $V_w(s)$ with parameters $w$

---

#### 1. Initialize

- Initialize $\theta$ and $w$
- Set $\theta_{\text{old}} \leftarrow \theta$, $w_{\text{old}} \leftarrow w$

---

#### 2. Rollout (data collection with old parameters)

Using the **old** policy and value function:

- For $t = 0, 1, \dots, T-1$:
  - Observe state $s_t$
  - Sample action from old policy:
    $$
    a_t \sim \pi_{\theta_{\text{old}}}(\cdot \mid s_t)
    $$
  - Execute $a_t$, observe reward $r_{t+1}$ and next state $s_{t+1}$
  - Store:
    - $s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$, done flag
    - $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
    - $V_{w_{\text{old}}}(s_t)$

---

#### 3. Compute advantages and value targets (with old critic)

Using the **old** value network $V_{w_{\text{old}}}$:

- For each $t$, compute value:
  $$
  v_t = V_{w_{\text{old}}}(s_t)
  $$
- Define TD-errors (for GAE):
  $$
  \delta_t = r_{t+1} + \gamma V_{w_{\text{old}}}(s_{t+1}) - V_{w_{\text{old}}}(s_t)
  $$
- Generalized Advantage Estimation (backwards over the trajectory):
  $$
  \hat A_t = \delta_t + \gamma \lambda \hat A_{t+1}
  $$
  (with $\hat A_T = 0$ at terminal)

- Value targets (for critic):
  $$
  \hat V_t^{\text{targ}} = \hat A_t + V_{w_{\text{old}}}(s_t)
  $$

Optionally normalize advantages:
$$
\hat A_t \leftarrow
\frac{\hat A_t - \text{mean}(\hat A)}{\text{std}(\hat A) + \epsilon}
$$
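
The backward recursion in this step can be checked on a tiny trajectory. The sketch below uses plain NumPy; the function name `gae` and the toy numbers are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np

def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward GAE recursion; values[t] = V(s_t), last_value = V(s_T)."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        mask = 1.0 - float(dones[t])   # 0 stops bootstrapping at a terminal step
        delta = rewards[t] + gamma * next_value * mask - values[t]
        adv[t] = delta + gamma * lam * next_adv * mask
        next_value, next_adv = values[t], adv[t]
    return adv

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, True]

adv = gae(rewards, values, last_value=0.0, dones=dones, gamma=1.0, lam=1.0)
targets = adv + np.array(values)   # value targets: A_t + V(s_t)
print(adv, targets)
```

With $\gamma = \lambda = 1$ the value targets reduce to the Monte Carlo returns `[3, 2, 1]`, which is a convenient sanity check for any GAE implementation.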

---

#### 4. PPO update (multiple epochs over the same batch)

Set **trainable** parameters:
$$
\theta \leftarrow \theta_{\text{old}}, \quad w \leftarrow w_{\text{old}}
$$

Repeat for several epochs and mini-batches:

1. **Policy ratio**
   $$
   r_t(\theta) =
   \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
   $$

2. **Clipped policy loss**
   $$
   L_t^{\text{CLIP}}(\theta) =
   \min\Big(
     r_t(\theta)\,\hat A_t,\;
     \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat A_t
   \Big)
   $$

3. **Value (critic) loss**
   $$
   L_t^{\text{VF}}(w) =
   \big( V_w(s_t) - \hat V_t^{\text{targ}} \big)^2
   $$

   (Optionally with value clipping, using $V_{w_{\text{old}}}(s_t)$.)

4. **Entropy bonus (discrete actions)**
   $$
   S[\pi_\theta](s_t) =
   - \sum_a \pi_\theta(a \mid s_t)\,\log \pi_\theta(a \mid s_t)
   $$

5. **Total objective (to maximize)**
   $$
   J(\theta, w) =
   \mathbb{E}_t\Big[
     L_t^{\text{CLIP}}(\theta)
     - c_1 L_t^{\text{VF}}(w)
     + c_2 S[\pi_\theta](s_t)
   \Big]
   $$

   In code we usually minimize the **loss**:
   $$
   \mathcal{L}(\theta, w) =
   - \mathbb{E}_t[L_t^{\text{CLIP}}(\theta)]
   + c_1 \mathbb{E}_t[L_t^{\text{VF}}(w)]
   - c_2 \mathbb{E}_t[S[\pi_\theta](s_t)]
   $$

6. **Gradient step**
   - Update $\theta$ and $w$ by gradient descent on $\mathcal{L}(\theta, w)$.
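
One way to organise "several epochs and mini-batches" is to reshuffle the batch indices at the start of every epoch. The generator below is a sketch; `minibatch_indices` and its parameters are hypothetical names, not something the algorithm prescribes.

```python
import torch

def minibatch_indices(n_samples, minibatch_size, n_epochs):
    """Yield shuffled index tensors; each epoch visits every sample exactly once."""
    for _ in range(n_epochs):
        perm = torch.randperm(n_samples)
        for start in range(0, n_samples, minibatch_size):
            yield perm[start:start + minibatch_size]

# 10 samples in mini-batches of 4 gives batches of size 4, 4, 2 per epoch.
batches = list(minibatch_indices(n_samples=10, minibatch_size=4, n_epochs=2))
print([len(b) for b in batches])
```

Each yielded index tensor can be used to slice the stored states, actions, old log-probabilities, advantages, and value targets before computing the loss for that gradient step.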

---

#### 5. Update old parameters

After finishing PPO epochs on this batch:

$$
\theta_{\text{old}} \leftarrow \theta, \quad
w_{\text{old}} \leftarrow w
$$

Then go back to Step 2 and collect a new batch with the updated policy.


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.normal import Normal
import torch.nn.functional as F

In [None]:
#kappa_alpha, sigma_alpha = 5.0, 1.0
#kappa_xi, sigma_xi = 15.0, 100.0

#alpha[i+1] = alpha[i] - kappa_alpha*alpha[i]*dt + sigma_alpha * np.sqrt(dt) * np.random.randn()
#xi[i+1]    = xi[i]    - kappa_xi   *xi[i]*dt + sigma_xi    * np.sqrt(dt) * np.random.randn()
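
The commented update rules above are Euler-Maruyama steps of two Ornstein-Uhlenbeck processes. A standalone sketch of that discretisation follows; the function name `simulate_ou`, the seeds, and the horizon `T=1.0` with `dt=0.01` are assumptions for illustration.

```python
import numpy as np

def simulate_ou(kappa, sigma, T=1.0, dt=0.01, x0=0.0, rng=None):
    """Euler-Maruyama path of dx = -kappa * x * dt + sigma * dW."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(T / dt))
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = x[i] - kappa * x[i] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

alpha = simulate_ou(kappa=5.0, sigma=1.0, rng=np.random.default_rng(0))
xi = simulate_ou(kappa=15.0, sigma=100.0, rng=np.random.default_rng(1))
print(alpha.shape, xi.std())
```

For the continuous-time process the stationary standard deviation is $\sigma / \sqrt{2\kappa}$, roughly 0.32 for `alpha` and 18.3 for `xi` with these parameters, so the two state components live on very different scales.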

In [None]:
class NashBrokerTrader:

    def __init__(self):

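        # 1. Time Parameters
        self.T = 1.0        # Total time (e.g., 1 day)
        self.dt = 0.01      # Step size
        self.num_steps = int(self.T / self.dt)

        # 2. OU Process Parameters (Alpha and Xi), price noise, and PnL coefficients
        self.kappa_alpha = 5.0   # Mean-reversion speed of alpha
        self.sigma_alpha = 1.0   # Volatility of alpha
        self.kappa_xi = 15.0     # Mean-reversion speed of xi
        self.sigma_xi = 100.0    # Volatility of xi
        self.sigma_s = 0.5       # Volatility of the price noise
        self.a = 0.1             # Quadratic cost coefficient on the trading rate nu
        self.c = 0.1             # Revenue coefficient on the uninformed flow xi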

        # 3. Market Impact Parameters (for dY_t = (h*nu - p*Y)*dt)
        self.h = 1e-3   # Temporary impact coefficient
        self.p = 0.5   # Impact decay rate

        # 4. Reward Parameters
        self.phi = 1 # Running inventory penalty

        # 5. State variables (Placeholder for now)
        self.t = 0
        # We will define the state vector in reset()


    def reset(self):

      """Resets the environment to the starting state."""
      self.t = 0 # start back at initial time
      self.Q_b, self.Y_impact, self.alpha, self.xi_uninformed = 0, 0, 0, 0
      # Return state as a NumPy array (for compatibility with PyTorch later)
      return np.array([self.Q_b, self.Y_impact, self.alpha, self.xi_uninformed], dtype=np.float32)


    def step(self, action):

      nu = action

      # --- 1. Calculate Changes (Deltas) based on CURRENT State (t) ---

      # Alpha & Xi (OU Processes)
      d_alpha = -self.kappa_alpha * self.alpha * self.dt + self.sigma_alpha * np.sqrt(self.dt) * np.random.randn()

      d_xi = -self.kappa_xi * self.xi_uninformed * self.dt + self.sigma_xi * np.sqrt(self.dt) * np.random.randn()

      # Inventory Change (dQ)
      d_Q = (nu - self.xi_uninformed) * self.dt

      # Impact Change (dY)
      d_Y = (self.h * nu - self.p * self.Y_impact) * self.dt

      # Price Change (dS)
      # dS = alpha*dt + dY + sigma_s*dW, where dW = sqrt(dt) * N(0, 1) is a Brownian increment
      price_noise = self.sigma_s * np.sqrt(self.dt) * np.random.randn()
      d_S = (self.alpha * self.dt) + d_Y + price_noise

      # --- 2. Calculate Reward (Based on t and dt) ---

      # PnL from Trading (Profit from spread - Cost of trading)
      execution_pnl = (self.c * self.xi_uninformed**2 - self.a * nu**2) * self.dt

      # PnL from Holding Inventory (Mark-to-Market change)
      inventory_pnl = self.Q_b * d_S

      # Penalty for Risk (Urgency)
      risk_penalty = -self.phi * (self.Q_b**2) * self.dt

      reward = execution_pnl + inventory_pnl + risk_penalty

      # --- 3. Update State to (t + dt) ---
      self.t += self.dt
      self.alpha += d_alpha
      self.xi_uninformed += d_xi
      self.Q_b += d_Q
      self.Y_impact += d_Y

      # --- 4. Return Step Info ---
      # State Vector: [Inventory, Impact, Alpha, Flow]
      next_state = np.array([self.Q_b, self.Y_impact, self.alpha, self.xi_uninformed], dtype=np.float32)

      # Check termination (e.g., time is up)
      terminated = self.t >= self.T

      return next_state, reward, terminated, False, {}

# Continuous, Normally Distributed Action Space
### Part 1: The Forward Pass (Making a Decision)

1. **The Environment** presents a **State $x$** (Input)
   * *Example:* `[High Inventory, Price Dropping]`
   
   $\downarrow$

2. **The Neural Network** processes $x$ to calculate a **Mean $\mu$**
   * It uses its current weights to decide the "best guess" action.
   * *Result:* `mu = -3.5` ("I think we should sell.")

   $\downarrow$

3. **The Parameter `log_std`** provides the **Noise Level $\sigma$**
   * This ignores the state $x$. It just asks: "How confident are we globally?"
   * *Result:* `sigma = 1.0` ("But I'm willing to explore $\pm 1.0$.")

   $\downarrow$

4. **PyTorch** combines $\mu$ and $\sigma$ into a **Normal Distribution**
   * *Result:* A Bell Curve centered at -3.5 with width 1.0.

   $\downarrow$

5. **The Agent** samples from this distribution to get the **Action $a$**
   * *Result:* `action = -4.2` (A bit more aggressive than the mean).
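
The forward pass above can be sketched in a few lines of PyTorch. The mean is hard-coded here to the example value `-3.5` instead of coming from a network, and the seed is an arbitrary choice for reproducibility.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

mu = torch.tensor([[-3.5]])        # stand-in for the network output, shape [batch, action_dim]
log_std = torch.zeros(1, 1)        # state-independent log standard deviation, so sigma = 1.0
dist = Normal(mu, log_std.exp())   # bell curve centered at mu with width sigma

action = dist.sample()                          # one random action, shape [1, 1]
log_prob = dist.log_prob(action).sum(dim=-1)    # log-density of that action, shape [1]
print(action, log_prob)
```

The log-probability is stored with the action during the rollout because the PPO ratio later compares it with the log-probability of the same action under the updated policy.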

---

### Part 2: The Backward Pass (Learning)

6. **The Environment** returns a **Reward**, from which the **Advantage $A$** is estimated
   * The environment only provides the reward; the advantage is computed from the rewards and the critic's values (e.g., with GAE).
   * *Scenario:* The action `-4.2` was excellent! It cleared inventory and made a profit.
   * *Result:* **Positive Advantage ($A > 0$)**.

   $\downarrow$

7. **The Loss Function** compares the Action `-4.2` to the Distribution
   * It sees that `-4.2` was somewhat far from the center `-3.5`.
   * *Goal:* "Make `-4.2` more likely next time!"

   $\downarrow$

8. **The Optimizer** updates the **Network Weights** (shifting $\mu$)
   * It adjusts the weights so that next time $x$ appears, $\mu$ will be closer to `-4.2` (e.g., shifts to `-3.8`).
   * *Effect:* The "bullseye" moves toward the successful action.

   $\downarrow$

9. **The Optimizer** updates **`log_std`** (shifting $\sigma$)
   * The gradient of the log-probability with respect to `log_std` is $(a - \mu)^2 / \sigma^2 - 1$. Here $|a - \mu| = 0.7$ is smaller than $\sigma = 1.0$, so making `-4.2` more likely narrows the curve slightly.
   * *Effect:* `log_std` decreases a little. It would increase, making the agent more willing to explore, only if the successful action lay more than one $\sigma$ from the mean (e.g., `-5.0`).
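
Whether a positive-advantage update widens or narrows the distribution depends on how far the action lies from the mean relative to $\sigma$. The sketch below uses autograd to read off the gradient of the log-probability with respect to `mu` and `log_std`; the helper `log_prob_grads` and the second test action `-5.5` are assumptions for illustration.

```python
import torch
from torch.distributions import Normal

def log_prob_grads(action, mu_value, log_std_value):
    """Return (d log p / d mu, d log p / d log_std) for a scalar Gaussian policy."""
    mu = torch.tensor(mu_value, requires_grad=True)
    log_std = torch.tensor(log_std_value, requires_grad=True)
    log_prob = Normal(mu, log_std.exp()).log_prob(torch.tensor(action))
    log_prob.backward()
    return mu.grad.item(), log_std.grad.item()

# Action 0.7 sigma from the mean: within one standard deviation.
g_mu_near, g_ls_near = log_prob_grads(-4.2, -3.5, 0.0)
# Action 2 sigma from the mean: in the tail.
g_mu_far, g_ls_far = log_prob_grads(-5.5, -3.5, 0.0)
print(g_mu_near, g_ls_near, g_mu_far, g_ls_far)
```

Both gradients with respect to `mu` are negative, so the mean moves toward the action in either case. The gradient with respect to `log_std` equals $(a - \mu)^2/\sigma^2 - 1$: it is negative for the nearby action (the curve narrows) and positive for the tail action (the curve widens).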

In [None]:
class PPOAgent_actor(nn.Module):

  def __init__(self, input_dim, output_dim, hidden_dim = 64):

    super(PPOAgent_actor, self).__init__()

    # 1. The Mean Network (Outputs mu)
    self.network = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.Tanh(),  # Tanh is a common default for PPO policy networks
        nn.Linear(hidden_dim, hidden_dim),
        nn.Tanh(),
        nn.Linear(hidden_dim, output_dim)
    )

    # 2. The Learnable Log Std (Exploration Noise)
    # We use log_std so we can take exp() later to ensure std is always positive.
    # Because log_std is an nn.Parameter, actor.parameters() returns it alongside
    # the weights and biases of the Linear layers, so the optimizer treats it
    # exactly like any other weight. It is just a "weight" that isn't connected
    # to any input neurons.
    self.log_std = nn.Parameter(torch.zeros(1, output_dim))

  def forward(self, x): # x = [Q^b_t, Y_t, alpha_t, xi_t]
    # Return the Mean
    mu = self.network(x)
    return mu

  def get_dist(self, x):
    # Helper to get the distribution object
    mu = self.forward(x)
    sigma = torch.exp(self.log_std.expand_as(mu))
    return torch.distributions.Normal(mu, sigma)


class PPOAgent_critic(nn.Module):

  def __init__(self, input_dim, hidden_dim = 64):

    super(PPOAgent_critic, self).__init__()

    # 1. The Value Network
    self.network = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.Tanh(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.Tanh(),
        nn.Linear(hidden_dim, 1)
    )

  def forward(self, x):
    return self.network(x)



In [None]:
class RolloutBuffer:

    def __init__(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.dones = []       # NEW: Needed to handle episode endings
        self.values = []      # NEW: The Critic's prediction V(s)

    def add(self, state, action, log_prob, reward, done, value):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.dones.append(done)
        self.values.append(value)

    def clear(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.dones = []
        self.values = []

    def compute_gae(self, last_value, gamma=0.99, lamda=0.95):
        # 1. Setup containers
        advantages = []
        next_value = last_value    # V of the state reached after the last stored step (bootstrap)
        next_advantage = 0         # No advantage is accumulated beyond the last stored step

        # 2. The Backward Loop
        for i in reversed(range(len(self.rewards))):
            # If the episode ended here (done=1), we shouldn't look into the future.
            # mask becomes 0 if done, 1 otherwise.
            mask = 1 - self.dones[i]

            # Calculate Delta (The 1-step TD Error)
            # delta = r + gamma * V(next) - V(current)
            delta = self.rewards[i] + gamma * next_value * mask - self.values[i]

            # Calculate Advantage (The recursive magic)
            # A_t = delta + gamma * lambda * A_t+1
            advantage = delta + gamma * lamda * next_advantage * mask

            # Store it. insert(0, val) pushes the new value to the front of the list,
            # so the backward loop still produces advantages in chronological order.
            advantages.insert(0, advantage)

            # 3. Update "Next" variables for the next iteration (which is i-1)
            next_value = self.values[i]
            next_advantage = advantage

        return advantages


In [None]:
#Test
env = NashBrokerTrader()
rolloutbuffer = RolloutBuffer()
actor_test = PPOAgent_actor(4, 1)
critic_test = PPOAgent_critic(4)

state = env.reset()
state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0) # Add a batch dimension

# The actor's output is a distribution; sampling it gives a random action

#print(actor_test.get_dist(state_tensor)) # The probability distribution
#print(actor_test.get_dist(state_tensor).sample()) # A realisation drawn from that distribution

# The critic's output is a single value

print(critic_test(state_tensor))

tensor([[0.0913]], grad_fn=<AddmmBackward0>)


In [None]:
env = NashBrokerTrader()
rolloutbuffer = RolloutBuffer()
actor = PPOAgent_actor(4, 1)
critic = PPOAgent_critic(4)

#Hyperparameters
lr = 3e-4; gamma = 0.99; epochs = 10 #10 updates per batch
lamda = 0.95; epsilon = 0.2; batch_size = 100

actor_optimizer = torch.optim.Adam(actor.parameters(), lr)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr)

num_updates = 100

# --- Layer 1: Setup (Outside loops) ---
state = env.reset()  # Reset ONCE at the very beginning

# Loop over num_updates rollout/update cycles
for update in range(num_updates):

    # --- Layer 2: Data Collection (The Rollout) ---
    for step in range(batch_size):

        # === LAYER 3: YOUR CODE GOES HERE ===
        # 0. Optimization: Turn off gradients
        with torch.no_grad():
            # 1. Turn state to tensor
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            # 2. Get action from Actor
            distribution = actor.get_dist(state_tensor)
            action = distribution.sample()
            # 2.1. We sum() just in case action_dim > 1 (not strictly needed here but good practice)
            log_prob = distribution.log_prob(action).sum()
            # 3. Get value from Critic
            value = critic(state_tensor)
        # 4. Step the environment
        # action is a tensor like [[-2.5]], we need the float -2.5
        next_state, reward, done, _, _ = env.step(action.item())
        # ====================================

        # --- Layer 4: Storage & Cleanup (Crucial!) ---
        # You must store the data before moving to the next step!
        rolloutbuffer.add(state, action, log_prob, reward, done, value)

        # Update state for the next loop iteration
        state = next_state

        if done:
            state = env.reset()

    # --- Layer 5: The PPO Update (After the batch is full) ---
    # 1. Bootstrap: Get value of the very last state (to handle the "future" for the last step)
    with torch.no_grad():
        next_value = critic(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
    # 2. Calculate Advantages with the GAE method of RolloutBuffer
    advantages = rolloutbuffer.compute_gae(next_value, gamma, lamda)
    # 3. Prepare Data
    # Convert the Python lists into batch tensors for training (N = batch_size)
    tensor_states = torch.tensor(np.array(rolloutbuffer.states), dtype=torch.float32)  # [N, 4]
    tensor_actions = torch.cat(rolloutbuffer.actions, dim=0)             # [N, 1]
    tensor_old_log_probs = torch.stack(rolloutbuffer.log_probs)          # [N]
    tensor_values = torch.cat(rolloutbuffer.values, dim=0).squeeze(-1)   # [N]
    tensor_advantages = torch.tensor([float(a) for a in advantages], dtype=torch.float32)  # [N]
    # Calculate "Returns" (the target for the Critic) from the RAW advantages
    # Return = Advantage + Value (Simple algebra from A = R - V)
    tensor_returns = tensor_advantages + tensor_values
    # TIP: Normalize advantages for the policy loss only, after the returns are computed
    tensor_advantages = (tensor_advantages - tensor_advantages.mean()) / (tensor_advantages.std() + 1e-8)

    # 4. PPO Optimization Epochs
    # We update the same batch of data multiple times (epochs)
    for _ in range(epochs):

        # A. Re-evaluate the data with the CURRENT network
        # (We need to see how the policy has changed since collection)
        dist = actor.get_dist(tensor_states)
        new_log_probs = dist.log_prob(tensor_actions).sum(axis=-1)
        entropy = dist.entropy().mean()

        new_values = critic(tensor_states).squeeze()

        # B. Calculate the Ratio r_t(theta)
        # ratio = exp(new_log_prob - old_log_prob)
        ratio = torch.exp(new_log_probs - tensor_old_log_probs)

        # C. Calculate Surrogate Loss (L_CLIP)
        # surr1 = ratio * advantage
        # surr2 = clamp(ratio, 1-eps, 1+eps) * advantage
        # actor_loss = -mean(min(surr1, surr2))
        # (Note: We negate because optimizers minimize, but we want to maximize the objective)
        surr1 = ratio * tensor_advantages
        surr2 = torch.clamp(ratio, 1-epsilon, 1+epsilon) * tensor_advantages

        # Take the mean over the batch to get a single scalar loss
        actor_loss = -torch.min(surr1, surr2).mean()

        # D. Calculate Critic Loss (L_VF)
        # MSE between new_values and tensor_returns
        critic_loss = F.mse_loss(new_values, tensor_returns)

        # E. Total Loss & Backprop
        loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

        actor_optimizer.zero_grad()
        critic_optimizer.zero_grad()
        loss.backward()
        actor_optimizer.step()
        critic_optimizer.step()

    # 5. Clean up
    rolloutbuffer.clear()


In [None]:
'''env = NashBrokerTrader()
rolloutbuffer = RolloutBuffer()
actor = PPOAgent_actor(4, 1)
critic = PPOAgent_critic(4)

#Hyperparameters
lr = 3e-4; gamma = 0.99; epochs = 10 #10 updates per batch
lamda = 0.95; epsilon = 0.2; batch_size = 100

actor_optimizer = torch.optim.Adam(actor.parameters(), lr)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr)

num_updates = 100

# --- Layer 1: Setup (Outside loops) ---
state = env.reset()  # Reset ONCE at the very beginning

# Loop for many updates (e.g., 1000 times)
for update in range(num_updates):

    # --- Layer 2: Data Collection (The Rollout) ---
    for step in range(batch_size):

        # === LAYER 3: YOUR CODE GOES HERE ===
        # 0. Optimization: Turn off gradients
        with torch.no_grad():
            # 1. Turn state to tensor
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            # 2. Get action from Actor
            distribution = actor.get_dist(state_tensor)
            action = distribution.sample()
            # 2.1. We sum() just in case action_dim > 1 (not strictly needed here but good practice)
            log_prob = distribution.log_prob(action).sum()
            # 3. Get value from Critic
            value = critic(state_tensor)
        # 4. Step the environment
        # action is a tensor like [[-2.5]], we need the float -2.5
        next_state, reward, done, _, _ = env.step(action.item())
        # ====================================

        # --- Layer 4: Storage & Cleanup (Crucial!) ---
        # You must store the data before moving to the next step!
        rolloutbuffer.add(state, action, log_prob, reward, done, value)

        # Update state for the next loop iteration
        state = next_state

        if done:
            state = env.reset()

    # --- Layer 5: The PPO Update (After the batch is full) ---
    # 1. Bootstrap: Get value of the very last state (to handle the "future" for the last step)
    with torch.no_grad():
        next_value = critic(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
    # 2. Calculate Advantages
    # (Use the method you wrote in RolloutBuffer)
    advantages = rolloutbuffer.compute_gae(next_value, gamma, lamda)

    # 3. Prepare Data
    # Convert python lists to one big tensor for training
    tensor_states = torch.tensor(np.array(rolloutbuffer.states), dtype=torch.float32)
    tensor_actions = torch.tensor(np.array(rolloutbuffer.actions), dtype=torch.float32)
    tensor_old_log_probs = torch.tensor(np.array(rolloutbuffer.log_probs), dtype=torch.float32)
    # Ensure advantages is a tensor and then flatten if necessary
    tensor_advantages = torch.cat(advantages).squeeze(-1) if advantages and len(advantages[0].shape) > 0 else torch.tensor(advantages, dtype=torch.float32)
    # Calculate "Returns" (The target for the Critic)
    tensor_returns = tensor_advantages + torch.tensor(np.array(rolloutbuffer.values), dtype=torch.float32).squeeze(-1)

    # 4. PPO Optimization Epochs
    # We update the same batch of data multiple times (epochs)
    for _ in range(epochs):

        # A. Re-evaluate the data with the CURRENT network
        # (We need to see how the policy has changed since collection)
        dist = actor.get_dist(tensor_states)
        new_log_probs = dist.log_prob(tensor_actions).sum(axis=-1)
        entropy = dist.entropy().mean()

        new_values = critic(tensor_states).squeeze()

        # B. Calculate the Ratio r_t(theta)
        ratio = torch.exp(new_log_probs - tensor_old_log_probs)

        # C. Calculate Surrogate Loss (L_CLIP)
        clip_ratio = torch.clamp(ratio, 1-epsilon, 1+epsilon)
        actor_loss = -torch.min(ratio * tensor_advantages, clip_ratio * tensor_advantages).mean() # .mean() added here

        # D. Calculate Critic Loss (L_VF)
        critic_loss = F.mse_loss(new_values, tensor_returns)

        # E. Total Loss & Backprop
        loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

        actor_optimizer.zero_grad()
        critic_optimizer.zero_grad()
        loss.backward()
        actor_optimizer.step()
        critic_optimizer.step()

    # 5. Clean up
    rolloutbuffer.clear()'''

