# TP 4 - Policy Gradient Methods, REINFORCE, baselines (M2AI RL  UP-SACLAY 2024-2025)

### Instructions
This assignment is an introduction to policy gradient methods. We will implement the REINFORCE algorithm and reduce the variance of its gradient estimate with baselines. We will then combine parametrized value approximations with parametrized policy approximations.

- Save this notebook in ```.ipynb``` format and send it to cyriaque.rousselot(at)inria(dot)fr with the name ```TP4_NAME_SURNAME``` before next Sunday. Please use the email subject ```[RL MASTER TP4]``` followed by your name and surname.
- Make sure to comment your code and explain your decisions clearly. Write explanations in text if necessary.
- Answers must be short and precise; they do not require thousands of lines of code.
- Generally, the code to complete is indicated with the comment ```#TO IMPLEMENT```.

Good luck!


In the previous practical session, you saw that sampling-based estimates of the value can be replaced by parametrized value functions. Here, we will look at approaches that learn a parametrized policy:
$$ \pi(a|s;\theta)$$

We will update this policy following gradient ascent, similarly to the previous practical session, with a different objective $J(\theta)$ to maximize:

$$ \theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t)$$

$\nabla J(\theta_t)$ will be replaced by an approximation of the gradient of the performance of the policy.
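As a minimal sketch of the gradient ascent update above, the snippet below maximises a toy objective whose gradient is known in closed form. The objective $J(\theta) = -(\theta - 3)^2$, the step size, and the function names are assumptions chosen for illustration only; in the rest of this session the exact gradient is replaced by a sampled estimate.

```python
# Toy objective (an assumption for illustration): J(theta) = -(theta - 3)^2,
# whose maximiser is theta = 3 and whose gradient is -2 * (theta - 3).
def grad_J(theta: float) -> float:
    return -2.0 * (theta - 3.0)


def gradient_ascent(theta0: float, alpha: float, steps: int) -> float:
    theta = theta0
    for _ in range(steps):
        # theta_{t+1} = theta_t + alpha * grad J(theta_t)
        theta = theta + alpha * grad_J(theta)
    return theta


theta_star = gradient_ascent(theta0=0.0, alpha=0.1, steps=200)
print(theta_star)  # approaches the maximiser theta = 3
```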

## Policy Gradient Theorem (Reminder)



We want to maximise the value from the starting state $s_0$:

$$
J(\theta) = v_{\pi_\theta}(s_0).
$$

where $J(\theta)$ represents the expected return starting from the initial state $s_0$, and $v_{\pi_\theta}(s)$ is the value function under the policy $\pi_\theta$. We have

$$
\nabla_\theta J(\theta) = \nabla_\theta v_{\pi_\theta}(s_0),
$$



Expanding $v_{\pi_\theta}(s_0)$ in terms of the policy $\pi_\theta$, we get:

$$
v_{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t r_t \, \bigg| \, s_0 \right],
$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory generated by the policy $\pi_\theta$, $\gamma$ is the discount factor, and $r_t$ is the reward at time step $t$.
Writing this expectation explicitly over the trajectory distribution induced by $\pi_\theta$, we have:

$$
v_{\pi_\theta}(s_0) = \int_{\tau} p_{\pi_\theta}(\tau) G(\tau),
$$

where $p_{\pi_\theta}(\tau)$ is the probability of sampling a trajectory $\tau$ under the policy $\pi_\theta$, and $G(\tau) = \sum_{t=0}^\infty \gamma^t r_t$ is the return of the trajectory.

The trajectory probability $p_{\pi_\theta}(\tau)$ can be written as:

$$
p_{\pi_\theta}(\tau) = p(s_0) \prod_{t=0}^\infty \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t),
$$

where $p(s_0)$ is the distribution of the initial state, $\pi_\theta(a_t | s_t)$ is the policy, and $p(s_{t+1} | s_t, a_t)$ is the state transition probability.

The gradient $\nabla_\theta v_{\pi_\theta}(s_0)$ can thus be expressed as:

$$
\nabla_\theta v_{\pi_\theta}(s_0) = \int_{\tau} \nabla_\theta p_{\pi_\theta}(\tau) G(\tau).
$$

Using the log-derivative trick:

$$
\nabla_\theta p_{\pi_\theta}(\tau) = p_{\pi_\theta}(\tau) \nabla_\theta \log p_{\pi_\theta}(\tau),
$$

we obtain:

$$
\nabla_\theta v_{\pi_\theta}(s_0) = \int_{\tau} p_{\pi_\theta}(\tau) \nabla_\theta \log p_{\pi_\theta}(\tau) G(\tau).
$$

The expectation form of this integral is:

$$
\nabla_\theta v_{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p_{\pi_\theta}(\tau) G(\tau) \right].
$$

Substituting $p_{\pi_\theta}(\tau) = p(s_0) \prod_{t=0}^\infty \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t)$ into $\log p_{\pi_\theta}(\tau)$, we see that only $\log \pi_\theta(a_t | s_t)$ depends on $\theta$. Thus:


$$
\nabla_\theta v_{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t | s_t) G(\tau) \right].
$$

Finally, since $J(\theta) = v_{\pi_\theta}(s_0)$, we arrive at the policy gradient theorem:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t | s_t) G(\tau) \right].
$$

The policy gradient theorem provides a framework for directly optimizing the policy $\pi_\theta$ using gradient ascent on $J(\theta)$.
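To make the theorem concrete, here is a small numerical check on a hypothetical one-step problem with two actions (a bandit), where the exact gradient of $J(\theta)$ can be written by hand. The policy $\pi_\theta(a=1) = \sigma(\theta)$, the rewards, and all function names are assumptions made for this sketch; it only illustrates that averaging $\nabla_\theta \log \pi_\theta(a) \, G$ over sampled actions approaches $\nabla_\theta J(\theta)$.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical two-armed bandit: one state, one step, deterministic rewards.
REWARDS = [1.0, 3.0]


def score(theta: float, a: int) -> float:
    """d/dtheta log pi_theta(a) for pi_theta(1) = sigmoid(theta)."""
    return a - sigmoid(theta)


def exact_gradient(theta: float) -> float:
    """d/dtheta of J(theta) = (1 - p) * r0 + p * r1, with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (REWARDS[1] - REWARDS[0])


def score_function_estimate(theta: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo average of grad log pi(a) * G over sampled actions."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        a = 1 if rng.random() < p else 0
        total += score(theta, a) * REWARDS[a]
    return total / n_samples


theta = 0.5
estimate = score_function_estimate(theta, n_samples=100_000)
print(exact_gradient(theta), estimate)  # the two values should be close
```

Note that the estimate never differentiates the reward or the environment: only $\log \pi_\theta$ is differentiated, which is what makes the approach usable when the dynamics are unknown.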



## Policy network design

We will build the policy in two ways:
+ First, we try a simple linear model $\pi^{softmax}_\theta(a| s)$
+ Second we try a more complex neural network $\pi^{network}_\theta(a| s)$


> If you do not manage to build $\pi^{softmax}_\theta(a| s)$, you can still implement the REINFORCE algorithm for the following $\pi^{network}_\theta(a| s)$ (see next part).

### Softmax method

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state–action pair.

Let's build $h(s, a, \theta)$ using feature representations:
   $$
   \phi(s, a)^\top \theta = h(s, a, \theta)
   $$
and $\pi^{softmax}_\theta(a| s)$:
$$
\pi^{softmax}_\theta(a| s) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}
$$

The actions with the highest preferences in each state are given the
highest probabilities of being selected.
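The mapping from preferences to probabilities can be sketched in a few lines of plain Python. The preference values below are hypothetical stand-ins for $h(s, a, \theta)$ over three actions; the sketch shows that the probabilities sum to one and that the largest preference receives the largest probability.

```python
import math


def softmax(preferences: list[float]) -> list[float]:
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    z = sum(exps)
    return [e / z for e in exps]


prefs = [1.0, 2.0, 0.5]  # hypothetical values of h(s, a, theta) for three actions
probs = softmax(prefs)
print(probs)
```

Because only differences between preferences matter, adding the same constant to every $h(s, a, \theta)$ leaves the policy unchanged.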

Let's build our $\phi$.

In [1]:
%pip install gymnasium



In [2]:
from sklearn.kernel_approximation import RBFSampler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
import gymnasium as gym
# Fit phi on synthetic samples covering the CartPole state-action space
# (4 state dimensions followed by 1 action dimension)

n_components = 5
preprocessing_states = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("feature_generation", RBFSampler(n_components=n_components)),
    ]
)

preprocessing_states.fit(
    np.array(
        [
            [
                np.random.uniform(-4.8, 4.8),
                np.random.normal(0, 1),
                np.random.uniform(-0.418, 0.418),
                np.random.normal(0, 1),
                np.random.choice([0, 1, 2]),
            ]
            for _ in range(10000)
        ]
    )
)


def phi(s, a) -> np.ndarray:
    action = np.array(a)
    state = np.array(s)
    return preprocessing_states.transform([np.hstack([state, action])])[0]


Let's use CartPole for now to test our $\phi$

In [3]:
# instantiate the environment
env = gym.make("CartPole-v1", render_mode="rgb_array")
state, _ = env.reset()
action = env.action_space.sample()

print("Example of phi(s,a) ", phi(state, action))


Example of phi(s,a)  [-0.61531422  0.33184415  0.43636739 -0.0449666   0.59365549]


Let's define our h :

In [37]:
def h(a, state, theta):
    return np.dot(phi(state, a), theta)


**Q.1: Using the previous $\phi$, implement a function softmax_policy that returns, for the CartPole environment, an action sampled from the distribution $\pi^{softmax}_\theta(a| s)$ together with the distribution itself:**
$$
\pi^{softmax}_\theta(a| s) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}
$$


In [36]:
theta = np.random.normal(0, 1, size=(n_components))


def softmax_policy(state, h, theta=theta):
    """Sample an action from the softmax policy; return it with the action probabilities."""
    # Compute preferences for both actions (0 and 1 in CartPole)
    actions = [0, 1]
    preferences = np.array([h(a, state, theta) for a in actions])

    # Apply softmax to compute probabilities
    exp_preferences = np.exp(preferences - np.max(preferences))  # Numerical stability
    probs = exp_preferences / np.sum(exp_preferences)
    return np.random.choice(actions, p=probs), probs


softmax_policy(state, h)

(1, array([0.48692496, 0.51307504]))

**Q.2 Show that $\nabla_\theta \log \pi^{softmax}_\theta(a| s) = \phi(s,a) - \sum_b \pi^{softmax}_\theta(b| s) \phi(s,b)$**

\begin{align*}
\nabla_\theta \log \pi^{softmax}_\theta(a| s) &= \nabla_\theta \log \left( \frac{e^{\phi(s,a)^\top \theta}}{\sum_b e^{\phi(s,b)^\top \theta}} \right) \\
&= \nabla_\theta \left( \phi(s,a)^\top \theta - \log \left( \sum_b e^{\phi(s,b)^\top \theta} \right) \right) \\
&= \nabla_\theta \left( \phi(s,a)^\top \theta \right) - \nabla_\theta \log \left( \sum_b e^{\phi(s,b)^\top \theta} \right) \\
&= \phi(s,a) - \nabla_\theta \log \left( \sum_b e^{\phi(s,b)^\top \theta} \right).
\end{align*}

For the second term:
\begin{align*}
\nabla_\theta \log \left( \sum_b e^{\phi(s,b)^\top \theta} \right) &= \frac{\nabla_\theta \sum_b e^{\phi(s,b)^\top \theta}}{\sum_b e^{\phi(s,b)^\top \theta}} \\
&= \frac{\sum_b e^{\phi(s,b)^\top \theta} \, \nabla_\theta \left( \phi(s,b)^\top \theta \right)}{\sum_b e^{\phi(s,b)^\top \theta}} \\
&= \sum_b \pi^{softmax}_\theta(b| s) \phi(s,b).
\end{align*}

Finally:
\begin{align*}
\nabla_\theta \log \pi^{softmax}_\theta(a| s) &= \phi(s,a) - \sum_b \pi^{softmax}_\theta(b| s) \phi(s,b).
\end{align*}
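The identity can also be checked numerically. The sketch below compares the closed-form expression with a central finite difference of $\log \pi^{softmax}_\theta(a|s)$; the feature vectors in `PHI` and the value of `theta` are hypothetical numbers for one fixed state, not the RBF features used in this notebook.

```python
import math

# Hypothetical feature vectors phi(s, a) for one fixed state and two actions.
PHI = {0: [0.2, -0.5, 1.0], 1: [-0.3, 0.8, 0.4]}


def log_pi(theta, a):
    """log of the softmax policy with preferences h(s, b) = phi(s, b)^T theta."""
    prefs = {b: sum(x * w for x, w in zip(PHI[b], theta)) for b in PHI}
    m = max(prefs.values())
    log_z = m + math.log(sum(math.exp(h - m) for h in prefs.values()))
    return prefs[a] - log_z


def analytic_grad(theta, a):
    """phi(s, a) - sum_b pi(b | s) phi(s, b)."""
    pi = {b: math.exp(log_pi(theta, b)) for b in PHI}
    expected_phi = [sum(pi[b] * PHI[b][i] for b in PHI) for i in range(len(theta))]
    return [PHI[a][i] - expected_phi[i] for i in range(len(theta))]


def numeric_grad(theta, a, eps=1e-6):
    """Central finite difference of log_pi with respect to each component of theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((log_pi(up, a) - log_pi(down, a)) / (2 * eps))
    return grad


theta = [0.1, -0.4, 0.7]
max_error = max(
    abs(x - y) for x, y in zip(analytic_grad(theta, 1), numeric_grad(theta, 1))
)
print(max_error)  # should be tiny if the formula is right
```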

**Q.3 Implement a function that returns $\nabla_\theta \log \pi^{softmax}_\theta(a| s)$**

In [35]:
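def grad_softmax(s, a, theta=theta):
    """Return grad_theta log pi_theta(a | s) = phi(s, a) - sum_b pi_theta(b | s) phi(s, b)."""
    # Only the action probabilities are needed here; the sampled action is discarded
    _, pi = softmax_policy(s, h, theta)
    phi_a = phi(s, a)
    expected_phi = sum(pi[b] * phi(s, b) for b in range(len(pi)))
    return phi_a - expected_phi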


print("Gradient", grad_softmax(state, action))

Gradient [-0.45392287  0.63473823  0.6478408   0.28762471  0.00831014]


## Neural Network policy

Definition of the network

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        # Define a simple feedforward network
        self.fc1 = nn.Linear(state_dim, hidden_dim)  # First fully connected layer
        self.fc2 = nn.Linear(hidden_dim, action_dim)  # Output layer for action logits

    def forward(self, state):
        """
        Forward pass to compute action probabilities.
        Args:
            state (torch.Tensor): The input state tensor of shape (batch_size, state_dim).
        Returns:
            torch.Tensor: Action probabilities of shape (batch_size, action_dim).
        """
        x = F.relu(self.fc1(state))  # Apply ReLU activation to the first layer
        logits = self.fc2(x)  # Compute action logits
        action_probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
        return action_probs


Apply the policy network to the environment to get probabilities

In [33]:
import gymnasium as gym
import torch

# Create CartPole environment
env = gym.make("CartPole-v1")

# Initialize the policy network
state_dim = env.observation_space.shape[0]  # Dimension of the state space
action_dim = env.action_space.n  # Number of discrete actions
policy_net = PolicyNetwork(state_dim, action_dim)

# Example forward pass
state, _ = env.reset()  # Reset the environment to get the initial state
torch_state = torch.FloatTensor(state).unsqueeze(0)  # Convert state to tensor with batch dimension
action_probs = policy_net(torch_state)

# Print action probabilities and choose an action using torch.distributions.Categorical
print("Action probabilities:", action_probs)
action_distribution = torch.distributions.Categorical(action_probs)
action = action_distribution.sample()
print("Action chosen", action.item())


Action probabilities: tensor([[0.5310, 0.4690]], grad_fn=<SoftmaxBackward0>)
Action chosen 0


## REINFORCE

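The principle of the REINFORCE algorithm is the following:

+ Collect trajectories using the current policy $\pi_\theta$.
+ Compute the returns $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the last step of the episode.
+ Apply the policy gradient update:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, G_t \right],
$$

$$
\theta \leftarrow \theta + \alpha G_t  \nabla_\theta  \log \pi_\theta(a_t | s_t)
$$

Here the full-trajectory return $G(\tau)$ of the policy gradient theorem is replaced by the return from step $t$, $G_t$, because an action cannot influence the rewards received before it.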


 ![reinforce_algorithm](REINGORCE.png)
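The returns $G_t$ used by REINFORCE can be computed either directly from their definition or with a backward recursion, $G_t = r_t + \gamma G_{t+1}$. The sketch below implements both on a hypothetical three-step episode; the function names are illustrative and the reward values are made up.

```python
def returns_by_definition(rewards, gamma):
    """G_t = sum over t' >= t of gamma^(t' - t) * r_t'."""
    T = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        for t in range(T)
    ]


def returns_by_recursion(rewards, gamma):
    """Same quantity via G_t = r_t + gamma * G_{t+1}, computed backwards."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()
    return out


rewards = [1.0, 1.0, 1.0]
print(returns_by_recursion(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

The recursion visits each reward once, so it runs in linear time, whereas the direct definition is quadratic in the episode length.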

**Q.4 Build a function train_reinforce to solve the environment CartPole using the $\pi^{softmax}_\theta(a| s)$ policy updated using REINFORCE**

In [38]:
import gymnasium as gym
import numpy as np


# Function to compute discounted returns
def compute_returns(rewards, gamma=0.99):
    discounted_returns = []
    G = 0
    for reward in reversed(rewards):
        G = reward + gamma * G
        discounted_returns.insert(0, G)
    return discounted_returns

def train_reinforce(env_name="CartPole-v1", episodes=1000, gamma=0.99, lr=1e-3):
    # Initialize environment and parameters of the softmax policy
    env = gym.make(env_name)
    theta = np.random.normal(0, 1, size=(n_components))
    for episode in range(episodes):
        state, _ = env.reset()

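        rewards = []
        states = []
        actions = []
        action_probs = []

        # Collect a single trajectory
        done = False
        while not done:
            # Sample an action from the softmax policy under the current theta
            action, probs = softmax_policy(state, h, theta)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # Also stop when the time limit truncates the episode
            rewards.append(reward)
            states.append(state)
            actions.append(action)
            action_probs.append(probs)
            state = next_state

        # Compute discounted returns
        returns = compute_returns(rewards, gamma)
        returns = np.array(returns)

        # Normalize returns for numerical stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        # Accumulate the policy gradient: sum_t G_t * grad_theta log pi_theta(a_t | s_t).
        # The score uses the probabilities recorded under the theta being trained
        # (theta is fixed within an episode), not the notebook-level theta.
        policy_gradient = np.zeros_like(theta)
        for t, Gt in enumerate(returns):
            expected_phi = sum(p * phi(states[t], b) for b, p in enumerate(action_probs[t]))
            policy_gradient += Gt * (phi(states[t], actions[t]) - expected_phi)

        # Gradient ascent step on the softmax policy weights
        theta += lr * policy_gradient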


        # Logging
        total_reward = sum(rewards)
        print(f"Episode {episode + 1}, Total Reward: {total_reward}")

        # Optional: stop early once an episode reaches a total reward of 195
        # (195 is the CartPole-v0 threshold; CartPole-v1 registers 475)
        if total_reward >= 195:
            print("Environment solved!")
            break

    env.close()
    return theta


# Train the policy weights using REINFORCE
trained_policy = train_reinforce()


Episode 1, Total Reward: 84.0
Episode 2, Total Reward: 46.0
Episode 3, Total Reward: 80.0
Episode 4, Total Reward: 77.0
Episode 5, Total Reward: 60.0
Episode 6, Total Reward: 20.0
Episode 7, Total Reward: 54.0
Episode 8, Total Reward: 60.0
Episode 9, Total Reward: 98.0
Episode 10, Total Reward: 27.0
Episode 11, Total Reward: 94.0
Episode 12, Total Reward: 93.0
Episode 13, Total Reward: 70.0
Episode 14, Total Reward: 75.0
Episode 15, Total Reward: 26.0
Episode 16, Total Reward: 70.0
Episode 17, Total Reward: 78.0
Episode 18, Total Reward: 145.0
Episode 19, Total Reward: 99.0
Episode 20, Total Reward: 92.0
Episode 21, Total Reward: 55.0
Episode 22, Total Reward: 29.0
Episode 23, Total Reward: 123.0
Episode 24, Total Reward: 118.0
Episode 25, Total Reward: 74.0
Episode 26, Total Reward: 47.0
Episode 27, Total Reward: 59.0
Episode 28, Total Reward: 83.0
Episode 29, Total Reward: 80.0
Episode 30, Total Reward: 92.0
Episode 31, Total Reward: 84.0
Episode 32, Total Reward: 126.0
Episode 33, T

**Q.5 Build a function train_reinforce to solve the environment CartPole using the $\pi^{network}_\theta(a| s)$ policy updated using REINFORCE**

In [27]:
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical


# Function to sample an action based on probabilities
def select_action(policy_net, state):
    state = torch.FloatTensor(state).unsqueeze(0)  # Add batch dimension
    action_probs = policy_net(state)
    action_distribution = torch.distributions.Categorical(action_probs)
    action = action_distribution.sample()
    return action.item(), action_distribution.log_prob(action)


# Function to compute discounted returns
def compute_returns(rewards, gamma=0.99):
    discounted_returns = []
    G = 0
    for reward in reversed(rewards):
        G = reward + gamma * G
        discounted_returns.insert(0, G)
    return discounted_returns


# Training loop for REINFORCE
def train_reinforce(env_name="CartPole-v1", episodes=1000, gamma=0.99, lr=1e-3):
    # Initialize environment and policy network
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    policy_net = PolicyNetwork(state_dim, action_dim) # TODO : Initialize the policy network
    optimizer = optim.Adam(policy_net.parameters(), lr=lr)

    for episode in range(episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []

        # Collect a single trajectory
        done = False
        while not done:
            action, log_prob = select_action(policy_net, state)  # Sample an action and keep its log-probability
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # Also stop when the time limit truncates the episode
            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Compute discounted returns
        returns = compute_returns(rewards, gamma)
        returns = torch.FloatTensor(returns)

        # Normalize returns for numerical stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        # Compute policy gradient loss
        loss = 0
        for log_prob, G in zip(log_probs, returns):
            loss += -log_prob * G # TODO : Compute the policy gradient loss

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Logging
        total_reward = sum(rewards)
        print(f"Episode {episode + 1}, Total Reward: {total_reward}")

        # Optional: stop early once an episode reaches a total reward of 195
        # (195 is the CartPole-v0 threshold; CartPole-v1 registers 475)
        if total_reward >= 195:
            print("Environment solved!")
            break

    env.close()
    return policy_net


# Train the policy network
trained_policy = train_reinforce()


Episode 1, Total Reward: 23.0
Episode 2, Total Reward: 75.0
Episode 3, Total Reward: 19.0
Episode 4, Total Reward: 14.0
Episode 5, Total Reward: 62.0
Episode 6, Total Reward: 98.0
Episode 7, Total Reward: 27.0
Episode 8, Total Reward: 44.0
Episode 9, Total Reward: 14.0
Episode 10, Total Reward: 27.0
Episode 11, Total Reward: 24.0
Episode 12, Total Reward: 29.0
Episode 13, Total Reward: 40.0
Episode 14, Total Reward: 59.0
Episode 15, Total Reward: 18.0
Episode 16, Total Reward: 22.0
Episode 17, Total Reward: 20.0
Episode 18, Total Reward: 33.0
Episode 19, Total Reward: 13.0
Episode 20, Total Reward: 43.0
Episode 21, Total Reward: 17.0
Episode 22, Total Reward: 22.0
Episode 23, Total Reward: 41.0
Episode 24, Total Reward: 41.0
Episode 25, Total Reward: 13.0
Episode 26, Total Reward: 15.0
Episode 27, Total Reward: 15.0
Episode 28, Total Reward: 20.0
Episode 29, Total Reward: 36.0
Episode 30, Total Reward: 16.0
Episode 31, Total Reward: 15.0
Episode 32, Total Reward: 21.0
Episode 33, Total

**Q.6 Test multiple configurations of networks, and multiple environments ( CartPole, CliffWalking, etc). Analyse the training process and your results ;  write a paragraph on the impact of your  architecture choice and on the impact of changing the environment on the training process.**


# Baselines

As seen in the course, it is possible to reduce the variance of the estimator by using baselines. We will focus on learned baselines.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t | s_t) \left( G_t - b(s_t) \right) \right],
$$
with
- $G_t$: The return from time $t$.
- $b(s_t)$: A baseline function, often parameterized as $b(s_t) = V(s_t; \omega)$, where $V(s_t; \omega)$ is a value network trained to estimate the state value.


The update rule of the policy network is given by:

$$
\theta \leftarrow \theta + \alpha \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \left( G_t - b(s_t) \right),
$$

where $T$ is the length of the trajectory.
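The effect of a baseline can be illustrated exactly on a hypothetical one-step problem with two actions, where the expectation and variance of the gradient estimate are computed by enumerating both actions. The policy $\pi_\theta(a=1) = \sigma(\theta)$, the rewards, and the choice of the expected reward as baseline are assumptions for this sketch; it is not the learned baseline used later.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


REWARDS = [1.0, 3.0]  # hypothetical one-step, two-action problem


def gradient_mean_and_variance(theta, baseline):
    """Exact mean and variance of grad log pi(a) * (G - b) under pi_theta."""
    p = sigmoid(theta)
    probs = [1.0 - p, p]
    # For pi_theta(1) = sigmoid(theta), grad log pi_theta(a) = a - p
    samples = [(a - p) * (REWARDS[a] - baseline) for a in (0, 1)]
    mean = sum(q * g for q, g in zip(probs, samples))
    var = sum(q * (g - mean) ** 2 for q, g in zip(probs, samples))
    return mean, var


theta = 0.5
p = sigmoid(theta)
expected_reward = (1.0 - p) * REWARDS[0] + p * REWARDS[1]

mean_no_b, var_no_b = gradient_mean_and_variance(theta, baseline=0.0)
mean_b, var_b = gradient_mean_and_variance(theta, baseline=expected_reward)
print(mean_no_b, mean_b)  # identical up to rounding: the baseline adds no bias
print(var_no_b, var_b)    # the baseline lowers the variance in this example
```

The mean is unchanged because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$, so any term that does not depend on the action cancels in expectation.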


## Learning a  Baseline

The value network $V(s; \omega)$ is updated to minimize the mean squared error (MSE) between the predicted state value $V(s_t; \omega)$ and the observed return $G_t$. The objective function for the value network is:

$$
L(\omega) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( G_t - V(s_t; \omega) \right)^2 \right].
$$

The gradient of the loss function with respect to the value network parameters $\omega$ is:

$$
\nabla_\omega L(\omega) = - 2 \, \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( G_t - V(s_t; \omega) \right) \nabla_\omega V(s_t; \omega) \right].
$$

In practice, this gradient is estimated using sampled trajectories, and the value network parameters are updated using gradient descent:

$$
\omega \leftarrow \omega - \beta \nabla_\omega L(\omega),
$$

where:
- $\omega$: Parameters of the value network.
- $\beta$: Learning rate for the value network.

- $V(s_t; \omega)$: The value network's prediction for the state value at $s_t$.

The value network thus learns to approximate the expected return for each state.
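The value update can be sketched without a neural network by using a linear value function on made-up data. Everything below (the scalar feature `xs`, the returns `Gs`, the two-parameter model, and the step size) is an assumption for illustration. The constant factor that comes from differentiating the square is kept explicit here; in practice it can be absorbed into the learning rate $\beta$.

```python
# Hypothetical data: scalar state feature x_t and observed return G_t = 1 + 2 * x_t.
xs = [0.0, 1.0, 2.0, 3.0]
Gs = [1.0, 3.0, 5.0, 7.0]


def value(omega, x):
    # Linear value function V(s; omega) = omega[0] + omega[1] * x
    return omega[0] + omega[1] * x


def mse_gradient(omega):
    """Gradient of mean((G - V)^2) w.r.t. omega: -2 * mean((G - V) * grad V)."""
    n = len(xs)
    g0 = -2.0 * sum(G - value(omega, x) for x, G in zip(xs, Gs)) / n
    g1 = -2.0 * sum((G - value(omega, x)) * x for x, G in zip(xs, Gs)) / n
    return [g0, g1]


omega = [0.0, 0.0]
beta = 0.05
for _ in range(2000):
    grad = mse_gradient(omega)
    # omega <- omega - beta * grad L(omega)
    omega = [w - beta * g for w, g in zip(omega, grad)]

print(omega)  # approaches [1.0, 2.0], the parameters that generated the returns
```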


Let's define the value network

In [None]:
# Define the Value Network
class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        value = self.fc2(x)
        return value


**Q.7 Complete this REINFORCE algorithm with a learned baseline.**

![reinforce_algorithm_baseline](REINFORCE_baseline.png)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


# Training loop for REINFORCE with baseline
def train_reinforce_with_baseline(
    env_name="CartPole-v1", episodes=1000, gamma=0.99, lr_policy=1e-3, lr_value=1e-3
):
    # Initialize environment, policy network, and value network
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    # TODO completed (one possible choice): reuse the PolicyNetwork defined earlier
    policy_net = PolicyNetwork(state_dim, action_dim)
    value_net = ValueNetwork(state_dim)

    optimizer_policy = optim.Adam(policy_net.parameters(), lr=lr_policy)
    optimizer_value = optim.Adam(value_net.parameters(), lr=lr_value)

    for episode in range(episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        values = []

        # Collect a single trajectory
        done = False
        while not done:
            value = value_net(
                torch.FloatTensor(state).unsqueeze(0)
            )  # Estimate value of current state
            # TODO completed (one possible choice): reuse select_action from Q.5
            action, log_prob = select_action(policy_net, state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # Also stop when the time limit truncates the episode

            log_probs.append(log_prob)
            rewards.append(reward)
            values.append(value.squeeze(0))

            state = next_state

        # Compute discounted returns
        returns = compute_returns(rewards, gamma)
        returns = torch.FloatTensor(returns)

        # Compute baseline-adjusted advantages (detached so the policy loss
        # does not backpropagate into the value network)
        values = torch.cat(values)
        advantages = (returns - values).detach()

        # Update value network
        # TODO completed (one possible choice): MSE between V(s_t; omega) and G_t
        value_loss = nn.functional.mse_loss(values, returns)
        optimizer_value.zero_grad()
        value_loss.backward()
        optimizer_value.step()

        # Compute policy gradient loss with baseline
        policy_loss = 0
        for log_prob, advantage in zip(log_probs, advantages):
            # TODO completed (one possible choice): minimizing -log pi * (G_t - b(s_t))
            # performs gradient ascent on the baseline-adjusted objective
            policy_loss += -log_prob * advantage

        # Update policy network
        optimizer_policy.zero_grad()
        policy_loss.backward()
        optimizer_policy.step()

        # Logging
        total_reward = sum(rewards)
        print(
            f"Episode {episode + 1}, Total Reward: {total_reward}, Value Loss: {value_loss.item()}"
        )

        # Optional: stop early once an episode reaches a total reward of 195
        # (195 is the CartPole-v0 threshold; CartPole-v1 registers 475)
        if total_reward >= 195:
            print("Environment solved!")
            break

    env.close()
    return policy_net, value_net


# Train the policy network with a baseline
trained_policy, trained_value = train_reinforce_with_baseline()


**Q.8 Test multiple configurations of value networks, on multiple environments ( CartPole, CliffWalking, etc). Analyse the training process and your results;  write a paragraph comparing baseline usage with no baseline usage and the impact of the choice of architecture.**