## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to the Assignment 3, Part 1: Introduction to Actor-Critic Methods! It includes the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [11]:
#!pip install numpy
# used in Jupyter to install gymnasium dependencies
#!pip install swig
#!pip install "gymnasium[box2d]"

In [19]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x10f5404f0>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use a separate optimizers for both the Actor and the Critic. In this case, keep the actor and critic losses separate and perform backpropagation.

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [15]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE

# PyTorch Neural Network tutorials:
# https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html
# https://pytorch.org/tutorials/recipes/recipes/defining_a_neural_network.html
# Learned about nn.Sequential here (linked in Piazza post @580): https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

# class SeparateActorCritic inherits class nn.Module https://pytorch.org/docs/stable/generated/torch.nn.Module.html
class SeparateActorCritic(nn.Module):



    # other than self, each parameter represents the dimension of each layer. aka the # of neurons in each layer
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(SeparateActorCritic, self).__init__()

        # actor -------------------------------------

        # nn.Sequential is handy container: https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=1)
        )

        # critic --------------------------
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    # forward propagation; 
    def forward(self, state):
        action_probs = self.actor(state)
        state_value = self.critic(state)
        return action_probs, state_value


#1 ---
batch_size = 100
state_dim = 10
action_dim = 4
hidden_dim = 64
learning_rate = 0.001

actor_critic_instance = SeparateActorCritic(state_dim, action_dim, hidden_dim)

states = torch.randn(batch_size, state_dim)
actions = torch.randint(0, action_dim, (batch_size,))
returns = torch.randn(batch_size)
action_prob_results, state_value_results = actor_critic_instance(states)
values = state_value_results.squeeze()

log_probs = torch.log(action_prob_results)
selected_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze()
entropies = -torch.sum(action_prob_results * log_probs, dim=1)


#2 ---
# https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html -  for preventing backprop
advantages = returns - state_value_results.detach()
actor_loss = -(selected_log_probs * advantages).mean()
print("Actor loss")
print(actor_loss)

#3 ---
critic_loss = F.mse_loss(values, returns.detach())
entropy_bonus = entropies.mean()
print("Critic loss")
print(critic_loss)

#4 ---
# Used Adam optimizer (tip from Piazza post @580) https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
single_optimzer = optimizer = optim.Adam(actor_critic_instance.parameters(), learning_rate)
total_loss = actor_loss + critic_loss - 0.01*entropy_bonus
single_optimzer.zero_grad()
#backpropagation: https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html
single_optimzer.step()


#5 --- two optimizers

actor_optimizer = optim.Adam(actor_critic_instance.actor.parameters(), learning_rate)
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

critic_optimizer = optim.Adam(actor_critic_instance.critic.parameters(), learning_rate)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()


# END_YOUR_CODE

Actor loss
tensor(-2.8464, grad_fn=<NegBackward0>)
Critic loss
tensor(1.1790, grad_fn=<MseLossBackward0>)


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: The motivation behind a setup with separate actor and critic neural networks is that when you train them independently, the training results of each the actor and the critic are prioritized at the same time, so there is no compromising one's results for the other in a training cycle. This will in theory make the actor and critic better at their own jobs, and may be preferred if the task is small enough or your resources are high enough.

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [16]:
# BEGIN_YOUR_CODE

class SharedActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super(SharedActorCritic, self).__init__()

        # base
        self.shared_base = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU()
        )

        # actor head
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=1)
        )

        # critic head
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        shared_features = self.shared_base(state)
        action_probs = self.actor_head(shared_features)
        state_value = self.critic_head(shared_features)
        return action_probs, state_value


batch_size = 100
state_dim = 10
action_dim = 4
hidden_dim = 64
learning_rate = 0.001

instance = SharedActorCritic(state_dim, action_dim, hidden_dim)
optimizer = optim.Adam(instance.parameters(), learning_rate)
# dummy data
states = torch.randn(batch_size, state_dim)
actions = torch.randint(0, action_dim, (batch_size,))
returns = torch.randn(batch_size) * 2 + 10

action_probs_results, state_values_results = instance(states)
# https://pytorch.org/docs/stable/generated/torch.squeeze.html
state_values_results = state_values_results.squeeze()
print("Action probabilities:")
print(action_prob_results)
print("State values:")
print(state_value_results)

log_probs = torch.log(action_probs_results)
# https://pytorch.org/docs/stable/generated/torch.gather.html
selected_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze()
entropies = -torch.sum(action_probs_results * log_probs, dim=1)

#3 --- compute advantage, loses
advantages = returns - state_values_results.detach()
print("Advantage:")
print(advantages)
actor_loss = -(selected_log_probs * advantages).mean()
print("Actor loss")
print(actor_loss)
critic_loss = F.mse_loss(state_values_results, returns.detach())
print("Critic loss")
print(critic_loss)
entropy_bonus = entropies.mean()

# combine them
total_loss = actor_loss + critic_loss - 0.01 * entropy_bonus
print("Combined loss")
print(total_loss)

# backpropagate

optimizer.zero_grad()
total_loss.backward()
optimizer.step()


# END_YOUR_CODE

Action probabilities:
tensor([[0.0194, 0.0130, 0.0115,  ..., 0.0135, 0.0086, 0.0130],
        [0.0188, 0.0094, 0.0140,  ..., 0.0108, 0.0109, 0.0136],
        [0.0196, 0.0109, 0.0243,  ..., 0.0108, 0.0051, 0.0133],
        ...,
        [0.0319, 0.0104, 0.0139,  ..., 0.0139, 0.0051, 0.0085],
        [0.0231, 0.0119, 0.0114,  ..., 0.0144, 0.0075, 0.0111],
        [0.0158, 0.0140, 0.0131,  ..., 0.0117, 0.0099, 0.0159]],
       grad_fn=<SoftmaxBackward0>)
State values:
tensor([[0.4887],
        [0.4767],
        [0.4231],
        [0.4230],
        [0.4780],
        [0.4760],
        [0.3652],
        [0.4767],
        [0.4809],
        [0.3440],
        [0.4649],
        [0.4690],
        [0.4559],
        [0.4208],
        [0.4171],
        [0.5588],
        [0.4767],
        [0.2831],
        [0.4276],
        [0.3350],
        [0.4767],
        [0.4767],
        [0.5399],
        [0.4755],
        [0.4767],
        [0.5066],
        [0.5307],
        [0.5739],
        [0.4767],
        [

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: In contrast, having a shared network will be a more efficient use of resources and samples, but because loss is combined, a desirable/undesirable result in an actor may be overshadowed by a converse result with greater scale in the critic, or vice versa.

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [24]:
class SharedActorCritic(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim, env_is_discrete):
        super(SharedActorCritic, self).__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.env_is_discrete = env_is_discrete

        # base
        self.shared_base = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU()
        )

        # actor head
        # The output layer for the actor should depend on the action space -----
          # For discrete actions: output probabilities using nn.Softmax.
        if self.env_is_discrete:
          self.actor_head = nn.Sequential(
                    nn.Linear(hidden_dim, action_dim),
                    nn.Softmax(dim=-1))
        # For continuous actions: output mean and log std for a Gaussian distribution.
        else:
          self.actor_mean = nn.Linear(hidden_dim, self.action_dim)
          self.actor_logstd = nn.Parameter(torch.zeros(self.action_dim))

        # critic head
        self.critic_head = nn.Linear(hidden_dim, 1)



    def forward(self, state):
        shared_features = self.shared_base(state)

        if self.env_is_discrete:
            action_probs = self.actor_head(shared_features)
            action_dist = torch.distributions.Categorical(action_probs)
        else:
            action_mean = self.actor_mean(shared_features)
            action_std = torch.exp(self.actor_logstd)
            action_dist = torch.distributions.Normal(action_mean, action_std)

        state_value = self.critic_head(shared_features)
        return action_dist, state_value


def create_shared_network(env):
    #observation space
    obs_space = env.observation_space
    if isinstance(obs_space, gym.spaces.Discrete):
        state_dim = obs_space.n
        obs_processor = lambda obs: F.one_hot(torch.tensor(obs), num_classes=state_dim).float()
    elif isinstance(obs_space, gym.spaces.Box):
        state_dim = obs_space.shape[0]
        obs_processor = lambda obs: torch.tensor(obs, dtype=torch.float32)
    else:
        raise("Observation space bad")

    # action space
    act_space = env.action_space
    if isinstance(act_space, gym.spaces.Discrete):
        action_dim = act_space.n
        is_discrete_action = True
    elif isinstance(act_space, gym.spaces.Box):
        action_dim = act_space.shape[0]
        is_discrete_action = False
    else:
        raise("action space bad")

    model = SharedActorCritic(state_dim, hidden_dim, action_dim, is_discrete_action)
    return model, obs_processor

def CliffWalking():
# CliffWalking-v0 (Use one-hot encoding for discrete integer observations.)
  env = gym.make("CliffWalking-v0")
  obs, info = env.reset()
  state_dim = env.observation_space.n
  hidden_dim = 128
  action_dim = env.action_space.n
  env_is_discrete = True
  instance = SharedActorCritic(state_dim,hidden_dim,action_dim,env_is_discrete)

  # One-hot: https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html
  my_tensor = torch.tensor(obs, dtype=torch.long)
  one_hot_results = F.one_hot(my_tensor, state_dim).float().unsqueeze(0)

  with torch.no_grad():
      action_probs, state_values = instance(one_hot_results)
  print("CliffWalking-v0 -------------------")
  print("Action probabilities distribution: ")
  print(action_probs.probs)
  print("State value:")
  print(state_values)

def LunarLander():
# LunarLander-v3 (Standard Box space for observations and discrete actions.)
  env = gym.make("LunarLander-v3")
  obs, info = env.reset()
  state_dim = env.observation_space.shape[0]
  hidden_dim = 128
  action_dim = env.action_space.n
  env_is_discrete = True
  instance = SharedActorCritic(state_dim,hidden_dim,action_dim,env_is_discrete)
  my_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)

  with torch.no_grad():
      action_probs, state_values = instance(my_tensor)

  print("LunarLander-v3 -------------------")
  print("Action probabilities distribution: ")
  print(action_probs.probs)
  print("State values:")
  print(state_values)


tests = [CliffWalking, LunarLander]
for test in tests:
  test()


CliffWalking-v0 -------------------
Action probabilities distribution: 
tensor([[0.2320, 0.2751, 0.2644, 0.2285]])
State value:
tensor([[0.0209]])
LunarLander-v3 -------------------
Action probabilities distribution: 
tensor([[0.2456, 0.2397, 0.2558, 0.2589]])
State values:
tensor([[0.0889]])


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: In this case they will be shared because it will be more efficient to train them

### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [22]:
# BEGIN_YOUR_CODE

def normalize_observation(obs, env):

    if env.observation_space.__class is gym.spaces.Box:
        space = env.observation_space
        if hasattr(space, 'low') and hasattr(space, 'high'):
          # https://numpy.org/doc/2.2/reference/generated/numpy.where.html
            denominator = np.where(space.high - space.low == 0, 1, space.high - space.low)
            normalized = (obs - space.low) / denominator
            return normalized
    return obs

# END_YOUR_CODE

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: Normalizing values helps to improve consistency in training

## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [21]:
# BEGIN_YOUR_CODE

norm_prev = 0.0
for i in actor_critic_instance.parameters():
    if i.grad is not None:
        param_norm = i.grad.data.norm(2)
        norm_prev += param_norm.item() ** 2
norm_prev = norm_prev ** 0.5
print("Gradient norm before clipping:" + str(norm_prev))

torch.nn.utils.clip_grad_norm_(actor_critic_instance.parameters(), max_norm=0.5)

norm_after = 0.0
for j in actor_critic_instance.parameters():
    if j.grad is not None:
        param_norm = j.grad.data.norm(2)
        norm_after += param_norm.item() ** 2
norm_after = norm_after ** 0.5
print("Gradient norm before clipping:" + str(norm_after))


# END_YOUR_CODE

Gradient norm before clipping:1.4824100810824614
Gradient norm before clipping:0.49999966196951495


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
|  Naglis Paunksnis | Task 1 |  100 |
|  Naglis Paunksnis | Task 2 |  100 |
|  Naglis Paunksnis | Task 3 | 100  |
|  Naglis Paunksnis | Task 4 | 100  |
|  Naglis Paunksnis | **Total** | 100  |
