## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to the Assignment 3, Part 1: Introduction to Actor-Critic Methods! It includes the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [30]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque
from gymnasium.spaces import Discrete, Box
# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x108695e90>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use a separate optimizers for both the Actor and the Critic. In this case, keep the actor and critic losses separate and perform backpropagation.

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [31]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic
import torch.optim as optim

# BEGIN_YOUR_CODE
class SeparateActorCritic(nn.Module):
    def __init__(self, inputLayer,numOfActions, hidden=128):
        super().__init__()
        #Actor Network
        self.actor = nn.Sequential(
            nn.Linear(inputLayer,hidden),
            nn.ReLU(),
            nn.Linear(hidden,numOfActions),
            nn.Softmax(dim=-1) #Softmax as output
        )
        #critic network
        self.critic = nn.Sequential(
            nn.Linear(inputLayer,hidden),
            nn.ReLU(),
            nn.Linear(hidden,1) #Single scalar value as output
        )
    def forward(self,x):
        possible_actions=self.actor(x)
        criticValue = self.critic(x)
        return possible_actions,criticValue
# END_YOUR_CODE

#Simulate loss computation and Backpropogation

batchSize = 3
inputLayer =10
numOfActions = 4
net = SeparateActorCritic(inputLayer,numOfActions)

observations = torch.randn(batchSize,inputLayer)

returns = torch.randn(batchSize, requires_grad=True)
values = torch.randn(batchSize, requires_grad=True)
entropies = torch.randn(batchSize, requires_grad=True)
logProbabilities = torch.randn(batchSize,requires_grad=True)

#Advantage 
advantage = returns-values.detach()

#Actor loss
actorLoss = -(logProbabilities * advantage).mean()-0.01 * entropies.mean()

criterion = nn.MSELoss()
#Critic loss
criticLoss = criterion(values,returns)

#Single optimizer
optimizer = optim.Adam(net.parameters(),lr=0.001)
totalLoss = actorLoss +criticLoss
optimizer.zero_grad()
totalLoss.backward(retain_graph=True)
optimizer.step()



#Seperate Optimizers
#model = SeparateActorCritic(inputLayer,numOfActions)
actorOpt = optim.Adam(net.actor.parameters(),lr=0.001)
criticOpt = optim.Adam(net.critic.parameters(),lr =0.001)

#Actor's step
actorOpt.zero_grad()
actorLoss.backward(retain_graph=True)
actorOpt.step()

#Critic's step
criticOpt.zero_grad(set_to_none=True)
criticLoss.backward()
criticOpt.step()

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: This is a classic implementation where we separate Actor and Critic networks. Actor network gives the probabilities over actions, and Critic gives the value -  here we have Asymmetric network for Actor and Critic.
This is prefered in Continuous tasks and helps with better flexibility

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [24]:
# BEGIN_YOUR_CODE

class SharedActorCritic(nn.Module):
  def __init__(self,inputLayer,actionSpace,hiddenLayer=128):
    super().__init__()
    self.sharedBaseLayers = nn.Sequential(
      nn.Linear(inputLayer,hiddenLayer),
      nn.ReLU()
    )
    self.isContinuous = isinstance(actionSpace, gym.spaces.Box)
    if self.isContinuous:
      actionDimenstion = actionSpace.shape[0]
      self.actorMean = nn.Linear(hiddenLayer,actionDimenstion)
      self.actorLogStd = nn.Parameter(
        torch.zeros(
          actionDimenstion
        )
      )
    else:
      self.actorHead = nn.Sequential(
        nn.Linear(
          hiddenLayer,
          actionSpace.n
        ),
        nn.Softmax(dim=-1)
      )
    #Actor's head 
    # self.actorHead = nn.Sequential(
    #   nn.Linear(hiddenLayer,numOfActions),
    #   nn.Softmax(dim=-1)
    # )
    self.criticHead = nn.Linear(hiddenLayer,1) #output state value

  def forward(self,x):
    sharedOutput = self.sharedBaseLayers(x)
    if self.isContinuous:
      mean = self.actorMean(sharedOutput)
      std = self.actorLogStd.exp().expand_as(mean)
      return mean,std,self.criticHead(sharedOutput)
    else:
      actionProbailities = self.actorHead(sharedOutput)
      stateValue = self.criticHead(sharedOutput)
      return actionProbailities,stateValue

#Simulate loss computation and Backpropogation

batchSize = 3
inputLayer =10
#numOfActions = 4
actionSpace = gym.spaces.Discrete(4)
net = SharedActorCritic(inputLayer,actionSpace= actionSpace)

observations = torch.randn(batchSize,inputLayer)
actionProb , stateValues = net(observations)

returns = torch.randn(batchSize, requires_grad=True)
values = torch.randn(batchSize, requires_grad=True)
entropies = torch.randn(batchSize, requires_grad=True)
logProbabilities = torch.randn(batchSize,requires_grad=True)

#Advantage 
advantage = returns-values.detach()

#Actor loss
actorLoss = -(logProbabilities * advantage).mean()-0.01 * entropies.mean()

criterion = nn.MSELoss()
#Critic loss
criticLoss = criterion(values,returns)
totalLoss = actorLoss +criticLoss

#Single optimizer
optimizer = optim.Adam(net.parameters(),lr=0.001)
optimizer.zero_grad()
totalLoss.backward()
optimizer.step()



# END_YOUR_CODE`

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: It is primarily preferred in environment with high dimensional observations, here we have a shared base network for input and output is branched to 2 heads - Actor(Policy o/p) and Critic(Value function).
It is preferred when we have high precision requiring tasks.

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [25]:
import ale_py
#import nn.functional as F
from gymnasium.wrappers import AtariPreprocessing

# BEGIN_YOUR_CODE
def create_shared_network(env, hiddenLayer=128):
  actionSpace= env.action_space 
  observationSpace = env.observation_space
  
  #Check input dimensions 
  if isinstance(observationSpace, gym.spaces.Discrete): inputLayer = observationSpace.n #one hot encoding
  elif isinstance(observationSpace, gym.spaces.Box): inputLayer = int(np.prod(observationSpace.shape))
  else: raise ValueError(f"Environment not supported")

  return SharedActorCritic(inputLayer,actionSpace)
# END_YOUR_CODE
#preprcessing the environment
def preprocessObservation(obs, observationSpace):
  if isinstance(observationSpace, gym.spaces.Discrete):
    oneHotEncoded = np.zeros(observationSpace.n, dtype=np.float32)
    oneHotEncoded[obs] = 1.0
    return torch.tensor(oneHotEncoded).float().unsqueeze(0)
  elif isinstance(observationSpace, gym.spaces.Box):
    flattenedObs = np.array(obs).flatten()
    return torch.tensor(flattenedObs,dtype=torch.float32).unsqueeze(0)
  else:
    raise ValueError(f"Unsupported Environment: {observationSpace}")
  
#Testing on different envs
gym.register_envs(ale_py)
environments = {
  "CliffWalking-v0" : gym.make("CliffWalking-v0"),
  "lunarlander-v3": gym.make("LunarLander-v3"),
  "PongNoFrameskip-v4": gym.make("PongNoFrameskip-v4"),
  #"PongNoFrameskip-v4": AtariPreprocessing(gym.make("PongNoFrameskip-v4")),
  "HalfCheetah-v5": gym.make("HalfCheetah-v5"),
}

print("Testing on varied environments")
for environmentName, environment in environments.items():
  print(f"Testing for {environmentName}")
  net = create_shared_network(environment)
  #print(sharedNetwork)
  observation, _ = environment.reset()
  observationTensor = preprocessObservation(observation, environment.observation_space)
  
  #print(f"Observation shape: {observation.shape}")
  if isinstance(observation, tuple):
    observation = observation[0]
  if isinstance(observation, int):
    print("Observation is an integer, not an array.")
  else:
    print(f"Observation shape: {observation.shape}")
  print(f"Network Architecture: \n{net}")

Testing on varied environments
Testing for CliffWalking-v0
Observation is an integer, not an array.
Network Architecture: 
SharedActorCritic(
  (sharedBaseLayers): Sequential(
    (0): Linear(in_features=48, out_features=128, bias=True)
    (1): ReLU()
  )
  (actorHead): Sequential(
    (0): Linear(in_features=128, out_features=4, bias=True)
    (1): Softmax(dim=-1)
  )
  (criticHead): Linear(in_features=128, out_features=1, bias=True)
)
Testing for lunarlander-v3
Observation shape: (8,)
Network Architecture: 
SharedActorCritic(
  (sharedBaseLayers): Sequential(
    (0): Linear(in_features=8, out_features=128, bias=True)
    (1): ReLU()
  )
  (actorHead): Sequential(
    (0): Linear(in_features=128, out_features=4, bias=True)
    (1): Softmax(dim=-1)
  )
  (criticHead): Linear(in_features=128, out_features=1, bias=True)
)
Testing for PongNoFrameskip-v4
Observation shape: (210, 160, 3)
Network Architecture: 
SharedActorCritic(
  (sharedBaseLayers): Sequential(
    (0): Linear(in_feature

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: Here we try to create a Shared A-C network such that it can adapt to different environments - both continuous and discrete. Our model is flexible and is then tested on multiple environments.
We check different observation space and then generate input layer based on the environment  - It is more aligned towards developing Universal RL model that can handle multiple env types

### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [26]:
# BEGIN_YOUR_CODE
from gymnasium.wrappers import AtariPreprocessing
import ale_py


def normalize_observation(obs,env):
  observationSpace = env.observation_space
  
  if isinstance(observationSpace,gym.spaces.Box): #check observation space type
    #for Atari
    if np.issubdtype(np.integer,obs.dtype): return obs.astype(np.float32)/255.0
    low = observationSpace.low
    high = observationSpace.high
    #using min-max normalization formula
    return (obs -low)/(high-low)
  
  return obs

#Testing 
lunarlanderEnv = gym.make("LunarLander-v3")
observation1, _ = lunarlanderEnv.reset()
normalizedObservation = normalize_observation(observation1, lunarlanderEnv)
print(f"Original  Lunar Lander Observation: {observation1}")
print(f"Lunar Lander Normalized:{normalizedObservation}")

#Testing
#PongEnv
gym.register_envs(ale_py)
pongEnv = gym.make("PongNoFrameskip-v4")
observation2, _ = pongEnv.reset()
normalizedObservation = normalize_observation(observation2, pongEnv)
print(f"Original Pong Observation: {observation2}")
print(f"Pong Normalized:{normalizedObservation}")

  
# END_YOUR_CODE

Original  Lunar Lander Observation: [ 4.2343141e-05  1.4182791e+00  4.2768968e-03  3.2705724e-01
 -4.2300217e-05 -9.6876046e-04  0.0000000e+00  0.0000000e+00]
Lunar Lander Normalized:[0.50000846 0.7836558  0.50021386 0.51635283 0.49999663 0.49995154
 0.         0.        ]
Original Pong Observation: [[[  0   0   0]
  [  0   0   0]
  [  0   0   0]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 [[109 118  43]
  [109 118  43]
  [109 118  43]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 [[109 118  43]
  [109 118  43]
  [109 118  43]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 ...

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]]
Pong Normalized:[[[0.         0.         0.        ]
  [0.         0.

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: We normalize the observations from each of the environments.
This is preferred so that our Agent can receive standard input, and we can have consistent training and convergence as well.
We apply min max normalization over continuous data and normalize pixel data to have better understanding of features and avoid instability due to high contrast regions or so.

## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [27]:
# BEGIN_YOUR_CODE
batchSize = 3
inputLayer =10
numOfActions = 4
#model = SeparateActorCritic(inputLayer,numOfActions)
net = SeparateActorCritic(inputLayer,numOfActions) #1a
#net= SeparateActorCritic(inputLayer,gym.spaces.Discrete(numOfActions)) #1b

optimizer = optim.Adam(net.parameters(),lr=0.001)
#Dummy values 
observations = torch.randn(batchSize,inputLayer)
#Foward pass
actionProbabilities, stateValues = net(observations)
#Dummy values
distibution = torch.distributions.Categorical(actionProbabilities)
actions = distibution.sample()
logProbabilities = distibution.log_prob(actions)
entropies = distibution.entropy()
#Dummy values

returns = torch.randn(batchSize, requires_grad=True)
#Advantage
advantage = returns-stateValues.detach()
#Actor loss
actorLoss = -(logProbabilities * advantage).mean()-0.01 * entropies.mean()
#Critic loss
criterion = nn.MSELoss()
criticLoss = criterion(stateValues.squeeze(-1), returns)
#Total loss
totalLoss = actorLoss + criticLoss
#Backpropagation step
optimizer.zero_grad()
totalLoss.backward()   #gradient computation


print(f"----------------Before Clipping-----------------")

totalGradientBeforeClipping = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=float('inf'))
for parameter in net.parameters():
  if parameter.grad is not None:
    print(f"Parameter: {parameter.shape}, Gradient: {parameter.grad}")
  else:
    print(f"Parameter: {parameter.shape}, Gradient: None")
#print(f"Gradient before clipping: {totalGradientBeforeClipping}")

print(f"----------------After Clipping-----------------")

totalGradientAfterClipping = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.5)
for parameter in net.parameters():
  if parameter.grad is not None:
    print(f"Parameter: {parameter.shape}, Gradient: {parameter.grad}")
  else:
    print(f"Parameter: {parameter.shape}, Gradient: None")
#print(f"Gradient after clipping: {totalGradientAfterClipping.item()}")

optimizer.step() #Optimizer step


----------------Before Clipping-----------------
Parameter: torch.Size([128, 10]), Gradient: tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0129,  0.0087,  0.0187,  ..., -0.0058, -0.0001,  0.0034],
        [-0.0058,  0.0033,  0.0116,  ..., -0.0040, -0.0064,  0.0028],
        ...,
        [ 0.0039, -0.0025, -0.0060,  ...,  0.0019,  0.0009, -0.0012],
        [ 0.0003, -0.0006,  0.0015,  ..., -0.0007, -0.0037,  0.0007],
        [-0.0125,  0.0079,  0.0190,  ..., -0.0064, -0.0029,  0.0036]])
Parameter: torch.Size([128]), Gradient: tensor([ 0.0000e+00,  8.8644e-03,  7.2258e-03,  1.7258e-02, -1.3295e-02,
         9.6985e-03,  2.8820e-03,  9.5387e-04,  5.0316e-05, -4.6311e-04,
         3.7699e-04,  1.2406e-03,  0.0000e+00,  1.8069e-03, -4.0512e-03,
        -3.8925e-03, -3.7481e-03, -1.3704e-03,  0.0000e+00,  0.0000e+00,
        -9.6184e-03,  0.0000e+00, -1.6217e-02, -4.6188e-04, -9.1634e-05,
        -6.6576e-03,  0.0000e+00,  0.0000e+00,  7.3199e-03,  0.0000e

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER: We use Gradient clipping to prevent the major issue of Gradient Explosion that happens during Backpropogation as Actor wants higher rewards and Critic demands accuracy - In GS, the gradient values become too large and we use the code given to have maximum gradient norm clipped to 0.5 and maintain its range. 
It is preferred to ensure our network updates the weights in a stable manner during the training.

If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
|   1  | Task 1 | 100  |
|   2 | Task 2 | 100  |
|   1 | Task 3 | 100  |
|   2 | Task 4 | 100  |
|   | **Total** |   |
