## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to Assignment 3, Part 1: Introduction to Actor-Critic Methods! It covers the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fb3c83fe910>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use separate optimizers for the Actor and the Critic. In this case, keep the actor and critic losses separate and perform backpropagation on each.
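
For reference, the losses described in steps 2 to 4 can be written out as below. This is one common formulation rather than the only valid one. The symbols are assumptions introduced here: $G_i$ is the return, $V(s_i)$ the critic's estimate, $\pi(a_i \mid s_i)$ the actor's probability of the sampled action, $H$ the policy entropy, and $\beta$ an entropy coefficient.

$$
\hat{A}_i = G_i - V(s_i)
$$

$$
L_{\text{actor}} = -\frac{1}{N}\sum_{i=1}^{N} \log \pi(a_i \mid s_i)\,\hat{A}_i,
\qquad
L_{\text{critic}} = \frac{1}{N}\sum_{i=1}^{N} \bigl(V(s_i) - G_i\bigr)^2
$$

$$
L_{\text{total}} = L_{\text{actor}} + L_{\text{critic}} - \beta\,\frac{1}{N}\sum_{i=1}^{N} H\bigl(\pi(\cdot \mid s_i)\bigr)
$$

The advantage $\hat{A}_i$ is treated as a constant in $L_{\text{actor}}$, so the actor loss does not send gradients into the critic.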

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [25]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

### Actor-Critic with separate networks
class SeparateActorCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_size=64):
        super(SeparateActorCritic, self).__init__()

        #Actor is used for mapping state to action probabilities
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_dim),
            nn.Softmax(dim=-1))  #Ensuring output is a valid probability distribution over discrete actions as mentioned

        #Critic for mapping state to scalar value output
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1))

    def forward(self, x):
        action_probs = self.actor(x)
        value = self.critic(x) #state-value estimate from critic
        return action_probs, value


#Dummy training simulation with observation and action space dimensions
#mimicking the CartPole env with 4 features and 2 actions
obs_dim = 4
action_dim = 2
batch_size = 8

#random inputs
dummy_states = torch.randn(batch_size, obs_dim)
model = SeparateActorCritic(obs_dim, action_dim)
print("\n Model has been created successfully with separate actor and critic networks")

#Forward pass propagation
action_probs, values = model(dummy_states)
print("\n Actor output with action probabilities:")
print(action_probs)
print(" Actor outputs valid probabilities ", torch.allclose(action_probs.sum(dim=1), torch.ones(batch_size), atol=1e-3)) #simple verification that each row sums to 1

print("\n Critic outputs with state value estimates:")
print(values)
print(" Critic  will output a scalar per input ", values.shape == (batch_size, 1))

#Build a categorical distribution for sampling actions and calculating log-prob and entropy values
dist = torch.distributions.Categorical(action_probs)
sampled_actions = dist.sample()
log_probs = dist.log_prob(sampled_actions)
entropy = dist.entropy()

print("\n Sampled actions:", sampled_actions.tolist())
print(" Log probabilities:", log_probs)
print(" Entropy values:", entropy)

#Dummy returns for training
dummy_returns = torch.randn(batch_size)
estimated_returns = dummy_returns
advantage = estimated_returns-values.squeeze() #return - value estimate


#Loss Computation of actor critic and total
#here we are using a single optimizer for both actor and critic
optimizer = optim.Adam(model.parameters(), lr=1e-3)

actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = F.mse_loss(values.squeeze(), dummy_returns)
entropy_bonus = 0.01*entropy.mean()  # this is used for encouraging exploration
total_loss = actor_loss+critic_loss -entropy_bonus

print("\n Actor loss:", actor_loss.item())
print("\nCritic loss:", critic_loss.item())
print("\n Total combined loss:", total_loss.item())

#Backward pass
optimizer.zero_grad()
total_loss.backward()
optimizer.step()


# END_YOUR_CODE


 Model has been created successfully with separate actor and critic networks

 Actor output with action probabilities:
tensor([[0.5249, 0.4751],
        [0.4818, 0.5182],
        [0.4736, 0.5264],
        [0.5410, 0.4590],
        [0.4904, 0.5096],
        [0.5208, 0.4792],
        [0.5514, 0.4486],
        [0.4871, 0.5129]], grad_fn=<SoftmaxBackward0>)
 Actor outputs valid probabilities  True

 Critic outputs with state value estimates:
tensor([[0.2498],
        [0.0062],
        [0.1243],
        [0.2787],
        [0.2806],
        [0.0134],
        [0.3893],
        [0.3038]], grad_fn=<AddmmBackward0>)
 Critic  will output a scalar per input  True

 Sampled actions: [1, 1, 1, 1, 0, 1, 0, 0]
 Log probabilities: tensor([-0.7442, -0.6574, -0.6417, -0.7787, -0.7125, -0.7357, -0.5954, -0.7193],
       grad_fn=<SqueezeBackward1>)
 Entropy values: tensor([0.6919, 0.6925, 0.6918, 0.6898, 0.6930, 0.6923, 0.6879, 0.6928],
       grad_fn=<NegBackward0>)

 Actor loss: -0.09635810554027557

Cri

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:
Separate actor and critic networks give the agent more flexibility to learn the policy and the value function independently. This is useful when the features needed for action selection differ from those needed for state evaluation, which can happen in highly stochastic environments. Separate networks also prevent the actor and critic gradients from interfering with each other, which may improve stability. This setup is mainly useful when each role needs its own learning rate, since the actor and the critic each get their own feature extractor, independent of the other.
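
Step 5 of Task 1a (separate optimizers) is not exercised by the cell above, so here is a minimal sketch of that variant. The network sizes, the learning rates, and the 0.01 entropy coefficient are assumed values chosen for illustration, not settings prescribed by the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, action_dim, batch_size = 4, 2, 8

# Two independent networks (hypothetical sizes for illustration).
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# One optimizer per network; the learning rates are assumed values.
actor_opt = optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(batch_size, obs_dim)
returns = torch.randn(batch_size)

dist = Categorical(actor(states))
actions = dist.sample()
values = critic(states).squeeze(-1)

# Detach the advantage so the actor loss does not touch the critic.
advantage = (returns - values).detach()
actor_loss = -(dist.log_prob(actions) * advantage).mean() - 0.01 * dist.entropy().mean()
critic_loss = F.mse_loss(values, returns)

# Each loss is backpropagated and stepped on its own.
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```

Because the advantage is detached, the two computation graphs are disjoint and the two `backward()` calls do not need `retain_graph=True`.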

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [26]:
# BEGIN_YOUR_CODE

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical


#Shared Actor-Critic Class
class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_size=64):
        super(SharedActorCritic, self).__init__()

        #Shared hidden layers structured
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU() )
        #Create two heads: one for actor and one for critic

        #Actor head outputs probabilities distribution
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_size, action_dim),
            nn.Softmax(dim=-1))

        #Critic head outputs state value - scalar
        self.critic_head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        shared_out = self.shared(x)
        action_probs = self.actor_head(shared_out)
        state_value = self.critic_head(shared_out)
        return action_probs, state_value


#dummy data for testing instantiation
obs_dim = 4
action_dim = 2
batch_size = 8

#Dummy inputs
dummy_states = torch.randn(batch_size, obs_dim)

model = SharedActorCritic(obs_dim, action_dim)
action_probs, values = model(dummy_states)

#Converts action_probs to a categorical distribution and samples actions
dist = Categorical(action_probs)
actions = dist.sample()
log_probs = dist.log_prob(actions)
entropy = dist.entropy()

#Dummy rewards/returns and advantage
returns = torch.randn(batch_size)
advantage = returns-values.squeeze()

print("\n Action probabilities:")
print(action_probs)
print("\n Sampled actions:", actions.tolist())

print("\n Log probs:", log_probs)
print(" Entropy:", entropy)
print("\n Estimated values:", values.squeeze())
print(" Dummy returns:", returns)
print(" Computed advantage:", advantage)

#Loss Computation of actor, critic and total
actor_loss = -(log_probs * advantage.detach()).mean()
critic_loss = F.mse_loss(values.squeeze(), returns)
entropy_bonus = 0.01*entropy.mean()
total_loss = actor_loss+critic_loss- entropy_bonus

print("\n Actor Loss:", actor_loss.item())
print(" Critic Loss:", critic_loss.item())
print(" Total Loss:", total_loss.item())

#Backward pass
optimizer = optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()

# END_YOUR_CODE


 Action probabilities:
tensor([[0.5670, 0.4330],
        [0.6325, 0.3675],
        [0.5680, 0.4320],
        [0.5590, 0.4410],
        [0.6758, 0.3242],
        [0.5835, 0.4165],
        [0.5669, 0.4331],
        [0.4981, 0.5019]], grad_fn=<SoftmaxBackward0>)

 Sampled actions: [0, 0, 1, 0, 0, 0, 1, 1]

 Log probs: tensor([-0.5674, -0.4580, -0.8393, -0.5815, -0.3918, -0.5387, -0.8368, -0.6893],
       grad_fn=<SqueezeBackward1>)
 Entropy: tensor([0.6841, 0.6576, 0.6839, 0.6862, 0.6300, 0.6791, 0.6842, 0.6931],
       grad_fn=<NegBackward0>)

 Estimated values: tensor([-0.3125, -0.4235, -0.4264, -0.1971, -0.7151, -0.3151, -0.3972, -0.3350],
       grad_fn=<SqueezeBackward0>)
 Dummy returns: tensor([ 1.7314,  1.0503, -0.5060,  1.4258,  0.3559, -1.2954,  1.1293, -0.4279])
 Computed advantage: tensor([ 2.0440,  1.4738, -0.0795,  1.6230,  1.0710, -0.9803,  1.5265, -0.0930],
       grad_fn=<SubBackward0>)

 Actor Loss: 0.47707635164260864
 Critic Loss: 1.6796519756317139
 Total Loss: 2.1499

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:
A shared network with two heads lets the actor and critic reuse the same features, which may be computationally cheaper and can help generalization in environments with high-dimensional states such as dense images. Because both heads learn from shared signals, computation is efficient, but in noisy environments the policy and value gradients may interfere with each other; separate networks are advisable in that case. Sharing layers also reduces the parameter count, which lowers memory use, reuses features, and can speed up learning.
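
The parameter saving can be checked directly. The sketch below rebuilds both layouts with the same sizes used in Task 1 (4 inputs, 2 actions, 64 hidden units) and counts trainable parameters. The `count_params` helper is a name introduced here for illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters())

obs_dim, action_dim, hidden = 4, 2, 64

# Separate layout: actor and critic each own a hidden layer.
separate = nn.ModuleDict({
    "actor": nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, action_dim)),
    "critic": nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, 1)),
})

# Shared layout: one hidden layer feeds both heads.
shared = nn.ModuleDict({
    "base": nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU()),
    "actor_head": nn.Linear(hidden, action_dim),
    "critic_head": nn.Linear(hidden, 1),
})

print(count_params(separate))  # 835
print(count_params(shared))    # 515
```

With a single small hidden layer the saving is modest. It grows with the size of the shared trunk, which is why sharing matters most for large feature extractors.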

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [4]:
!apt-get update && apt-get install -y swig


In [5]:
!pip install box2d-py==2.3.5
!python -c "import Box2D; print('Box2D imported successfully')"

Box2D imported successfully


In [6]:
!pip install "gymnasium[mujoco]"



In [15]:
import gymnasium as gym
import ale_py
print([env_id for env_id in gym.registry if "Pong" in env_id])


['Pong-v0', 'PongDeterministic-v0', 'PongNoFrameskip-v0', 'Pong-v4', 'PongDeterministic-v4', 'PongNoFrameskip-v4', 'Pong-ram-v0', 'Pong-ramDeterministic-v0', 'Pong-ramNoFrameskip-v0', 'Pong-ram-v4', 'Pong-ramDeterministic-v4', 'Pong-ramNoFrameskip-v4', 'ALE/Pong-v5', 'ALE/Pong-ram-v5']


In [27]:
###Shared Actor-Critic Network (Auto-Adaptive)
import torch
import torch.nn as nn
import gymnasium as gym
import numpy as np
from gymnasium.wrappers import AtariPreprocessing
import ale_py


#Define Auto-Adaptive Actor-Critic Network
class AutoSharedActorCritic(nn.Module):
    def __init__(self, input_dim, action_space): #2-layer MLP mapping input to shared features
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU())

        if isinstance(action_space, gym.spaces.Discrete):
            self.actor_head = nn.Sequential(
                nn.Linear(64, action_space.n),
                nn.Softmax(dim=-1))
            self.is_discrete = True

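        elif isinstance(action_space, gym.spaces.Box):
            self.actor_mean = nn.Linear(64, action_space.shape[0])
            self.actor_logstd = nn.Parameter(torch.zeros(action_space.shape[0]))  #state-independent log std
            self.is_discrete = False

        else:
            raise NotImplementedError(f"Unsupported action space: {action_space}")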

        self.critic_head = nn.Linear(64, 1)

    def forward(self, x): #applies the shared layers, then the critic head estimates V(s)
        shared_out = self.shared(x)
        value = self.critic_head(shared_out)
        if self.is_discrete:
            action_probs = self.actor_head(shared_out) #output action probabilities
            return action_probs, value
        else:
            mean = self.actor_mean(shared_out)
            std = torch.exp(self.actor_logstd)
            return (mean, std), value #output mean and std of the Gaussian distribution


#Auto Network Constructor - builds a model based on the environment structure
def create_shared_network(env):
    obs_space = env.observation_space
    if isinstance(obs_space, gym.spaces.Discrete):
        input_dim = int(obs_space.n)  #one-hot encoded integer observation
    elif isinstance(obs_space, gym.spaces.Box):
        input_dim = int(np.prod(obs_space.shape))  #flattened vector or image
    else:
        raise NotImplementedError(f"Unsupported observation space: {obs_space}")

    return AutoSharedActorCritic(input_dim, env.action_space)


def test_network_output(env, model, env_name): #prints the model outputs for a given environment
    print(f"\n Currently Testing: {env_name}")
    obs, _ = env.reset()

    if isinstance(env.observation_space, gym.spaces.Discrete):
        obs_tensor = torch.zeros(env.observation_space.n)
        obs_tensor[obs] = 1.0
    else:
        obs_tensor = torch.tensor(obs, dtype=torch.float32).flatten()

    model.eval()
    with torch.no_grad():
        actor_out, critic_out = model(obs_tensor)

    if isinstance(env.action_space, gym.spaces.Discrete):
        print(" Actor Output prob:", np.round(actor_out.numpy(), 4))
    else:
        mean, std = actor_out
        print(" Actor Output- Mean:", np.round(mean.numpy(), 4))
        print(" Actor Output- Std Dev:", np.round(std.numpy(), 4))

    print(" Critic Output val:", round(critic_out.item(), 4))

#CliffWalking-v0
env1 = gym.make("CliffWalking-v0")
model1 = create_shared_network(env1)
print(" CliffWalking-v0 created.")
test_network_output(env1, model1, "CliffWalking-v0")

#LunarLander-v3
env2 = gym.make("LunarLander-v3")
model2 = create_shared_network(env2)
print("\n LunarLander-v3 created.")
test_network_output(env2, model2, "LunarLander-v3")

#PongNoFrameskip-v4: ale_py and ROM installation issues were resolved by following the Gymnasium documentation
pong = gym.make("PongNoFrameskip-v4", render_mode="rgb_array", frameskip=1)
env3 = AtariPreprocessing(pong, grayscale_obs=True, scale_obs=True)
model3 = create_shared_network(env3)
print("\n PongNoFrameskip-v4 created and preprocessed.")
test_network_output(env3, model3, "PongNoFrameskip-v4")


#HalfCheetah-v5
env4 = gym.make("HalfCheetah-v5")
model4 = create_shared_network(env4)
print("\n HalfCheetah-v5 created.")
test_network_output(env4, model4, "HalfCheetah-v5")


 CliffWalking-v0 created.

 Currently Testing: CliffWalking-v0
 Actor Output prob: [0.2635 0.2568 0.233  0.2467]
 Critic Output val: 0.0039

 LunarLander-v3 created.

 Currently Testing: LunarLander-v3
 Actor Output prob: [0.2696 0.2486 0.2652 0.2166]
 Critic Output val: -0.0138

 PongNoFrameskip-v4 created and preprocessed.

 Currently Testing: PongNoFrameskip-v4
 Actor Output prob: [0.1776 0.1519 0.1723 0.18   0.1483 0.1699]
 Critic Output val: 0.0415

 HalfCheetah-v5 created.

 Currently Testing: HalfCheetah-v5
 Actor Output- Mean: [-0.0504 -0.0403 -0.1124  0.0041 -0.0766  0.1555]
 Actor Output- Std Dev: [1. 1. 1. 1. 1. 1.]
 Critic Output val: -0.1304


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:
The auto-adaptive design detects the environment type, which saves time when working with Gymnasium environments that have very different state and action shapes. It allows a single actor-critic architecture to generalize across all of the environments. With this automatic setup, the actor head outputs action probabilities for discrete action spaces and the parameters of a Gaussian distribution for continuous ones. For example, CliffWalking-v0 gives an integer observation and expects a discrete action; LunarLander-v3 gives a continuous Box vector and expects a discrete action; HalfCheetah-v5 gives and accepts continuous vectors; and PongNoFrameskip-v4 deals with image frames and discrete actions. If a task needs a custom-made architecture, this approach may have to be avoided.
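
For the continuous case, the `(mean, std)` pair returned by the network still has to be turned into an action and a log-probability. Below is a minimal sketch, assuming a diagonal Gaussian policy with a 6-dimensional action (matching the HalfCheetah-v5 output above). The zero mean and zero log std are placeholder values standing in for real network outputs.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

action_dim = 6                       # matches the 6 means printed for HalfCheetah-v5
mean = torch.zeros(action_dim)       # placeholder for the actor's mean output
log_std = torch.zeros(action_dim)    # placeholder for the learned log std

dist = Normal(mean, log_std.exp())   # independent Gaussian per action dimension
action = dist.sample()

# Sum over dimensions: the joint log-prob of a diagonal Gaussian is the
# sum of the per-dimension log-probs (same for entropy).
log_prob = dist.log_prob(action).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)

print(action.shape, log_prob.item(), entropy.item())
```

The summed `log_prob` plays the same role in the actor loss as `dist.log_prob(actions)` does for the categorical policy in Task 1.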


### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [31]:
#Task 3: Normalize Observations
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing
import ale_py


# Normalization Function
def normalize_observation(obs, env):
    """Normalize an observation: bounded Box spaces are min-max scaled to [0, 1],
    raw image observations are divided by 255, and anything else is returned unchanged."""
    obs_space = env.observation_space

    if isinstance(obs_space, gym.spaces.Box):
        #Bounded Box space: min-max scale each element using its own low/high
        if np.isfinite(obs_space.low).all() and np.isfinite(obs_space.high).all():
            norm_obs = (obs-obs_space.low)/ (obs_space.high-obs_space.low+ 1e-8) #epsilon avoids division by zero
            return norm_obs

        #Unbounded Box holding raw pixel values: divide by 255
        if obs.max() > 1.0:
            return obs/ 255.0

    #Discrete (or already normalized) observation: return unchanged
    return obs

#LunarLander-v3 -Box obs
print("\n Testing LunarLander-v3")
env1 = gym.make("LunarLander-v3")
obs1, _ = env1.reset()

print("Raw obs:", obs1)
norm1 = normalize_observation(obs1, env1)
print("Normalized obs:", norm1)
print("Range:", np.min(norm1), "to", np.max(norm1))

#The elements of the observation vector have different ranges, as the printed values show
#Test: PongNoFrameskip-v4 (image obs)
print("\n Testing PongNoFrameskip-v4")
env2 = AtariPreprocessing(gym.make("PongNoFrameskip-v4"), grayscale_obs=True, scale_obs=True)
obs2, _ = env2.reset()
print("Raw obs shape:", obs2.shape)
norm2 = normalize_observation(obs2, env2)
print("Normalized shape:", norm2.shape)
print("Range:", np.min(norm2), "to", np.max(norm2))




 Testing LunarLander-v3
Raw obs: [ 0.0015069   1.4077027   0.15262283 -0.14300013 -0.00173939 -0.03457131
  0.          0.        ]
Normalized obs: [0.50030136 0.7815405  0.5076312  0.49285    0.49986157 0.4982714
 0.         0.        ]
Range: 0.0 to 0.7815405

 Testing PongNoFrameskip-v4
Raw obs shape: (84, 84)
Normalized shape: (84, 84)
Range: 0.20392157 to 0.9254902


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:
Normalizing observations in RL is important for learning performance because environments return inputs on very different scales: Pong returns image-based inputs, while LunarLander returns real-valued vectors. Normalization makes all input features lie in a similar range. It also helps prevent exploding or vanishing gradients and aids convergence, enabling consistency across varying state representations. Without it, the real-valued vectors from LunarLander and the pixel-based inputs from PongNoFrameskip-v4, which vary in scale, can create instability during training.
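
The LunarLander numbers printed above can be reproduced by hand with the min-max formula `(obs - low) / (high - low)`. The bounds in this sketch are assumptions: they are the values for the first three observation elements that are consistent with the normalized output shown, not values read from the environment here.

```python
import numpy as np

# Assumed bounds for the first three LunarLander-v3 observation elements
# (x, y, x-velocity); consistent with the normalized output above.
low = np.array([-2.5, -2.5, -10.0])
high = np.array([2.5, 2.5, 10.0])

# Raw values copied from the printed observation above.
obs = np.array([0.0015069, 1.4077027, 0.15262283])

# Same formula as normalize_observation, including the small epsilon.
norm = (obs - low) / (high - low + 1e-8)
print(np.round(norm, 4))  # [0.5003 0.7815 0.5076]
```

An observation near the middle of its range lands near 0.5, which is why most normalized LunarLander entries at reset sit close to that value.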

## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [32]:
#Task 4: Gradient Clipping with Shared Actor-Critic

import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
from torch.distributions import Categorical

#Shared Actor-Critic as in Task 1b, redefined here with an explicit hidden_dim argument
class SharedActorCritic(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU())
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_dim, output_dim),
            nn.Softmax(dim=-1))
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        base = self.shared(x)
        action_probs = self.actor_head(base)
        value = self.critic_head(base)
        return action_probs, value


#Setup with Dummy Inputs
input_dim = 4    #Simulated observation size
output_dim = 2   #Two possible discrete actions
hidden_dim = 64 #Hidden layer size

model = SharedActorCritic(input_dim, hidden_dim, output_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

#Dummy state instantiation
state = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
#Simulated dummy return
dummy_return = torch.tensor([[1.5]])

#Forward Pass & Loss- Passing through model
action_probs, value = model(state)
print("\n Action Probabilities:", action_probs.detach().numpy())

#Build a categorical distribution, then compute log prob and advantage
dist = Categorical(probs=action_probs)
action = dist.sample()
print(" Sampled Action:", action.item())
log_prob = dist.log_prob(action)
print(" Log Probability of Chosen Action:", log_prob.item())
advantage = dummy_return - value
print(" Advantage:", advantage.item())

#actor, critic and total Losses
actor_loss = -log_prob*advantage.detach()
critic_loss = nn.MSELoss()(value, dummy_return)
total_loss = actor_loss+critic_loss
print("\n Actor Loss:", actor_loss.item())
print(" Critic Loss:", critic_loss.item())
print(" Total Loss:", total_loss.item())

#Gradient Clipping section
optimizer.zero_grad()
total_loss.backward()

#Before clipping
total_norm_before = torch.norm(
    torch.stack([torch.norm(p.grad.detach()) for p in model.parameters()]))
print("\n Gradient Norm Before Clipping:", round(total_norm_before.item(), 4))

#applying gradient clipping at 0.5
clip_grad_norm_(model.parameters(), max_norm=0.5)
total_norm_after = torch.norm(
    torch.stack([torch.norm(p.grad.detach()) for p in model.parameters()]))
print(" Gradient Norm After Clipping:", round(total_norm_after.item(), 4))

#Final model update
optimizer.step()
print(" Optimizer step completed with clipped gradients operation")



 Action Probabilities: [[0.46755075 0.53244925]]
 Sampled Action: 1
 Log Probability of Chosen Action: -0.6302676796913147
 Advantage: 1.6202468872070312

 Actor Loss: 1.0211892127990723
 Critic Loss: 2.625200033187866
 Total Loss: 3.6463892459869385

 Gradient Norm Before Clipping: 7.9327
 Gradient Norm After Clipping: 0.5
 Optimizer step completed with clipped gradients operation


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:
When training actor-critic networks, gradients may explode, particularly with deep architectures. Gradient clipping guards against this by keeping the overall gradient norm at or below a specified max_norm, such as 0.5. This may help stabilize learning, avoid overshooting updates, and on the whole make training more reliable across environments. In LunarLander or Pong, early training gives highly variable results, and gradient clipping acts as a stabilizer. High-variance rewards, combined with policy-gradient and value-estimation errors, can make learning unstable; clipping at 0.5 scales the total gradient down, and it is recommended for shared architectures.
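
One detail that simplifies the before/after check: `clip_grad_norm_` returns the total gradient norm measured before clipping, so the pre-clip value does not have to be computed by hand. This is a minimal sketch on a small stand-in model. The `nn.Linear` layer and the scaled-up loss are illustrative choices, not the assignment's network.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)

model = nn.Linear(4, 2)  # stand-in model for illustration
# Scale the output up so the gradient is large enough to be clipped.
loss = (model(torch.randn(16, 4)) * 100.0).pow(2).mean()
loss.backward()

# The return value is the total norm before clipping.
norm_before = clip_grad_norm_(model.parameters(), max_norm=0.5)

# Recompute the total norm from the (now clipped) gradients.
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

print(norm_before.item(), norm_after.item())
```

When the pre-clip norm is already below `max_norm`, the gradients are left unchanged.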



If you are working in a team, provide a contribution summary.

Team Member 1 (T1): Pavitran Gnanasekaran (pgnanase)  
Team Member 2 (T2): Rahul Ekambaram (rahuleka)

| Step# | T1 Contribution (%) | T2 Contribution (%) |
|---|---|---|
| Task 1 | 50% | 50% |
| Task 2 | 50% | 50% |
| Task 3 | 50% | 50% |
| Task 4 | 50% | 50% |
| **Total** | 100% | 100% |
