## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to Assignment 3, Part 1: Introduction to Actor-Critic Methods! This part covers the implementation of simple actor and critic networks and the best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [None]:
pip install "gymnasium[mujoco]"

Collecting mujoco>=2.1.5 (from gymnasium[mujoco])
  Downloading mujoco-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (44 kB)
Collecting glfw (from mujoco>=2.1.5->gymnasium[mujoco])
  Downloading glfw-2.9.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Downloading mujoco-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Downloading glfw-2.9.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl (243 kB)
Installing collected packag

In [None]:
pip install swig

Collecting swig
  Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.3.0


In [None]:
pip install "gymnasium[box2d]"

Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... done
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2379447 sha256=3574188d4a675f8466fffffdab628323d7ca0e68e6cf5ba41fd4c2092a38799d
  Stored in directory: /root/.cache/pip/wheels/ab/f1/0c/d56f4a2bdd12bae0a0693ec33f2f0daadb5eb9753c78fa5308
Successfully built box2d-py
Installing collected packages: box2d-py
Successfully installed box2d-py-2.3.5


In [None]:
pip install ale-py



In [None]:
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)

In [None]:
from gymnasium.envs import registry

pong_envs = [env_spec.id for env_spec in registry.values() if 'PongNoFrameskip' in env_spec.id]
print(pong_envs)

['PongNoFrameskip-v0', 'PongNoFrameskip-v4']


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
from collections import deque
import gymnasium as gym
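# Import wrappers from gymnasium (not the legacy gym package);
# FrameStack was renamed FrameStackObservation in Gymnasium 1.0
from gymnasium.wrappers import AtariPreprocessing, FrameStackObservation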
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7a42f1af9c70>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

 -------------------------------

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use separate optimizers for the Actor and the Critic. In this case, keep the actor and critic losses separate and backpropagate each one with its own optimizer.

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---
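
For reference, the losses described in steps 2 to 4 can be written out as follows. This is a sketch of the standard formulation, not a required form: $N$ is the batch size, $R_t$ the return, $V(s_t)$ the critic's estimate, $H$ the policy entropy, and $\beta$ and $c_v$ are assumed weighting coefficients that the task text does not name.

$$
A_t = R_t - V(s_t)
$$

$$
L_{\text{actor}} = -\frac{1}{N}\sum_{t=1}^{N} \log \pi(a_t \mid s_t)\, A_t \;-\; \beta\,\frac{1}{N}\sum_{t=1}^{N} H\big(\pi(\cdot \mid s_t)\big)
$$

$$
L_{\text{critic}} = \frac{1}{N}\sum_{t=1}^{N} \big(V(s_t) - R_t\big)^2, \qquad L_{\text{total}} = L_{\text{actor}} + c_v\, L_{\text{critic}}
$$

The advantage $A_t$ is treated as a constant in $L_{\text{actor}}$, so the policy loss does not send gradients into the critic.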

In [None]:
# 1A CODE BEGINS
# class SeparateActorCritic with separate networks for actor and critic
class SeparateActorCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_size=128):
        super().__init__()
        # Actor network
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_dim),
            nn.Softmax(dim=-1)
        )
        # Critic network
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

    def forward(self, x):
        probs = self.actor(x)
        value = self.critic(x)
        return probs, value

In [None]:
obs_dim = 4
action_dim = 2
model = SeparateActorCritic(obs_dim, action_dim)

In [None]:
actor_optimizer = optim.Adam(model.actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(model.critic.parameters(), lr=1e-3)

In [None]:
# === Actor step ===
dummy_input = torch.rand(8, obs_dim)
action_probs, _ = model(dummy_input)
actions = torch.randint(0, action_dim, (8, 1))
log_probs = torch.log(action_probs.gather(1, actions))
entropies = -torch.sum(action_probs * torch.log(action_probs + 1e-8), dim=1, keepdim=True)
returns = torch.rand(8, 1)
values = model(dummy_input)[1].detach()
advantage = returns - values
actor_loss = -(log_probs * advantage).mean() - 0.01 * entropies.mean()

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

In [None]:
# === Critic step ===
_, values = model(dummy_input)
critic_loss = F.mse_loss(values, returns)

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()

# 1A CODE ENDS
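
The cells above cover the separate-optimizer case (step 5). The single-optimizer case from step 4 could look roughly like the sketch below. It rebuilds two small independent networks inline so that it runs on its own; the hidden width of 64 and the 0.5 weight on the critic loss are assumptions for illustration, not values from the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
obs_dim, action_dim, batch = 4, 2, 8

# Two independent networks, mirroring the separate actor/critic design
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# One optimizer that owns the parameters of both networks
optimizer = optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# Dummy batch
states = torch.rand(batch, obs_dim)
actions = torch.randint(0, action_dim, (batch, 1))
returns = torch.rand(batch, 1)

probs = actor(states)
values = critic(states)
log_probs = torch.log(probs.gather(1, actions) + 1e-8)
entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()

# detach(): the policy loss should not update the critic
advantage = returns - values.detach()
actor_loss = -(log_probs * advantage).mean() - 0.01 * entropy
critic_loss = F.mse_loss(values, returns)
total_loss = actor_loss + 0.5 * critic_loss  # 0.5 is an assumed weighting

# A single backward pass and a single optimizer step update both networks
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```

Because the two networks share no parameters, the combined loss produces the same gradients as two separate backward passes (up to the critic weight); the practical difference is only in how learning rates and optimizer state are managed.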

### Motivation for Using Totally Separate Actor and Critic Networks
With fully independent networks, the actor's policy and the critic's value estimate are learned without any shared parameters or layers. This simple design has several benefits:

**No Interference Between Objectives**- The actor and critic optimize different loss functions (policy gradient vs. value regression). Keeping them entirely separate avoids the risk of the actor's updates degrading the critic's learning, and vice versa.

**Flexible Architecture**- Each network can be designed for its own role: the actor may benefit from layers tailored to exploration, while the critic may need a deeper, more stable network for value regression.

**Useful for Heterogeneous Tasks**- When the policy and the value function need very different inputs, outputs, or network characteristics, separating them is beneficial.

**Easier to Debug or Analyze**- Separate networks can be debugged, analyzed, and tuned independently.

### When is it preferred in practice?

- In small and medium environments where computational efficiency is not the most important factor.
- In settings with high-variance returns, where the critic may benefit from more stable learning that is not disturbed by a rapidly changing policy.
- In basic research settings where interpretability and modifiability are important (e.g., ablation studies).

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [None]:
# 1B CODE BEGINS
# SharedActorCritic class
class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU()
        )
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_size, action_dim),
            nn.Softmax(dim=-1)
        )
        self.critic_head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        base = self.shared(x)
        action_probs = self.actor_head(base)
        value = self.critic_head(base)
        return action_probs, value

In [None]:
obs_dim = 4
action_dim = 2
model = SharedActorCritic(obs_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
# Dummy input
dummy_input = torch.rand(8, obs_dim)
action_probs, values = model(dummy_input)

In [None]:
# Dummy actions and returns
actions = torch.randint(0, action_dim, (8, 1))
log_probs = torch.log(action_probs.gather(1, actions))
returns = torch.rand(8, 1)
entropies = -torch.sum(action_probs * torch.log(action_probs + 1e-8), dim=1, keepdim=True)

In [None]:
# Advantage
advantage = returns - values.detach()

# Losses
actor_loss = -(log_probs * advantage).mean() - 0.01 * entropies.mean()
critic_loss = F.mse_loss(values, returns)
total_loss = actor_loss + critic_loss

# Backpropagation
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
# 1B CODE ENDS
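
The `Categorical` distribution linked in the reading list offers an alternative to the manual `gather`/`log` arithmetic used above. The sketch below uses random logits as a stand-in for the actor head's pre-softmax output (an assumption for illustration) and checks that both routes give the same log-probabilities and entropies.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.randn(8, 2)             # stand-in for the actor head's pre-softmax output
probs = torch.softmax(logits, dim=-1)  # what nn.Softmax would return

dist = Categorical(probs=probs)
actions = dist.sample()                # shape (8,)
log_probs = dist.log_prob(actions)     # shape (8,)
entropies = dist.entropy()             # shape (8,)

# The same quantities computed with the manual formulas
manual_log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
manual_entropies = -(probs * torch.log(probs)).sum(dim=1)

print(torch.allclose(log_probs, manual_log_probs, atol=1e-5))
print(torch.allclose(entropies, manual_entropies, atol=1e-5))
```

Note that `Categorical` returns tensors of shape `(8,)`, whereas the cells above keep a trailing dimension of size 1; an `unsqueeze(1)` is needed before combining them with `(8, 1)` returns or values.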

### Motivation for Using Shared Actor-Critic Network with Two Heads
In this architecture, the actor and critic share a single base network that branches into two specialized heads: the actor head outputs action probabilities (or distribution parameters), and the critic head outputs a value estimate for the state.

**Shared Representation Learning**- The base network can learn features that are useful for both policy learning and value learning. This is especially important in environments where the state features are complex or high-dimensional (e.g., images, robotics).

**Efficiency in Parameters**- Sharing the base reduces the total number of network parameters compared to using separate networks. This matters when running agents on resource-constrained systems or in large environments (e.g., Atari, MuJoCo).

**Improving Generalization**- Learning shared features encourages a consistent internal representation of the environment, which can improve the stability and convergence of both the policy and the value estimate. This is relevant in environments with dense, continuous, or image-based observations.

**Simplified Training Pipeline**- A single optimizer and a single forward pass through the base network are often sufficient. Reusing one forward pass for both heads is faster and more efficient, especially when working with libraries built for multi-task RL settings.
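
To make the parameter-efficiency point concrete, the sketch below counts parameters for the two designs, using the same layer sizes as Tasks 1a and 1b (`obs_dim=4`, `action_dim=2`, 128 hidden units). The helper `count_params` is a hypothetical name introduced here for illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

obs_dim, action_dim, hidden = 4, 2, 128

# Separate design: each network has its own hidden layer
separate_actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, action_dim))
separate_critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

# Shared design: one hidden layer feeding two linear heads
shared_base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
actor_head = nn.Linear(hidden, action_dim)
critic_head = nn.Linear(hidden, 1)

n_separate = count_params(separate_actor) + count_params(separate_critic)
n_shared = count_params(shared_base) + count_params(actor_head) + count_params(critic_head)
print(n_separate, n_shared)  # 1667 1027
```

The saving equals exactly one hidden layer (640 parameters here). With a 4-dimensional observation that is modest, but the first layer grows with the input size, so the gap widens considerably for flattened image inputs.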

### When is it preferred in practice?

- With high-dimensional or complex inputs (e.g., image-based observations from Atari or robotics sensors) and when computational efficiency is important (e.g., real-time agents).
- When policy learning and value learning rely on largely shared features (i.e., have a similar reward structure).
- During prototyping, or when using modular libraries (e.g., TensorFlow Agents, Stable Baselines, RLlib).

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [None]:
class SharedActorCriticAuto(nn.Module):
    def __init__(self, obs_dim, action_space, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.ReLU()
        )
        self.is_discrete = isinstance(action_space, gym.spaces.Discrete)
        self.is_box = isinstance(action_space, gym.spaces.Box)

        if self.is_discrete:
            self.actor = nn.Sequential(
                nn.Linear(hidden_size, action_space.n),
                nn.Softmax(dim=-1)
            )
        elif self.is_box:
            action_dim = action_space.shape[0]
            self.actor_mean = nn.Linear(hidden_size, action_dim)
            self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            raise NotImplementedError("Unsupported action space")

        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        base = self.shared(x)
        value = self.critic(base)
        if self.is_discrete:
            action_probs = self.actor(base)
            return action_probs, value
        else:
            mean = self.actor_mean(base)
            log_std = self.actor_log_std.expand_as(mean)
            return (mean, log_std), value
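
For the continuous branch, the `(mean, log_std)` pair still has to be turned into a distribution before actions can be sampled. One possible way, sketched below with `torch.distributions.Normal`, uses random tensors as stand-ins for the network outputs; the batch size and the action dimension of 6 (HalfCheetah's action size) are illustrative choices.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
batch, action_dim = 8, 6

mean = torch.randn(batch, action_dim)              # stand-in for actor_mean(base)
log_std = torch.zeros(action_dim).expand_as(mean)  # state-independent log std, as in the model

# exp() maps the unconstrained log std to a strictly positive std
dist = Normal(mean, log_std.exp())
actions = dist.sample()                                        # (batch, action_dim)

# Action dimensions are independent, so the joint log-prob is the sum over dimensions
log_probs = dist.log_prob(actions).sum(dim=-1, keepdim=True)   # (batch, 1)
entropies = dist.entropy().sum(dim=-1, keepdim=True)           # (batch, 1)

print(actions.shape, log_probs.shape, entropies.shape)
```

Parameterizing the log of the standard deviation keeps the optimizer free to move the parameter anywhere on the real line. Sampled actions are unbounded, so an environment with a bounded `Box` action space typically needs clipping or a squashing function on top of this.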

In [None]:
def preprocess_obs(env, obs):
    if isinstance(env.observation_space, gym.spaces.Discrete):
        one_hot = np.zeros(env.observation_space.n, dtype=np.float32)
        one_hot[obs] = 1.0
        return torch.tensor(one_hot).unsqueeze(0)
    elif isinstance(env.observation_space, gym.spaces.Box):
        obs = np.array(obs, dtype=np.float32)
        if obs.ndim > 1:
            obs = obs.flatten()  # For images or frame stacks
        return torch.tensor(obs).unsqueeze(0)
    else:
        raise NotImplementedError("Unsupported observation space")

In [None]:
def create_shared_network(env):
    if isinstance(env.observation_space, gym.spaces.Discrete):
        obs_dim = env.observation_space.n
    elif isinstance(env.observation_space, gym.spaces.Box):
        obs_dim = int(np.prod(env.observation_space.shape))
    else:
        raise NotImplementedError("Unsupported observation space")

    return SharedActorCriticAuto(obs_dim, env.action_space)

In [None]:
env_ids = ["CliffWalking-v0", "LunarLander-v3", "HalfCheetah-v5"]

for env_id in env_ids:
    print(f"\n=== Testing {env_id} ===")
    env = gym.make(env_id)

    model = create_shared_network(env)

    obs = env.reset()
    if isinstance(obs, tuple):
        obs = obs[0]

    processed_obs = preprocess_obs(env, obs)

    with torch.no_grad():
        output = model(processed_obs)

    print(f"Output type: {type(output)}")
    if isinstance(output, tuple):
        print(f"Actor output shape: {output[0] if isinstance(output[0], torch.Tensor) else 'Gaussian tuple'}")
        print(f"Critic output shape: {output[1].shape}")


=== Testing CliffWalking-v0 ===
Output type: <class 'tuple'>
Actor output shape: tensor([[0.2492, 0.2529, 0.2381, 0.2597]])
Critic output shape: torch.Size([1, 1])

=== Testing LunarLander-v3 ===
Output type: <class 'tuple'>
Actor output shape: tensor([[0.2595, 0.2355, 0.2647, 0.2402]])
Critic output shape: torch.Size([1, 1])

=== Testing HalfCheetah-v5 ===
Output type: <class 'tuple'>
Actor output shape: Gaussian tuple
Critic output shape: torch.Size([1, 1])


In [None]:
env_id = 'PongNoFrameskip-v4'
print(f"\n=== Testing {env_id} ===")

env = gym.make(env_id)

model = create_shared_network(env)

obs = env.reset()
if isinstance(obs, tuple):
    obs = obs[0]

processed_obs = preprocess_obs(env, obs)

with torch.no_grad():
    output = model(processed_obs)

print(f"Output type: {type(output)}")
if isinstance(output, tuple):
    print(f"Actor output shape: {output[0] if isinstance(output[0], torch.Tensor) else 'Gaussian tuple'}")
    print(f"Critic output shape: {output[1].shape}")


=== Testing PongNoFrameskip-v4 ===
Output type: <class 'tuple'>
Actor output shape: tensor([[6.2994e-01, 1.3675e-23, 3.7006e-01, 1.4152e-20, 3.7782e-24, 7.9731e-28]])
Critic output shape: torch.Size([1, 1])


### Motivation for Using Auto-Adaptive Network Setup
An auto-adaptive network setup builds neural network architectures that adapt to the observation and action spaces of any environment. Instead of hardcoding input and output dimensions, the construction code inspects the environment and sizes the network accordingly.

**Generalization Across Environments**- The same agent code can be reused across multiple environments (e.g., Atari, MuJoCo, GridWorld) without rewriting the model architecture. This encourages modularity and cleaner code.

**Faster Prototyping and Testing**- Because the network adapts automatically to the environment structure, researchers can try new tasks or environments quickly and reliably. This is especially useful in automated pipelines and hyperparameter sweeps.

**Robustness Against Variation**

*Works with different types of observations*: Discrete observations → one-hot encoded. Box (vector/image) observations → flattened or normalized.

*Handles both discrete and continuous actions*: Discrete actions → probabilities. Continuous actions → Gaussian parameters.

**Essential for Multi-Task or Meta-RL**- When an agent is trained on multiple tasks with different spaces, this setup allows scaling across tasks with only minimal code changes.

### When is it preferred in practice?

- Within frameworks or libraries that support many environments (e.g., OpenAI Gym, Meta-RL).
- In course projects or research that benchmarks multiple tasks.
- In production or robotics settings where environments evolve and the network needs dynamic reconfiguration.
- When designing generalizable RL agents that are not hardwired to a specific environment.

## Section 3: Observation Normalization

### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [None]:
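def normalize_observation(obs, env):
    space = env.observation_space

    if isinstance(space, gym.spaces.Box):
        obs = np.asarray(obs)

        # Image observations (integer pixels in [0, 255]): scale to [0, 1].
        # The dtype must be checked before any float cast, otherwise this branch never runs.
        if np.issubdtype(obs.dtype, np.integer):
            return obs / 255.0

        if hasattr(space, "low") and hasattr(space, "high"):
            obs = obs.astype(np.float32)
            low, high = space.low, space.high
            # Rescale only dimensions with finite, non-degenerate bounds;
            # this avoids divide-by-zero and NaNs from infinite bounds
            valid = np.isfinite(low) & np.isfinite(high) & (high > low)
            scale = np.where(valid, high - low, 1.0)
            offset = np.where(valid, low, 0.0)
            return (obs - offset) / scale

    # Discrete (or otherwise unsupported) observations are returned unchanged
    return obs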

In [None]:
env_id = "LunarLander-v3"
env = gym.make(env_id)
obs, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
norm_obs = normalize_observation(obs, env)

print(f"\n=== {env_id} ===")
print(f"Original dtype: {np.array(obs).dtype}, shape: {np.array(obs).shape}")
print(f"Normalized dtype: {norm_obs.dtype}, shape: {norm_obs.shape}")
print(f"Min: {norm_obs.min():.4f}, Max: {norm_obs.max():.4f}")


=== LunarLander-v3 ===
Original dtype: float32, shape: (8,)
Normalized dtype: float32, shape: (8,)
Min: 0.0000, Max: 0.7819


In [None]:
env_id = "PongNoFrameskip-v4"
env = gym.make(env_id)
obs, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
norm_obs = normalize_observation(obs, env)

print(f"\n=== {env_id} ===")
print(f"Original dtype: {np.array(obs).dtype}, shape: {np.array(obs).shape}")
print(f"Normalized dtype: {norm_obs.dtype}, shape: {norm_obs.shape}")
print(f"Min: {norm_obs.min():.4f}, Max: {norm_obs.max():.4f}")


=== PongNoFrameskip-v4 ===
Original dtype: uint8, shape: (210, 160, 3)
Normalized dtype: float64, shape: (210, 160, 3)
Min: 0.0000, Max: 0.8941


### Motivation for Observation Normalization
Adding an observation normalization function alongside the auto-adaptive network architecture is a good approach in reinforcement learning (RL) when observation spaces vary in scale and type across environments of differing complexity.

**Make Learning More Stable**- Observations from different environments, even when they are of a similar type, can have features with massively different value ranges.

Examples include (but are not limited to):
- LunarLander (continuous values in varying ranges: position, velocity, angle, etc.).
- Pong (pixel intensities in the range 0-255).
- CliffWalking (discrete states represented as integers).

Without a normalization mechanism, it quickly becomes hard for a model to learn how input features interact when their values span very different scales. Neural networks struggle to learn when input feature scales are inconsistent, and gradient updates can become unstable, leading to poor convergence or none at all. Normalization maps input values to a standard range, typically [0, 1] or [-1, 1], which gives the network a more stable input structure to learn from.

**Generalizing Performance Across Auto-Adaptive Networks**- Our `create_shared_network(env)` adapts the model to the details of arbitrary environments. To generalize across varied environments, the pipeline must also abstract away differences in observation scale. Normalization ensures consistent input behavior and therefore a more reusable, environment-agnostic shared network.

**Avoid Implicit Bias**- Without normalized inputs, features with larger values dominate learning (for example, velocities vs. the contact booleans in LunarLander). This biases the model toward some state dimensions. Normalization puts all input features on an equal footing and lets the agent learn which ones matter.
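
As a small worked example of the mapping described above, the sketch below applies min-max scaling to a three-feature observation. The bounds and the observation are hypothetical values chosen for easy arithmetic; they are not taken from any particular environment.

```python
import numpy as np

# Hypothetical per-feature bounds and one observation
low = np.array([-2.0, -10.0, 0.0], dtype=np.float32)
high = np.array([2.0, 10.0, 1.0], dtype=np.float32)
obs = np.array([1.0, -5.0, 1.0], dtype=np.float32)

unit = (obs - low) / (high - low)  # maps [low, high] to [0, 1]
symmetric = 2.0 * unit - 1.0       # maps [low, high] to [-1, 1]

print(unit)       # values: 0.75, 0.25, 1.0
print(symmetric)  # values: 0.5, -0.5, 1.0
```

After scaling, the second feature (raw range of 20) and the third (raw range of 1) occupy the same interval, so neither dominates the first layer's pre-activations purely because of its units.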

### When is it preferred in practice?

- For continuous control (e.g., HalfCheetah, LunarLander), where large variation across feature scales affects learning.
- For image-based RL (e.g., Pong, Breakout), where pixel intensities are normalized into [0, 1].
- Across multiple environments (e.g., meta-RL, AutoRL setups), for consistent behavior across contexts.

## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html

In [None]:
import torch
import torch.nn.functional as F
obs_dim = 10
action_dim = 4
model = SharedActorCriticAuto(obs_dim, gym.spaces.Discrete(action_dim))
actor_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
critic_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dummy_input = torch.rand(8, obs_dim)
action_probs, _ = model(dummy_input)
actions = torch.randint(0, action_dim, (8, 1))
log_probs = torch.log(action_probs.gather(1, actions))
entropies = -torch.sum(action_probs * torch.log(action_probs + 1e-8), dim=1, keepdim=True)
returns = torch.rand(8, 1)
values = model(dummy_input)[1].detach()
advantage = returns - values
actor_loss = -(log_probs * advantage).mean() - 0.01 * entropies.mean()

def total_grad_norm(parameters):
    # L2 norm over all existing gradients, computed without modifying them
    grads = [p.grad.detach().flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm(2)

actor_optimizer.zero_grad()
actor_loss.backward()
# clip_grad_norm_ returns the total norm measured *before* clipping,
# so the post-clipping norm has to be recomputed from the gradients
total_norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
print(f"[Actor] Gradient norm before clipping: {total_norm_before:.4f}")
print(f"[Actor] Gradient norm after clipping:  {total_grad_norm(model.parameters()):.4f}")

actor_optimizer.step()
_, values = model(dummy_input)
critic_loss = F.mse_loss(values, returns)

critic_optimizer.zero_grad()
critic_loss.backward()
total_norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
print(f"[Critic] Gradient norm before clipping: {total_norm_before:.4f}")
print(f"[Critic] Gradient norm after clipping:  {total_grad_norm(model.parameters()):.4f}")

critic_optimizer.step()

[Actor] Gradient norm before clipping: 0.3768
[Actor] Gradient norm after clipping:  0.3768
[Critic] Gradient norm before clipping: 2.6152
[Critic] Gradient norm after clipping:  0.5000


### Motivation for Gradient Clipping
**Prevent Exploding Gradients**- Very large gradients produce unstable updates to the model parameters during backpropagation. This can cause:
- NaN losses,
- blown-up model weights,
- dramatic divergence in training.

Clipping caps the gradient norm at a maximum value, keeping updates in a safe range and promoting more controlled, stable learning.

**Stabilize Actor-Critic Learning**- In RL (especially actor-critic algorithms), the actor and critic depend heavily on each other. If the gradients in one of the two networks explode, this can destabilize the other. Clipping the gradients helps avoid feedback loops of instability between the actor and the critic.

**Better Optimization Dynamics**- Large gradient norms can cause the optimizer to overshoot a local minimum. Clipping acts as a damping effect, which is particularly useful when using adaptive optimizers such as Adam.

### When is it preferred in practice?

- In PPO, A2C, and DDPG, to increase stability when learning from noisy or unstable returns.
- In recurrent neural networks (LSTMs/GRUs), because RNNs are highly prone to exploding gradients.
- In large or deep networks, because the risk of exploding gradients grows with depth.
- With sparse or high-variance rewards, because gradient magnitudes can spike without a clear cause.
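
To show what norm-based clipping does numerically, the sketch below rescales two hypothetical gradient tensors by hand and then lets `torch.nn.utils.clip_grad_norm_` clip the same gradients. The manual formula (a shared factor of `max_norm / (total_norm + 1e-6)`, capped at 1) reflects my understanding of PyTorch's behavior and should be read as an illustration, not as the library's source.

```python
import torch

max_norm = 0.5

# Hypothetical gradients with a known global L2 norm of 5.0 (a 3-4-5 triangle)
grads = [torch.tensor([3.0, 0.0]), torch.tensor([0.0, 4.0])]

# Manual global-norm clipping: every gradient is rescaled by the same factor
total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
scale = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)  # never scales gradients up
clipped = [g * scale for g in grads]
clipped_norm = torch.sqrt(sum(g.pow(2).sum() for g in clipped))

# The same gradients attached to parameters and clipped by PyTorch
params = [torch.nn.Parameter(torch.zeros(2)) for _ in grads]
for p, g in zip(params, grads):
    p.grad = g.clone()
returned_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
param_norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

print(f"manual:  before={total_norm.item():.4f}, after={clipped_norm.item():.4f}")
# manual:  before=5.0000, after=0.5000
print(f"pytorch: returned={returned_norm.item():.4f}, after={param_norm_after.item():.4f}")
# pytorch: returned=5.0000, after=0.5000
```

Two points follow from this: the value returned by `clip_grad_norm_` is the norm before clipping, and because all gradients share one scale factor, the direction of the overall update is preserved while its length is capped.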

If you are working in a team, provide a contribution summary.

| Team Member | Step# | Contribution (%) |
|---|---|---|
| Lalasa | Task 1 | 100 |
| Lalasa | Task 2 | 100 |
| Lalasa | Task 3 | 100 |
| Lalasa | Task 4 | 100 |
| Both | **Total** | 100 |

Part 2: Kanisha - 100%