<a href="https://colab.research.google.com/github/SharmilNK/RL_Labs/blob/main/RL_CH1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Train and evaluate a deep RL agent (PPO) in a stochastic, procedurally generated environment (MiniGrid), identify a failure mode (poor generalization), and propose a mitigation strategy.

In [None]:
!pip install -q minigrid stable-baselines3 gymnasium





In [None]:
import gymnasium as gym
import numpy as np
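import minigrid  # importing registers the MiniGrid-* environment IDs with Gymnasium
from minigrid.wrappers import FlatObsWrapper  # used below to flatten observations for MlpPolicy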
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


Create the environment

Procedurally generated environment: MiniGrid-MultiRoom-N2-S4-v0

In [None]:
env_id = "MiniGrid-MultiRoom-N2-S4-v0"

def make_env(seed):
    env = gym.make(env_id)
    env = FlatObsWrapper(env)
    env.reset(seed=seed)
    return env

Create the PPO agent

In [None]:
env = make_env(seed=0)  # single fixed-seed env; replaced by the multi-seed training env below
model = PPO("MlpPolicy", env, verbose=1)



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Training seeds: 0–99 (100 different procedural layouts)

Test seeds: 100–119 (held-out seeds for evaluation)
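
A held-out evaluation is only meaningful if no test seed also appears in the training set. The following is a minimal sanity check of the split described above; the variable names mirror the ones used later in this notebook.

```python
# Seed ranges as stated above.
train_seeds = list(range(0, 100))    # seeds 0-99
test_seeds = list(range(100, 120))   # seeds 100-119

# Any shared seed would leak a training layout into the test set.
overlap = set(train_seeds) & set(test_seeds)
assert not overlap, f"train/test seeds overlap: {sorted(overlap)}"
print(len(train_seeds), len(test_seeds), len(overlap))
```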

Create Training environment

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv

# Increased train seeds for better diversity and generalization
train_seeds = list(range(0, 100))  # 100 different procedural layouts

class RandomSeedWrapper(gym.Wrapper):
    """On each reset, use a random train seed so each episode sees a different procedural layout."""
    def __init__(self, env, seeds):
        super().__init__(env)
        self.seeds = seeds
    def reset(self, **kwargs):
        seed = int(np.random.choice(self.seeds))
        kwargs.pop("seed", None)  # avoid passing seed twice
        return self.env.reset(seed=seed, **kwargs)

def make_train_env():
    env = gym.make(env_id)
    env = FlatObsWrapper(env)
    env = RandomSeedWrapper(env, train_seeds)
    env.reset(seed=int(np.random.choice(train_seeds)))
    return env

env = DummyVecEnv([make_train_env])


Train PPO agent

In [None]:
# Improved hyperparameters for better learning
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    ent_coef=0.01,  # Entropy bonus for exploration
    learning_rate=3e-4,
    n_steps=2048,  # Steps per environment per update
    batch_size=64,
    n_epochs=10,  # Number of optimization epochs per update
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    tensorboard_log="./ppo_minigrid_tensorboard/"
)

# Train for more timesteps to improve learning and generalization
model.learn(total_timesteps=500000)

Using cpu device
Logging to ./ppo_minigrid_tensorboard/PPO_5


-----------------------------
| time/              |      |
|    fps             | 481  |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 306         |
|    iterations           | 2           |
|    time_elapsed         | 13          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.017017473 |
|    clip_fraction        | 0.17        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.94       |
|    explained_variance   | -2.08       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0379     |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0146     |
|    value_loss           | 0.018       |
-----------------------------------------
----------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x23c04f14750>
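
As a rough budget check, the settings above determine how many rollout/update cycles the run performs. The sketch below is plain arithmetic on the hyperparameters from the training cell. It assumes a single environment (the `DummyVecEnv` above wraps one) and that training proceeds in whole rollouts until the requested timestep count is reached, which is why the final count can slightly exceed 500,000.

```python
import math

total_timesteps = 500_000  # from model.learn(...)
n_steps = 2048             # steps collected per environment per rollout
n_envs = 1                 # DummyVecEnv([make_train_env]) wraps one environment
batch_size = 64
n_epochs = 10

rollout_size = n_steps * n_envs
# Assumption: whole rollouts are collected until total_timesteps is reached.
iterations = math.ceil(total_timesteps / rollout_size)
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs

print(iterations)                    # 245
print(iterations * rollout_size)     # 501760
print(gradient_steps_per_iteration)  # 320
```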

Interpret Results:


| Parameter                | What It Means                            | Typical Range                     | How To Interpret                                                                                        |
| ------------------------ | ---------------------------------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------- |
| **ep_len_mean**          | Average number of steps per episode      | 1 → max episode length (40 here)  | Lower is better. Decreasing over time = agent solving faster. High = wandering/failing.                 |
| **ep_rew_mean**          | Average reward per episode               | 0 → 1 (MiniGrid)                  | Near 0 = failing. ~0.5 = partial success. Near 1 = strong performance. Should increase during training. |
| **fps**                  | Frames per second (training speed)       | 100–1000+                         | Only measures speed. Not related to learning quality.                                                   |
| **iterations**           | Number of PPO training cycles completed  | Increasing integer                | Just progress indicator.                                                                                |
| **time_elapsed**         | Seconds since training started           | Increasing                        | Runtime only.                                                                                           |
| **total_timesteps**      | Total environment steps taken            | 0 → chosen training limit         | Main measure of training progress.                                                                      |
| **approx_kl**            | How much the policy changed in an update | 0.005 – 0.03 typical              | ~0.01–0.02 = healthy. Too high (>0.1) = unstable. Too low = slow learning.                              |
| **clip_fraction**        | Fraction of samples with clipped ratio   | 0.1 – 0.3 typical                 | ~0.2 normal. Near 0 = small updates. Near 1 = overly aggressive updates.                                |
| **clip_range**           | PPO clipping threshold (hyperparameter)  | Usually 0.2                       | Fixed value. Controls update size constraint.                                                           |
| **entropy_loss**         | Negative of mean policy entropy          | Negative number                   | More negative = more random. Should rise slowly toward 0 as the agent learns.                           |
| **explained_variance**   | How well value function predicts returns | -∞ → 1                            | 1 = perfect. 0 = useless. <0 = bad. 0.3–0.8 during training = normal.                                   |
| **learning_rate**        | Step size for neural network updates     | ~0.0003 default                   | Too high = unstable. Too low = slow learning.                                                           |
| **loss**                 | Total combined PPO loss                  | Varies                            | Not very interpretable alone. Used for optimization.                                                    |
| **policy_gradient_loss** | Policy improvement term                  | Small negative value              | Negative is normal. Large magnitude may indicate instability.                                           |
| **value_loss**           | Error in value function prediction       | ≥ 0                               | Lower is better. Large values mean poor value predictions.                                              |
| **n_updates**            | Cumulative optimization epochs run       | Increasing integer                | Grows by n_epochs (10 here) per iteration. Indicates training progress.                                 |
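
To make three of the diagnostics above concrete, here is a small pure-Python sketch of how `approx_kl`, `clip_fraction`, and `explained_variance` can be computed from a batch. The function names are hypothetical, and the formulas follow my understanding of the usual PPO implementations rather than being copied from the library, so treat this as an illustration only.

```python
import math

def ppo_diagnostics(old_log_probs, new_log_probs, clip_range=0.2):
    """Return (approx_kl, clip_fraction) for one batch of sampled actions."""
    log_ratios = [new - old for old, new in zip(old_log_probs, new_log_probs)]
    ratios = [math.exp(lr) for lr in log_ratios]
    # KL estimate: mean((ratio - 1) - log(ratio)); each term is >= 0.
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / len(ratios)
    # Share of samples whose probability ratio left [1 - clip, 1 + clip].
    clip_fraction = sum(abs(r - 1.0) > clip_range for r in ratios) / len(ratios)
    return approx_kl, clip_fraction

def explained_variance(returns, value_preds):
    """1 - Var(returns - predictions) / Var(returns)."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    residuals = [r - v for r, v in zip(returns, value_preds)]
    return 1.0 - var(residuals) / var(returns)

# Old policy gave every sampled action probability 0.5; the new policy shifts them.
old_lp = [math.log(0.5)] * 4
new_lp = [math.log(p) for p in (0.5, 0.55, 0.7, 0.45)]  # ratios 1.0, 1.1, 1.4, 0.9
approx_kl, clip_fraction = ppo_diagnostics(old_lp, new_lp)
print(clip_fraction)  # 0.25: only the ratio-1.4 sample is outside [0.8, 1.2]

returns = [0.0, 1.0, 0.0, 1.0]
print(explained_variance(returns, returns))               # 1.0, perfect predictions
print(explained_variance(returns, [0.5] * 4))             # 0.0, no better than the mean
print(explained_variance(returns, [1.0, 0.0, 1.0, 0.0]))  # -3.0, worse than the mean
```

The last case shows how a negative value such as the -2.08 logged at iteration 2 can arise: early in training the value predictions are worse than a constant guess.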


-------------------------------------------------

Save PPO model

In [None]:
model.save("ppo_minigrid_model")


Evaluate on held-out test seeds

In [None]:
# 1) Sanity check: evaluate on a few TRAIN seeds (agent has seen these layouts)
train_eval_seeds = train_seeds[:5]  # first 5 train seeds
n_episodes_per_seed = 3
train_rewards = []
for seed in train_eval_seeds:
    eval_env = gym.make(env_id)
    eval_env = FlatObsWrapper(eval_env)
    for _ in range(n_episodes_per_seed):
        obs, _ = eval_env.reset(seed=int(seed))
        done, total_rew = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = eval_env.step(action)
            done = terminated or truncated
            total_rew += reward
        train_rewards.append(total_rew)
    eval_env.close()
print(f"Mean reward on TRAIN seeds (sample {train_eval_seeds[0]}-{train_eval_seeds[-1]}): {np.mean(train_rewards):.4f} (+/- {np.std(train_rewards):.4f})")

# 2) Generalization: evaluate on held-out TEST seeds
test_seeds = list(range(100, 120))
rewards_per_seed = []
for seed in test_seeds:
    eval_env = gym.make(env_id)
    eval_env = FlatObsWrapper(eval_env)
    episode_rewards = []
    for _ in range(n_episodes_per_seed):
        obs, _ = eval_env.reset(seed=int(seed))
        done, total_rew = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = eval_env.step(action)
            done = terminated or truncated
            total_rew += reward
        episode_rewards.append(total_rew)
    rewards_per_seed.append(np.mean(episode_rewards))
    eval_env.close()

print(f"Mean reward on TEST seeds ({test_seeds[0]}-{test_seeds[-1]}): {np.mean(rewards_per_seed):.4f} (+/- {np.std(rewards_per_seed):.4f})")
print(f"Min: {np.min(rewards_per_seed):.4f}, Max: {np.max(rewards_per_seed):.4f}")


Mean reward on TRAIN seeds (sample 0-4): 0.8290 (+/- 0.0305)
Mean reward on TEST seeds (100-119): 0.8515 (+/- 0.0417)
Min: 0.7750, Max: 0.9100
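
The train and test loops above are nearly identical, so they could share one helper. The sketch below is one possible refactoring, not the code that produced the numbers above. `evaluate_on_seeds` is a hypothetical name, and it only assumes a Gymnasium-style `reset`/`step`/`close` interface. It returns one mean reward per seed, which would also make the train and test standard deviations directly comparable (the train loop above pools individual episodes instead).

```python
from statistics import mean

def evaluate_on_seeds(make_env, predict, seeds, n_episodes_per_seed=3):
    """Return the mean episode reward for each seed.

    make_env: zero-argument callable returning a Gymnasium-style env.
    predict:  callable mapping an observation to an action.
    """
    per_seed = []
    for seed in seeds:
        env = make_env()
        episode_rewards = []
        for _ in range(n_episodes_per_seed):
            obs, _ = env.reset(seed=int(seed))
            done, total = False, 0.0
            while not done:
                obs, reward, terminated, truncated, _ = env.step(predict(obs))
                done = terminated or truncated
                total += reward
            episode_rewards.append(total)
        env.close()
        per_seed.append(mean(episode_rewards))
    return per_seed

# Possible usage with this notebook's objects (not executed here):
#   make = lambda: FlatObsWrapper(gym.make(env_id))
#   act = lambda obs: model.predict(obs, deterministic=True)[0]
#   test_means = evaluate_on_seeds(make, act, test_seeds)
```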


The first run (training seeds 0–19, 200,000 timesteps, test seeds 20–39) gave the results below:

Mean reward on test seeds (20-39): 0.0000 (+/- 0.0000)
Min: 0.0000, Max: 0.0000

Interpretation:

On every evaluation episode across test seeds 20–39, the agent received zero reward, so it never reached the goal on any of those levels.
This points to strong overfitting: the policy learned the layouts from seeds 0–19 and did not transfer to the new layouts from seeds 20–39. This failure mode is poor generalization.

Mitigation Strategy

1. More training seeds (0–99 instead of 0–19)
Before: The agent saw only 20 different level layouts.
After: It sees 100 different layouts.

2. Longer training (500,000 steps instead of 200,000)
Before: Training stopped earlier.
After: Training runs 2.5× as long. The agent gets more experience and time to learn a good, general policy.

3. Better PPO settings
Before: Mostly default hyperparameters.
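After: Hyperparameters such as learning rate, number of epochs per update, and batch size are set explicitly. Most of the values shown match the Stable-Baselines3 defaults; the main change is the entropy bonus (ent_coef=0.01), which encourages exploration and is intended to make learning more stable and effective.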

4. RandomSeedWrapper
On every new episode, the training environment picks a random seed from 0–99 and resets with that seed.
Each episode can be a different layout, so the agent really trains on many levels instead of repeating the same one.
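
One caveat with this approach: `RandomSeedWrapper` draws from NumPy's global random state, so the order in which layouts appear can differ between runs unless that global state is seeded. A possible variation, sketched here with a hypothetical `SeedSampler` helper that is not part of the original notebook, is to give the sampler its own seeded generator.

```python
import random

class SeedSampler:
    """Hypothetical helper: draws training seeds from a private, seeded RNG."""

    def __init__(self, seeds, rng_seed=0):
        self.seeds = list(seeds)
        self._rng = random.Random(rng_seed)  # independent of any global RNG state

    def sample(self):
        return self._rng.choice(self.seeds)

# Two samplers built with the same rng_seed yield the same layout sequence.
sampler_a = SeedSampler(range(100), rng_seed=42)
sampler_b = SeedSampler(range(100), rng_seed=42)
draws_a = [sampler_a.sample() for _ in range(5)]
draws_b = [sampler_b.sample() for _ in range(5)]
print(draws_a == draws_b)  # True
```

Inside `RandomSeedWrapper.reset`, a call such as `seed = self.sampler.sample()` could then replace `np.random.choice(self.seeds)`.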

Visualize Results

In [None]:
import time
from stable_baselines3 import PPO

model = PPO.load("ppo_minigrid_model")
test_seeds_visual = [100, 105, 110, 115]  # subset of test seeds (100-119)
step_delay = 0.15  # seconds between steps (increase to 0.3 if too fast)

for seed in test_seeds_visual:
    env = gym.make(env_id, render_mode="human")
    env = FlatObsWrapper(env)
    obs, _ = env.reset(seed=int(seed))
    total_rew = 0.0
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        print("action:", action)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_rew += reward
        time.sleep(step_delay)  # pause so you can see: move, toggle door, go through, reach goal
    print(f"Seed {seed} — Episode reward: {total_rew:.4f}")
    env.close()

print("Done. Doors open when agent uses toggle (action 5) while facing them.")

action: 2
action: 0
action: 5
action: 2
action: 2
action: 2
Seed 100 — Episode reward: 0.8650
action: 1
action: 1
action: 2
action: 5
action: 2
action: 2
action: 2
Seed 105 — Episode reward: 0.8425
action: 1
action: 5
action: 2
action: 2
Seed 110 — Episode reward: 0.9100
action: 1
action: 5
action: 2
action: 2
action: 0
action: 2
Seed 115 — Episode reward: 0.8650
Done. Doors open when agent uses toggle (action 5) while facing them.
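
The printed action indices and rewards can be cross-checked by hand. MiniGrid numbers its actions 0 to 6 (left, right, forward, pickup, drop, toggle, done) and rewards a successful episode with `1 - 0.9 * (step_count / max_steps)`. The sketch below decodes the logged actions and recomputes each reward. The value `MAX_STEPS = 40` is an assumption inferred from the logged rewards rather than read from the environment.

```python
# MiniGrid's discrete action indices.
ACTION_NAMES = {0: "left", 1: "right", 2: "forward", 3: "pickup",
                4: "drop", 5: "toggle", 6: "done"}

def success_reward(step_count, max_steps):
    """Reward for reaching the goal after step_count steps."""
    return 1 - 0.9 * (step_count / max_steps)

# Actions and episode rewards copied from the output above.
logged = {
    100: ([2, 0, 5, 2, 2, 2], 0.8650),
    105: ([1, 1, 2, 5, 2, 2, 2], 0.8425),
    110: ([1, 5, 2, 2], 0.9100),
    115: ([1, 5, 2, 2, 0, 2], 0.8650),
}
MAX_STEPS = 40  # assumption: inferred from the logged rewards

results = {}
for seed, (actions, reported) in logged.items():
    names = [ACTION_NAMES[a] for a in actions]
    predicted = success_reward(len(actions), MAX_STEPS)
    results[seed] = (predicted, reported)
    print(seed, names, f"predicted={predicted:.4f}", f"reported={reported:.4f}")
```

Under that assumption all four recomputed rewards match the logged ones, which suggests every visualized episode ended by reaching the goal and that shorter paths earn higher reward.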


With the current setup (100 training seeds, random seed wrapper, 500K steps, entropy bonus), the agent generalized: mean reward on the held-out test seeds (0.8515) is slightly higher than on the sampled training seeds (0.8290), a difference smaller than one standard deviation of either estimate.
Conclusion: The mitigations (diverse training seeds, per-episode randomization, and longer training) helped the agent learn a robust policy that works on held-out test seeds, not just on the training set.