# Tutorial 4: Building the Environment

**Goal:** Create a PettingZoo-compatible environment using HERON's adapter.

**Time:** ~15 minutes

---

## The Environment's Role

The environment **orchestrates** agents and **interfaces** with RL frameworks:

```
RLlib / StableBaselines3
    │
    └── PettingZoo API (step, reset, observe)
            │
            └── HERON Environment (PettingZooParallelEnv)
                    │
                    ├── Microgrid 0 (CoordinatorAgent)
                    │   ├── Battery (FieldAgent)
                    │   └── Generator (FieldAgent)
                    ├── Microgrid 1 ...
                    └── Microgrid 2 ...
```

**Key pattern:** Use `PettingZooParallelEnv` (not raw `ParallelEnv`) to get HERON features.

## Step 1: Define Our Simple Agents

First, let's recreate our agents from Tutorial 3 in a compact form.

In [None]:
import numpy as np
from typing import Any, Dict, List, Optional, Tuple
from dataclasses import dataclass
from gymnasium.spaces import Box

from heron.core.feature import FeatureProvider
from heron.core.state import FieldAgentState
from heron.core.action import Action
from heron.agents.field_agent import FieldAgent
from heron.agents.coordinator_agent import CoordinatorAgent
from heron.protocols.vertical import SetpointProtocol


# === Features ===
@dataclass
class BatterySOC(FeatureProvider):
    visibility = ['owner', 'upper_level']
    soc: float = 0.5
    
    def vector(self) -> np.ndarray:
        return np.array([self.soc], dtype=np.float32)
    
    def names(self): return ['soc']
    def to_dict(self): return {'soc': self.soc}
    @classmethod
    def from_dict(cls, d): return cls(**d)
    def set_values(self, **kw): 
        if 'soc' in kw: self.soc = float(kw['soc'])


@dataclass
class GenOutput(FeatureProvider):
    visibility = ['owner', 'upper_level', 'system']
    p_mw: float = 0.0
    p_max: float = 5.0
    
    def vector(self) -> np.ndarray:
        return np.array([self.p_mw / self.p_max], dtype=np.float32)
    
    def names(self): return ['p_norm']
    def to_dict(self): return {'p_mw': self.p_mw, 'p_max': self.p_max}
    @classmethod
    def from_dict(cls, d): return cls(**d)
    def set_values(self, **kw): 
        if 'p_mw' in kw: self.p_mw = float(kw['p_mw'])


# === Field Agents ===
class SimpleBattery(FieldAgent):
    def __init__(self, agent_id: str, capacity: float = 2.0, upstream_id: Optional[str] = None):
        self.capacity = capacity
        super().__init__(agent_id=agent_id, upstream_id=upstream_id, config={'name': agent_id})
        self.state = FieldAgentState(owner_id=agent_id, owner_level=1)
        self.state.register_feature('soc', BatterySOC(soc=0.5))
        self.action_space = Box(-1, 1, (1,), np.float32)
        self.observation_space = Box(-np.inf, np.inf, (1,), np.float32)
    
    def observe(self, gs=None): return self.state.vector()
    def step(self, action, dt=1.0):
        power = float(action[0]) * 0.5  # Max 0.5 MW
        soc = self.state.features['soc'].soc + power * dt / self.capacity
        self.state.features['soc'].soc = np.clip(soc, 0.1, 0.9)
        return {'power_mw': power, 'soc': self.state.features['soc'].soc}
    def reset(self, seed=None):
        self.state.features['soc'].soc = 0.5
        return self.observe()


class SimpleGen(FieldAgent):
    def __init__(self, agent_id: str, p_max: float = 5.0, cost: float = 50.0, upstream_id: Optional[str] = None):
        self.p_max = p_max
        self.cost = cost
        super().__init__(agent_id=agent_id, upstream_id=upstream_id, config={'name': agent_id})
        self.state = FieldAgentState(owner_id=agent_id, owner_level=1)
        self.state.register_feature('output', GenOutput(p_mw=0, p_max=p_max))
        self.action_space = Box(0, 1, (1,), np.float32)
        self.observation_space = Box(-np.inf, np.inf, (1,), np.float32)
    
    def observe(self, gs=None): return self.state.vector()
    def step(self, action, dt=1.0):
        p = float(action[0]) * self.p_max
        self.state.features['output'].p_mw = p
        return {'power_mw': p, 'cost': p * dt * self.cost}
    def reset(self, seed=None):
        self.state.features['output'].p_mw = 0
        return self.observe()


# === Coordinator Agent (Microgrid) ===
class SimpleMicrogrid(CoordinatorAgent):
    def __init__(self, agent_id: str, load: float = 3.0, upstream_id: Optional[str] = None):
        self.load = load
        super().__init__(agent_id=agent_id, upstream_id=upstream_id, protocol=SetpointProtocol())
        
        # Create subordinates
        self.battery = SimpleBattery(f'{agent_id}_bat', upstream_id=agent_id)
        self.gen = SimpleGen(f'{agent_id}_gen', upstream_id=agent_id)
        self.subordinates = {self.battery.agent_id: self.battery, self.gen.agent_id: self.gen}
        
        # Spaces: observe [bat_soc, gen_out, load], act [bat_setpoint, gen_setpoint]
        self.observation_space = Box(-np.inf, np.inf, (3,), np.float32)
        self.action_space = Box(np.array([-1, 0]), np.array([1, 1]), dtype=np.float32)
    
    def observe(self, gs=None) -> np.ndarray:
        bat_soc = self.battery.state.features['soc'].soc
        gen_out = self.gen.state.features['output'].p_mw / self.gen.p_max
        return np.array([bat_soc, gen_out, self.load / 10], dtype=np.float32)
    
    def step(self, action: np.ndarray, dt: float = 1.0) -> Dict:
        bat_res = self.battery.step(action[0:1], dt)
        gen_res = self.gen.step(action[1:2], dt)
        net_power = gen_res['power_mw'] - bat_res['power_mw']
        imbalance = abs(self.load - net_power)
        return {'net_power': net_power, 'imbalance': imbalance, 'cost': gen_res['cost']}
    
    def reset(self, seed=None):
        self.battery.reset(seed)
        self.gen.reset(seed)
        return self.observe()


print("Agents defined!")
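
The battery update in `SimpleBattery.step` is easy to trace by hand. The sketch below reproduces the same arithmetic in plain Python (no HERON or NumPy) and additionally computes the power that actually fits within the SOC limits, which the tutorial's `step` does not report. `soc_update` is a hypothetical helper written for this illustration.

```python
def soc_update(soc, action, capacity=2.0, max_power=0.5, dt=1.0, lo=0.1, hi=0.9):
    """Mirror SimpleBattery.step: positive action charges, SOC is clipped to [lo, hi]."""
    requested = action * max_power            # MW asked for by the action
    target = soc + requested * dt / capacity  # unclipped SOC
    new_soc = min(max(target, lo), hi)        # same effect as np.clip(target, lo, hi)
    # Assumption for illustration: power that actually fits inside the SOC limits
    delivered = (new_soc - soc) * capacity / dt
    return new_soc, requested, delivered


soc_1, requested_1, delivered_1 = soc_update(0.5, 1.0)    # first full-power charge step
soc_2, requested_2, delivered_2 = soc_update(soc_1, 1.0)  # second step hits the upper limit
print(soc_1, delivered_1)
print(soc_2, round(delivered_2, 3))
```

Starting from SOC 0.5, one full-power charge step reaches 0.75. A second step would reach 1.0 and is clipped to 0.9, so only about 0.3 MW of the requested 0.5 MW is absorbed. `SimpleBattery.step` returns the requested power in `power_mw` either way, which is worth keeping in mind when reading `net_power`.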

## Step 2: Build the Multi-Agent Environment

HERON provides `PettingZooParallelEnv` - an adapter that combines:
- PettingZoo's `ParallelEnv` interface (for RL framework compatibility)
- HERON's `HeronEnvCore` mixin (for agent management, event-driven execution)

**Why use HERON's adapter instead of raw ParallelEnv?**
- Built-in agent registration (`register_agent`, `register_agents`)
- Event-driven execution support (`setup_event_driven`, `run_event_driven`)
- Message broker integration for distributed mode
- SystemAgent/ProxyAgent support
- Helper methods for space initialization
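
The "interface plus mixin" composition described above can be illustrated with a minimal sketch. `RegistryMixin`, `ParallelInterface`, and `AdapterEnv` are hypothetical stand-ins invented for this illustration; they are not HERON or PettingZoo classes and say nothing about how HERON implements its adapter.

```python
class RegistryMixin:
    """Hypothetical stand-in for the agent-management half of an adapter."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # cooperate with the other base class
        self.registered = {}

    def register_agent(self, agent_id, agent):
        if agent_id in self.registered:
            raise ValueError(f"duplicate agent id: {agent_id}")
        self.registered[agent_id] = agent


class ParallelInterface:
    """Hypothetical stand-in for the RL-facing half (reset/step contract)."""

    def reset(self):
        raise NotImplementedError

    def step(self, actions):
        raise NotImplementedError


class AdapterEnv(RegistryMixin, ParallelInterface):
    """One class that satisfies the interface and inherits the registry."""

    def reset(self):
        return {aid: 0.0 for aid in self.registered}

    def step(self, actions):
        # Echo each registered agent's action; unregistered ids are ignored
        return {aid: actions.get(aid) for aid in self.registered}


env = AdapterEnv()
env.register_agent('mg_0', object())
env.register_agent('mg_1', object())
print(sorted(env.reset()))
```

The point of the pattern is that a subclass writes only `reset` and `step`, while registration comes from the mixin.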

In [None]:
from heron.envs.adapters import PettingZooParallelEnv
from gymnasium.spaces import Dict as DictSpace


class SimpleMultiMicrogridEnv(PettingZooParallelEnv):
    """A simple multi-agent environment with 3 microgrids.
    
    This demonstrates the key HERON patterns:
    - Inheriting from PettingZooParallelEnv (not raw ParallelEnv)
    - Using register_agent() for HERON agent management
    - PettingZoo compatibility
    - Reward shaping for cooperation
    """
    
    metadata = {'render_modes': ['human'], 'name': 'simple_microgrids_v0'}
    
    def __init__(
        self,
        num_microgrids: int = 3,
        max_steps: int = 96,  # 4 days at hourly steps
        share_reward: bool = True,
        penalty: float = 10.0,
    ):
        """Initialize environment.
        
        Args:
            num_microgrids: Number of microgrid agents
            max_steps: Episode length
            share_reward: If True, all agents get same reward (encourages cooperation)
            penalty: Penalty coefficient for imbalance
        """
        # Initialize HERON's PettingZoo adapter
        super().__init__(env_id="simple_microgrids")
        
        self.num_microgrids = num_microgrids
        self.max_steps = max_steps
        self.share_reward = share_reward
        self.penalty = penalty
        
        # Create and register agents using HERON's agent management
        self.agents_dict: Dict[str, SimpleMicrogrid] = {}
        loads = [3.0, 4.0, 2.5]  # Load in MW per microgrid; cycled if num_microgrids > 3
        
        for i in range(num_microgrids):
            agent_id = f'mg_{i}'
            agent = SimpleMicrogrid(
                agent_id=agent_id,
                load=loads[i % len(loads)],
                upstream_id='dso'
            )
            self.agents_dict[agent_id] = agent
            # Register with HERON (enables event-driven execution, message broker, etc.)
            self.register_agent(agent)
        
        # Setup PettingZoo required attributes using HERON helpers
        self._set_agent_ids(list(self.agents_dict.keys()))
        self._init_spaces(
            action_spaces={aid: agent.action_space for aid, agent in self.agents_dict.items()},
            observation_spaces={aid: agent.observation_space for aid, agent in self.agents_dict.items()},
        )
        
        # State
        self._step_count = 0
        self._cumulative_cost = 0.0
    
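    # observation_spaces and action_spaces are populated by _init_spaces() above.
    # These dict-valued properties are aliases for code that expects a single
    # space attribute. PettingZoo's own API defines observation_space(agent) and
    # action_space(agent) as per-agent methods, so with these overrides use
    # env.observation_spaces[agent_id] for per-agent lookups.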
    @property
    def observation_space(self) -> Dict[str, Box]:
        return self.observation_spaces
    
    @property
    def action_space(self) -> Dict[str, Box]:
        return self.action_spaces
    
    def reset(
        self, 
        seed: Optional[int] = None,
        options: Optional[Dict] = None
    ) -> Tuple[Dict[str, np.ndarray], Dict[str, Dict]]:
        """Reset environment and all agents.
        
        Returns:
            observations: Dict of agent_id -> observation
            infos: Dict of agent_id -> info dict
        """
        self._step_count = 0
        self._cumulative_cost = 0.0
        self._agents = self._possible_agents.copy()  # Reset active agents
        
        # Use HERON's reset_agents helper
        self.reset_agents(seed=seed)
        
        # Collect observations
        observations = {aid: agent.observe() for aid, agent in self.agents_dict.items()}
        infos = {aid: {} for aid in self.agents}
        
        return observations, infos
    
    def step(
        self, 
        actions: Dict[str, np.ndarray]
    ) -> Tuple[
        Dict[str, np.ndarray],  # observations
        Dict[str, float],       # rewards
        Dict[str, bool],        # terminateds
        Dict[str, bool],        # truncateds
        Dict[str, Dict]         # infos
    ]:
        """Execute actions for all agents.
        
        This is where HERON's agent-centric design shines:
        - Each agent steps independently
        - We aggregate results for reward computation
        """
        self._step_count += 1
        self._timestep = self._step_count  # Update HERON's internal timestep
        
        # Step each agent
        results = {}
        total_cost = 0.0
        total_imbalance = 0.0
        
        for agent_id, agent in self.agents_dict.items():
            action = actions.get(agent_id, agent.action_space.sample())
            results[agent_id] = agent.step(action)
            total_cost += results[agent_id]['cost']
            total_imbalance += results[agent_id]['imbalance']
        
        self._cumulative_cost += total_cost
        
        # Compute rewards
        # Reward = -cost - penalty * imbalance
        # Lower cost and imbalance = higher reward
        collective_reward = -(total_cost + self.penalty * total_imbalance)
        
        if self.share_reward:
            # Shared (team) reward: every agent receives the same value (encourages cooperation)
            rewards = {aid: collective_reward / self.num_microgrids for aid in self.agents}
        else:
            # Independent: Each agent gets its own reward
            rewards = {
                aid: -(results[aid]['cost'] + self.penalty * results[aid]['imbalance'])
                for aid in self.agents
            }
        
        # Get new observations
        observations = {aid: agent.observe() for aid, agent in self.agents_dict.items()}
        
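        # Check termination. Episodes here end only at the step limit; Gymnasium
        # semantics would report a time limit as truncation, but this tutorial
        # reports it as termination.
        done = self._step_count >= self.max_steps
        terminateds = {aid: done for aid in self.agents}
        terminateds['__all__'] = done  # RLlib-style aggregate key, not part of the PettingZoo API
        truncateds = {aid: False for aid in self.agents}
        truncateds['__all__'] = False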
        
        # Info
        infos = {
            aid: {
                'cost': results[aid]['cost'],
                'imbalance': results[aid]['imbalance'],
                'step': self._step_count,
            }
            for aid in self.agents
        }
        
        return observations, rewards, terminateds, truncateds, infos
    
    def render(self):
        """Print current state."""
        print(f"\nStep {self._step_count}:")
        for aid, agent in self.agents_dict.items():
            bat_soc = agent.battery.state.features['soc'].soc
            gen_out = agent.gen.state.features['output'].p_mw
            print(f"  {aid}: SOC={bat_soc:.2f}, Gen={gen_out:.2f}MW, Load={agent.load:.2f}MW")


print("Environment defined using HERON's PettingZooParallelEnv!")

In [None]:
# Test the environment
env = SimpleMultiMicrogridEnv(num_microgrids=3, max_steps=10, share_reward=True)

print("Environment created!")
print(f"Agents: {env.possible_agents}")
print(f"Observation spaces: {env.observation_spaces}")
print(f"Action spaces: {env.action_spaces}")

# Reset
obs, infos = env.reset()
print(f"\nInitial observations:")
for aid, o in obs.items():
    print(f"  {aid}: {o}")

In [None]:
# Run a few steps with random actions
print("Running 5 steps with random actions...\n")

for step in range(5):
    # Random actions
    actions = {aid: env.action_spaces[aid].sample() for aid in env.agents}
    
    obs, rewards, terminateds, truncateds, infos = env.step(actions)
    
    env.render()
    print(f"  Rewards: {rewards}")
    print(f"  Total cost this step: {sum(infos[aid]['cost'] for aid in env.agents):.2f}")
    print()
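
The loop above runs a fixed five steps. A full rollout usually continues until every agent is terminated or truncated; a minimal sketch of such a loop follows. `run_episode` and `CountdownEnv` are hypothetical names for this illustration, and the stub env only mimics the `reset`/`step` return shapes used in this tutorial so that the snippet runs without HERON installed.

```python
from typing import Callable, Dict


def run_episode(env, policy: Callable, max_iters: int = 10_000) -> Dict[str, float]:
    """Roll out one episode and return the summed reward per agent."""
    obs, infos = env.reset()
    returns = {aid: 0.0 for aid in obs}
    for _ in range(max_iters):
        actions = {aid: policy(aid, o) for aid, o in obs.items()}
        obs, rewards, terminateds, truncateds, infos = env.step(actions)
        for aid in returns:
            returns[aid] += rewards.get(aid, 0.0)
        # Only look at real agent ids, so an extra '__all__' key is ignored
        if all(terminateds.get(aid, False) or truncateds.get(aid, False) for aid in returns):
            break
    return returns


class CountdownEnv:
    """Hypothetical stub: two agents, reward -1 per step, ends after `horizon` steps."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.agents = ['a', 'b']
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return {aid: 0 for aid in self.agents}, {aid: {} for aid in self.agents}

    def step(self, actions):
        self.t += 1
        done = self.t >= self.horizon
        obs = {aid: self.t for aid in self.agents}
        rewards = {aid: -1.0 for aid in self.agents}
        terminateds = {aid: done for aid in self.agents}
        truncateds = {aid: False for aid in self.agents}
        infos = {aid: {} for aid in self.agents}
        return obs, rewards, terminateds, truncateds, infos


episode_returns = run_episode(CountdownEnv(), policy=lambda aid, o: 0)
print(episode_returns)
```

With the tutorial's environment, the same function could be called as `run_episode(env, lambda aid, o: env.action_spaces[aid].sample())`.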

## Step 3: Understanding Key Design Decisions

### Why PettingZooParallelEnv (HERON's Adapter)?

HERON's `PettingZooParallelEnv` combines two things:
1. **PettingZoo's ParallelEnv** - Consumable by RLlib, StableBaselines3, and TorchRL, typically through each framework's PettingZoo wrapper
2. **HERON's HeronEnvCore** - Agent management, event-driven execution, messaging

```python
# Raw PettingZoo (DON'T do this)
class MyEnv(ParallelEnv):  # No HERON features
    pass

# HERON adapter (DO this)
class MyEnv(PettingZooParallelEnv):  # Has HERON features
    def __init__(self):
        super().__init__(env_id="my_env")
        self.register_agent(agent)  # HERON agent management
        self._set_agent_ids([...])  # HERON helper
```

### Why `register_agent()`?

This gives you:
- Automatic agent tracking (`heron_agents`, `heron_coordinators`)
- Event-driven execution support (Tutorial 06)
- Message broker integration (distributed mode)

### Why Shared Rewards?

In **cooperative** settings, agents should optimize collective goals:
```python
if self.share_reward:
    # All agents get same reward -> learn to cooperate
    rewards = {aid: collective_reward / num_agents for aid in agents}
```

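A shared team reward like this is often paired with **Centralized Training with Decentralized Execution (CTDE)**: training may use global information, while each agent acts on its own observation at execution time.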
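
The difference between the two reward modes can be checked with a few lines of plain Python. The sketch below uses made-up per-agent results; `split_rewards` is a hypothetical helper that mirrors the `share_reward` branch in `step()`.

```python
def split_rewards(results, penalty=10.0, share_reward=True):
    """Return per-agent rewards under the shared or the independent scheme."""
    individual = {
        aid: -(r['cost'] + penalty * r['imbalance'])
        for aid, r in results.items()
    }
    if not share_reward:
        return individual
    # Shared scheme: collective reward divided evenly among agents
    team_share = sum(individual.values()) / len(individual)
    return {aid: team_share for aid in individual}


# Made-up step results for three microgrids (illustrative numbers only)
results = {
    'mg_0': {'cost': 100.0, 'imbalance': 1.0},
    'mg_1': {'cost': 150.0, 'imbalance': 0.0},
    'mg_2': {'cost': 50.0, 'imbalance': 2.0},
}

independent = split_rewards(results, share_reward=False)
shared = split_rewards(results, share_reward=True)
print(independent)  # {'mg_0': -110.0, 'mg_1': -150.0, 'mg_2': -70.0}
print(shared)       # {'mg_0': -110.0, 'mg_1': -110.0, 'mg_2': -110.0}
```

The shared value is the mean of the individual rewards, so each agent is credited or blamed for its neighbours' outcomes as well as its own.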

### Why Penalty for Imbalance?

Power grids must balance supply and demand, so the reward includes an imbalance penalty:
```python
reward = -(cost + penalty * imbalance)
```
This encourages agents to match generation to load.
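
To see how the two terms trade off, here is a minimal sketch of the per-microgrid reward, assuming the tutorial's defaults (generation cost 50 per MWh, `penalty=10.0`, `dt=1.0`). `microgrid_reward` is a helper written for this illustration, not part of HERON.

```python
def microgrid_reward(gen_mw, bat_mw, load_mw, cost_per_mwh=50.0, penalty=10.0, dt=1.0):
    """Reward for one microgrid, following SimpleMicrogrid.step and the env's reward."""
    cost = gen_mw * dt * cost_per_mwh
    net_power = gen_mw - bat_mw          # positive bat_mw means the battery is charging
    imbalance = abs(load_mw - net_power)
    return -(cost + penalty * imbalance)


balanced = microgrid_reward(gen_mw=3.0, bat_mw=0.0, load_mw=3.0)
idle = microgrid_reward(gen_mw=0.0, bat_mw=0.0, load_mw=3.0)
idle_high_penalty = microgrid_reward(gen_mw=0.0, bat_mw=0.0, load_mw=3.0, penalty=100.0)
print(balanced)           # -150.0
print(idle)               # -30.0
print(idle_high_penalty)  # -300.0
```

With these defaults, an idle generator scores higher (-30) than a perfectly balanced one (-150), because one MW of unmet load costs 10 while one MW of generation costs 50. If agents learn to under-generate, raising `penalty` above the marginal generation cost is one thing to try.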

## Step 4: Adding Configuration Support

For production, environments should be configurable via dicts/YAML.

In [None]:
# Example config (like what load_setup() would return)
env_config = {
    'num_microgrids': 3,
    'max_steps': 96,
    'share_reward': True,
    'penalty': 10.0,
    'train': True,        # not consumed by create_env below
    'centralized': True,  # not consumed by create_env below
}

# Factory function for RLlib
def create_env(config: Dict) -> SimpleMultiMicrogridEnv:
    """Create environment from config dict."""
    return SimpleMultiMicrogridEnv(
        num_microgrids=config.get('num_microgrids', 3),
        max_steps=config.get('max_steps', 96),
        share_reward=config.get('share_reward', True),
        penalty=config.get('penalty', 10.0),
    )

# Test
env2 = create_env(env_config)
print(f"Created env with {env2.num_microgrids} microgrids, {env2.max_steps} max steps")
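
One common refinement is to validate the config before it reaches the constructor. A minimal sketch, assuming JSON as the on-disk format (a YAML file would look the same once parsed into a dict); `EnvConfig` is a hypothetical helper, not a HERON class.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class EnvConfig:
    """Typed view of the constructor arguments, with the tutorial's defaults."""
    num_microgrids: int = 3
    max_steps: int = 96
    share_reward: bool = True
    penalty: float = 10.0

    @classmethod
    def from_dict(cls, raw: dict) -> "EnvConfig":
        # Keep only keys the environment understands (drops e.g. 'train')
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


raw = json.loads('{"num_microgrids": 2, "penalty": 25.0, "train": true}')
cfg = EnvConfig.from_dict(raw)
kwargs = asdict(cfg)
print(kwargs)
```

`SimpleMultiMicrogridEnv(**kwargs)` would then receive only arguments it accepts, with missing keys filled from the defaults.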

## Key Takeaways

1. **Use HERON Adapters, Not Raw PettingZoo**
   - `PettingZooParallelEnv` = ParallelEnv + HeronEnvCore
   - Enables event-driven execution, message broker, agent management

2. **Register Agents with HERON**
   ```python
   self.register_agent(agent)  # Enables HERON features
   self._set_agent_ids([...])  # Setup PettingZoo attributes
   self._init_spaces(...)      # Setup action/observation spaces
   ```

3. **PettingZoo API**
   - `reset()` returns (observations, infos)
   - `step(actions)` returns (obs, rewards, terminateds, truncateds, infos)
   - All dicts keyed by agent_id

4. **Shared Rewards for Cooperation**
   - Shared: The same reward for every agent encourages collective optimization
   - Independent: Each agent optimizes selfishly

5. **Configuration-Driven Design**
   - Factory functions enable easy experimentation
   - YAML/dict configs for reproducibility

---

**Next:** [05_training_with_rllib.ipynb](05_training_with_rllib.ipynb) - Training with MAPPO