# Deep Q-Network for Multi-Device Volt-VAR Control

From the [Sisyphean Gridworks ML Playground](https://sgridworks.com/ml-playground/guides/14-advanced-volt-var.html)

## Setup

Clone the repository and install dependencies. Run this cell first.

In [None]:
!git clone https://github.com/SGridworks/Dynamic-Network-Model.git 2>/dev/null || echo 'Already cloned'
%cd Dynamic-Network-Model
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm pyarrow torch

## Recap: Why Q-Tables Cannot Scale

In Guide 06, the Q-learning agent used a small table indexed by (voltage_bucket, cap_state). The voltage reading was discretized into 5 buckets and there were 2 capacitor states, giving 10 discrete states; with 2 actions per state, the Q-table held 20 entries. This worked because the problem was simple: one continuous reading, one binary control.
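
To make the table lookup concrete, here is a minimal sketch of a tabular update indexed by (voltage_bucket, cap_state). The bucket edges, learning rate, and helper names are assumptions chosen for illustration; they are not taken from Guide 06.

```python
from bisect import bisect_right

# Hypothetical bucket edges (an assumption, not Guide 06's actual values):
# four interior edges split the voltage axis into 5 buckets.
BUCKET_EDGES = [0.96, 0.98, 1.02, 1.04]

def voltage_bucket(v_pu):
    """Map a continuous voltage (p.u.) to a bucket index in 0..4."""
    return bisect_right(BUCKET_EDGES, v_pu)

# Q-table as nested lists: q[bucket][cap_state][action]
q_table = [[[0.0, 0.0] for _ in range(2)] for _ in range(5)]

def q_update(q, v, cap, action, reward, v_next, cap_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update for a single transition."""
    b, b_next = voltage_bucket(v), voltage_bucket(v_next)
    td_target = reward + gamma * max(q[b_next][cap_next])
    q[b][cap][action] += alpha * (td_target - q[b][cap][action])

q_update(q_table, v=0.97, cap=0, action=1, reward=1.0, v_next=1.00, cap_next=1)
n_entries = sum(len(actions) for bucket in q_table for actions in bucket)
print(n_entries)  # 20
```

Every state-action pair has its own cell, so the table grows with the product of all discretized dimensions. That is the scaling problem the rest of this guide addresses.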

In [None]:
import numpy as np

# Guide 06 Q-table: 5 voltage buckets x 2 cap states x 2 actions = 20 entries
q_table_guide06 = np.zeros((5, 2, 2))
print(f"Guide 06 Q-table size: {q_table_guide06.size} entries")

# Now consider the multi-device problem:
#   - 15 monitored buses, each with voltage discretized into 10 buckets
#   - 3 capacitor banks, each ON/OFF (2^3 = 8 combinations)
#   - 2 regulators, each with 33 tap positions (33^2 = 1,089 combinations)
#   - 4 smart inverters, each with 5 VAR setpoints (5^4 = 625 combinations)
n_voltage_states = 10 ** 15        # 10 buckets per bus, 15 buses
n_device_states = 8 * 1089 * 625  # all device combinations
n_actions = 8 * 1089 * 625        # can set any device combination

print(f"\nMulti-device Q-table would need:")
print(f"  State space:  {n_voltage_states * n_device_states:.2e}")
print(f"  Action space: {n_actions:,}")
print(f"  Q-table entries: {n_voltage_states * n_device_states * n_actions:.2e}")
print(f"  That's impossibly large. We need function approximation.")

## Define the Multi-Device VVO Environment

We expand the single-device environment from Guide 06 into a multi-device environment. The state vector now includes continuous voltage readings at all monitored buses, capacitor bank statuses, regulator tap positions, and smart inverter VAR setpoints. Actions control all devices simultaneously.
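
One way to let a single discrete action drive several device groups at once is to enumerate joint adjustments. The sketch below assumes three adjustments (lower, hold, raise) per device group, which matches the 27-action scheme this guide uses; the `encode` helper is a hypothetical convenience and not part of the environment.

```python
from itertools import product

# Each device group gets one of three adjustments: lower (-1), hold (0), raise (+1).
ADJUSTMENTS = (-1, 0, 1)

# Enumerate joint adjustments in (caps, regs, inverters) order: 3^3 = 27 actions.
# product() varies the last element fastest, like three nested for-loops.
ACTION_MAP = list(product(ADJUSTMENTS, repeat=3))

def encode(cap_adj, reg_adj, inv_adj):
    """Inverse lookup: joint adjustment -> action index (base-3 encoding)."""
    return (cap_adj + 1) * 9 + (reg_adj + 1) * 3 + (inv_adj + 1)

print(len(ACTION_MAP))   # 27
print(ACTION_MAP[13])    # (0, 0, 0)
print(encode(1, 1, 1))   # 26
```

Under this ordering, index 0 lowers every group, index 13 holds everything, and index 26 raises every group.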

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from demo_data.load_demo_data import load_load_profiles, load_solar_profiles, load_network_nodes

# Load SP&L datasets via the data-loader API
load_profiles = load_load_profiles()        # 15-min load + voltage_pu per feeder
solar_profiles = load_solar_profiles()      # hourly generation curves
network_nodes = load_network_nodes()        # node locations and equipment classes

# Filter to a single feeder for the RL environment
feeder_id = load_profiles["feeder_id"].iloc[0]
feeder_load = load_profiles[load_profiles["feeder_id"] == feeder_id].reset_index(drop=True)
feeder_pv = solar_profiles.reset_index(drop=True)

# Device configuration for feeder FDR-0003
N_BUSES = 15           # monitored voltage measurement points
N_CAPS = 3             # capacitor banks (each ON/OFF)
N_REGS = 2             # regulators (tap positions -16 to +16, discretized to 5 levels)
N_INVERTERS = 4        # smart inverters (5 VAR setpoints each)
REG_TAP_LEVELS = 5     # discretized tap positions: [-8, -4, 0, +4, +8]
INV_VAR_LEVELS = 5     # setpoints: [-100%, -50%, 0%, +50%, +100%] of rated kVAR

# State: voltages + device states (continuous vector)
STATE_DIM = N_BUSES + N_CAPS + N_REGS + N_INVERTERS  # 15 + 3 + 2 + 4 = 24

# Actions: encode as discrete combinations
# Each action selects: cap_combo (2^3=8) x reg_combo (5^2=25) x inv_combo (5^4=625)
# Full action space = 125,000 -- too many for DQN output layer
# Instead, use 27 "adjustment" actions (see below)
N_ACTIONS = 27

print(f"State dimension:  {STATE_DIM}")
print(f"Action space:     {N_ACTIONS} discrete adjustment actions")
print(f"Monitored buses:  {N_BUSES}")
print(f"Control devices:  {N_CAPS} caps + {N_REGS} regs + {N_INVERTERS} inverters")

## Build the DQN Architecture in PyTorch

The core idea of DQN: replace the Q-table with a neural network. The network takes the state vector (24 dimensions) as input and outputs a Q-value for each of the 27 possible actions. The action with the highest predicted Q-value is the one the agent selects.
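
For a sense of scale, the sketch below counts the trainable parameters of a fully connected network with hidden layers of 128, 128, and 64 units (the widths used by the `DQNetwork` class in this guide) and shows greedy selection over a vector of Q-values. The helper names are illustrative assumptions.

```python
# Layer widths: 24-dim state -> 128 -> 128 -> 64 -> 27 Q-values
layer_sizes = [24, 128, 128, 64, 27]

def count_dense_params(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def greedy_action(q_values):
    """Index of the largest Q-value (ties resolve to the first maximum)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

total_params = count_dense_params(layer_sizes)
print(total_params)                          # 29723
print(greedy_action([0.1, -0.4, 0.7, 0.2]))  # 2
```

Roughly 30 thousand parameters replace a table that could never be stored, because the network shares what it learns across similar states instead of keeping one cell per state.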

In [None]:
class MultiDeviceVVOEnv:
    """Multi-device Volt-VAR environment for DQN training.

    State vector (24 dims):
        [0:15]  - voltage p.u. at each monitored bus
        [15:18] - capacitor bank status (0=OFF, 1=ON)
        [18:20] - regulator tap position (normalized to [-1, 1])
        [20:24] - smart inverter VAR setpoint (normalized to [-1, 1])

    Actions (27 discrete):
        Combinations of {raise, hold, lower} for three device groups:
        caps (3 options) x regs (3 options) x inverters (3 options) = 27
    """

    def __init__(self, load_data, pv_data, n_hours=24):
        self.load_data = load_data
        self.pv_data = pv_data
        self.n_hours = n_hours

        # Device state arrays
        self.cap_states = np.zeros(N_CAPS)          # 0 or 1
        self.reg_taps = np.zeros(N_REGS)             # normalized [-1, 1]
        self.inv_setpoints = np.zeros(N_INVERTERS)   # normalized [-1, 1]
        self.hour = 0
        self.day_offset = 0

        # Decode 27 actions into adjustment commands
        self.action_map = []
        for cap_adj in [-1, 0, 1]:
            for reg_adj in [-1, 0, 1]:
                for inv_adj in [-1, 0, 1]:
                    self.action_map.append((cap_adj, reg_adj, inv_adj))

    def reset(self, day_offset=None):
        """Reset to beginning of a 24-hour episode."""
        self.cap_states = np.zeros(N_CAPS)
        self.reg_taps = np.zeros(N_REGS)
        self.inv_setpoints = np.zeros(N_INVERTERS)
        self.hour = 0
        if day_offset is not None:
            self.day_offset = day_offset
        else:
            self.day_offset = np.random.randint(0, len(self.load_data) - self.n_hours)
        return self._get_state()

    def _get_voltages(self):
        """Read voltage from load_profiles and simulate device effects."""
        idx = self.day_offset + self.hour
        # Read the actual voltage_pu from SP&L load profiles
        base_voltage = self.load_data.iloc[idx]["voltage_pu"]
        # Spread across monitored buses with small spatial gradient
        base_v = base_voltage - 0.005 * np.linspace(0, 1, N_BUSES)

        # Simulate device effects on voltage
        # Capacitor: each ON cap boosts voltage by ~0.02 p.u.
        cap_boost = np.sum(self.cap_states) * 0.02
        reg_boost = np.mean(self.reg_taps) * 0.015
        inv_boost = np.mean(self.inv_setpoints) * 0.006
        voltages = base_v + cap_boost + reg_boost + inv_boost
        # Add small random noise to simulate measurement uncertainty
        voltages += np.random.normal(0, 0.002, N_BUSES)
        return np.clip(voltages, 0.85, 1.15)

    def _get_state(self):
        """Build the 24-dim state vector."""
        voltages = self._get_voltages()
        return np.concatenate([
            voltages,
            self.cap_states,
            self.reg_taps,
            self.inv_setpoints
        ]).astype(np.float32)

    def _apply_action(self, action_idx):
        """Apply adjustment action to all device groups."""
        cap_adj, reg_adj, inv_adj = self.action_map[action_idx]
        prev_caps = self.cap_states.copy()
        prev_taps = self.reg_taps.copy()

        # Toggle capacitors: adj=+1 turns next OFF cap ON, adj=-1 turns last ON cap OFF
        if cap_adj == 1:
            off_caps = np.where(self.cap_states == 0)[0]
            if len(off_caps) > 0:
                self.cap_states[off_caps[0]] = 1
        elif cap_adj == -1:
            on_caps = np.where(self.cap_states == 1)[0]
            if len(on_caps) > 0:
                self.cap_states[on_caps[-1]] = 0

        # Adjust regulator taps
        self.reg_taps = np.clip(self.reg_taps + reg_adj * 0.25, -1.0, 1.0)

        # Adjust inverter VAR setpoints
        self.inv_setpoints = np.clip(self.inv_setpoints + inv_adj * 0.25, -1.0, 1.0)

        # Count switching operations for penalty
        n_switches = int(np.sum(self.cap_states != prev_caps))
        n_switches += int(np.sum(self.reg_taps != prev_taps))
        return n_switches

    def step(self, action_idx):
        """Execute one timestep: apply action, advance, return (state, reward, done, info)."""
        n_switches = self._apply_action(action_idx)
        self.hour += 1
        done = self.hour >= self.n_hours

        state = self._get_state()
        voltages = state[:N_BUSES]

        # Compute reward (defined in Step 6)
        reward, info = self._compute_reward(voltages, n_switches)
        info["voltages"] = voltages
        info["hour"] = self.hour

        return state, reward, done, info

    def _compute_reward(self, voltages, n_switches):
        """Multi-objective reward (detailed in Step 6)."""
        # Voltage violation penalty
        violations = np.sum((voltages < 0.95) | (voltages > 1.05))
        v_penalty = -5.0 * violations

        # Deviation from 1.0 p.u. (proxy for losses)
        deviation = np.mean((voltages - 1.0) ** 2)
        loss_penalty = -10.0 * deviation

        # Switching penalty
        switch_penalty = -0.5 * n_switches

        # Bonus for all voltages in range
        all_ok = 2.0 if violations == 0 else 0.0

        reward = v_penalty + loss_penalty + switch_penalty + all_ok

        info = {
            "violations": violations,
            "mean_deviation": deviation,
            "n_switches": n_switches,
            "reward_breakdown": {
                "voltage": v_penalty,
                "loss": loss_penalty,
                "switching": switch_penalty,
                "bonus": all_ok
            }
        }
        return reward, info

# Test the environment
env = MultiDeviceVVOEnv(feeder_load, feeder_pv, n_hours=24)
state = env.reset(day_offset=0)
print(f"Initial state shape: {state.shape}")
print(f"Bus voltages: {state[:5].round(4)} ... (first 5 of {N_BUSES})")
print(f"Cap states:   {state[15:18]}")
print(f"Reg taps:     {state[18:20]}")
print(f"Inv setpoints:{state[20:24]}")

## Implement Experience Replay Buffer

In Q-learning (Guide 06), we updated the Q-table immediately after every step. This creates a problem for neural networks: consecutive experiences are highly correlated (hour 3 looks a lot like hour 4), which destabilizes gradient descent. Experience replay stores transitions in a buffer and trains on random mini-batches, breaking temporal correlation.
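
The two mechanics a replay buffer relies on, bounded storage and random sampling, can be seen in a few lines. This is a toy sketch with a capacity of 3 and placeholder transitions, not the buffer used for training.

```python
import random
from collections import deque

random.seed(0)  # fixed seed only so the sketch is repeatable

buffer = deque(maxlen=3)  # tiny capacity for illustration
for hour in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((hour, 13, 0.0, hour + 1, False))

# Once the buffer is full, each append evicts the oldest transition,
# so only hours 2, 3, and 4 remain.
stored_hours = [t[0] for t in buffer]

# random.sample draws without replacement, so a batch has no duplicates
# and is not restricted to consecutive hours.
batch = random.sample(list(buffer), 2)
print(stored_hours)  # [2, 3, 4]
```

With a realistic capacity, a mini-batch mixes transitions from many different days and hours, which is what weakens the correlation between consecutive training samples.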

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

class DQNetwork(nn.Module):
    """Deep Q-Network: maps state vector to Q-values for each action.

    Architecture:
        Input  (24) -> Dense(128) -> ReLU -> Dense(128) -> ReLU -> Dense(64) -> ReLU -> Output(27)

    The network learns: Q(state, action) ≈ expected cumulative reward
    for taking 'action' in 'state' and following the optimal policy after.
    """

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        """Forward pass: state tensor -> Q-values for all actions."""
        return self.network(x)

# Initialize the Q-network
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
q_network = DQNetwork(STATE_DIM, N_ACTIONS).to(device)

print(f"Device: {device}")
print(f"Network architecture:")
print(q_network)
print(f"\nTotal parameters: {sum(p.numel() for p in q_network.parameters()):,}")

# Test forward pass
test_state = torch.FloatTensor(state).unsqueeze(0).to(device)
q_values = q_network(test_state)
print(f"\nTest Q-values shape: {q_values.shape}")
print(f"Best action: {q_values.argmax(dim=1).item()}")

## Implement the Target Network

DQN uses two copies of the Q-network: the online network (updated every step via gradient descent) and the target network (a frozen copy updated only periodically). The target network provides stable Q-value targets during training, preventing a feedback loop where the network chases its own rapidly changing predictions.
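
As a worked example of the targets the frozen network supplies, the sketch below computes r + gamma * max Q_target(s', a') for two made-up transitions. The Q-values are invented for illustration; the episode count and update frequency mirror the settings used later in this guide.

```python
GAMMA = 0.99

def td_targets(rewards, next_q_target, dones, gamma=GAMMA):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrapping at episode end."""
    return [
        r + gamma * max(q_row) * (1.0 - d)
        for r, q_row, d in zip(rewards, next_q_target, dones)
    ]

rewards = [1.0, -5.0]
next_q_target = [[0.5, 2.0, 1.0], [3.0, 0.0, -1.0]]  # from the target network
dones = [0.0, 1.0]                                   # second transition ends the episode

targets = td_targets(rewards, next_q_target, dones)
print([round(t, 2) for t in targets])  # [2.98, -5.0]

# With a sync every 100 episodes over 500 episodes, the target
# network is refreshed only a handful of times.
n_syncs = 500 // 100
print(n_syncs)  # 5
```

Because the target values come from weights that stay fixed between syncs, the regression target does not shift after every gradient step.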

In [None]:
from collections import deque
import random

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples.

    Each experience is (state, action, reward, next_state, done).
    Training samples random mini-batches to break temporal correlation.
    """

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Sample a random batch of transitions."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        return (
            torch.FloatTensor(np.array(states)).to(device),
            torch.LongTensor(actions).to(device),
            torch.FloatTensor(rewards).to(device),
            torch.FloatTensor(np.array(next_states)).to(device),
            torch.FloatTensor(dones).to(device),
        )

    def __len__(self):
        return len(self.buffer)

# Initialize buffer
replay_buffer = ReplayBuffer(capacity=50000)
print(f"Replay buffer initialized (capacity: 50,000 transitions)")

## Define the Multi-Objective Reward Function

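The reward function is already implemented in the environment's `_compute_reward` method, but it deserves a detailed explanation. VVO has three competing objectives that the reward must balance: keeping every bus voltage within limits, minimizing losses, and limiting switching operations.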
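
A worked example helps show how the terms trade off. The sketch below reuses the weights from `_compute_reward` (-5.0 per violating bus, -10.0 times the mean squared deviation, -0.5 per switching operation, +2.0 when every bus is in range) in a standalone function. The sample voltages are made up for illustration.

```python
def vvo_reward(voltages, n_switches, v_min=0.95, v_max=1.05):
    """Standalone version of the multi-objective reward, returning total and parts."""
    violations = sum(1 for v in voltages if v < v_min or v > v_max)
    deviation = sum((v - 1.0) ** 2 for v in voltages) / len(voltages)
    parts = {
        "voltage": -5.0 * violations,
        "loss": -10.0 * deviation,
        "switching": -0.5 * n_switches,
        "bonus": 2.0 if violations == 0 else 0.0,
    }
    return sum(parts.values()), parts

# 13 buses at nominal, one low (0.94) and one high (1.06), one switching operation
voltages = [1.0] * 13 + [0.94, 1.06]
total, parts = vvo_reward(voltages, n_switches=1)
print(parts)
print(round(total, 4))  # -10.5048
```

In this example the violation penalty (-10.0) dominates, the switching cost (-0.5) is secondary, and the deviation term (about -0.005) is small by comparison. With these weights the agent is pushed to clear violations first and to fine-tune deviation only afterwards.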

In [None]:
import copy

class DQNAgent:
    """DQN Agent with experience replay and target network."""

    def __init__(self, state_dim, n_actions, lr=1e-3, gamma=0.99,
                 epsilon_start=1.0, epsilon_end=0.02, epsilon_decay=0.995,
                 target_update_freq=100, batch_size=64):
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.target_update_freq = target_update_freq
        self.batch_size = batch_size
        self.train_step = 0

        # Online network: updated every training step
        self.q_network = DQNetwork(state_dim, n_actions).to(device)

        # Target network: frozen copy, updated periodically
        self.target_network = copy.deepcopy(self.q_network)
        self.target_network.eval()  # never in training mode

        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()
        self.replay_buffer = ReplayBuffer(capacity=50000)

    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
            q_values = self.q_network(state_t)
            return q_values.argmax(dim=1).item()

    def train_on_batch(self):
        """Sample a batch from replay buffer and update the Q-network."""
        if len(self.replay_buffer) < self.batch_size:
            return None

        states, actions, rewards, next_states, dones = \
            self.replay_buffer.sample(self.batch_size)

        # Current Q-values: Q(s, a) from online network
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Target Q-values: r + gamma * max_a' Q_target(s', a')
        with torch.no_grad():
            next_q = self.target_network(next_states).max(dim=1)[0]
            target_q = rewards + self.gamma * next_q * (1 - dones)

        # Compute loss and update
        loss = self.loss_fn(current_q, target_q)
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=1.0)
        self.optimizer.step()

        self.train_step += 1
        return loss.item()

    def update_target_network(self):
        """Copy online network weights to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

    def decay_epsilon(self):
        """Reduce exploration rate."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

# Initialize the agent
agent = DQNAgent(
    state_dim=STATE_DIM,
    n_actions=N_ACTIONS,
    lr=1e-3,
    gamma=0.99,
    epsilon_start=1.0,
    epsilon_end=0.02,
    epsilon_decay=0.995,
    target_update_freq=100,
    batch_size=64,
)
print("DQN Agent initialized.")
print(f"  Online network params:  {sum(p.numel() for p in agent.q_network.parameters()):,}")
print(f"  Target network params:  {sum(p.numel() for p in agent.target_network.parameters()):,}")
print(f"  Target update every:    {agent.target_update_freq} episodes")

## Train the DQN Agent

Each training episode simulates a full 24-hour day. The agent starts with random exploration (high epsilon) and gradually shifts to exploiting its learned policy. We train for 500 episodes, which represents 500 simulated days of VVO operation.
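
It is worth checking how far exploration actually decays over 500 episodes. The sketch below applies the multiplicative schedule from this guide's agent settings (start 1.0, floor 0.02, decay 0.995 per episode). The variable names are local to the example.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.02, 0.995
N_EPISODES = 500

# Apply the per-episode decay exactly as the agent does
eps = EPS_START
for _ in range(N_EPISODES):
    eps = max(EPS_END, eps * EPS_DECAY)

# Smallest number of episodes after which the floor is reached
episodes_to_floor = math.ceil(math.log(EPS_END / EPS_START) / math.log(EPS_DECAY))

print(f"epsilon after {N_EPISODES} episodes: {eps:.3f}")
print(f"episodes to reach floor: {episodes_to_floor}")
```

Under these settings epsilon is still about 0.08 when training ends, so roughly one action in twelve remains random in the final episodes. The floor of 0.02 would only be reached with a longer run or a faster decay.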

In [None]:
# Reward function breakdown (from MultiDeviceVVOEnv._compute_reward):
#
# 1. VOLTAGE VIOLATION PENALTY: -5.0 per bus outside [0.95, 1.05] p.u.
#    This is the primary safety constraint. ANSI C84.1 Range A requires
#    service voltage within +/- 5% of nominal. Violations can damage
#    customer equipment and trigger regulatory penalties.
#
# 2. LOSS MINIMIZATION: -10.0 * mean((V - 1.0)^2)
#    Voltage deviation from nominal is a proxy for reactive power losses.
#    Keeping voltage close to 1.0 p.u. across the feeder minimizes I^2R
#    losses and improves efficiency (conservation voltage reduction).
#
# 3. SWITCHING PENALTY: -0.5 per switching operation
#    Excessive switching wears out mechanical equipment (cap bank switches,
#    regulator tap changers). Utilities limit operations to ~6 per day.
#    This penalty encourages the agent to find stable setpoints.
#
# 4. COMPLIANCE BONUS: +2.0 when ALL buses are within ANSI limits
#    Rewards the agent for achieving the primary objective.

# Demonstrate the reward components on a sample step
env = MultiDeviceVVOEnv(feeder_load, feeder_pv, n_hours=24)
state = env.reset(day_offset=0)

# Take a "do nothing" action (hold all devices)
hold_action = 13  # (0, 0, 0) = hold caps, hold regs, hold inverters
next_state, reward, done, info = env.step(hold_action)

print("Reward breakdown for 'hold all' action:")
for component, value in info["reward_breakdown"].items():
    print(f"  {component:<12s}: {value:+.3f}")
print(f"  {'TOTAL':<12s}: {reward:+.3f}")
print(f"\nVoltage violations: {info['violations']} of {N_BUSES} buses")
print(f"Mean V deviation:   {info['mean_deviation']:.6f}")
print(f"Switching ops:      {info['n_switches']}")

## Evaluate: DQN vs Rule-Based vs Q-Learning

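Run the DQN and rule-based controllers on the same 30-day evaluation window and compare them with approximate Q-learning figures carried over from Guide 06. The comparison covers three key operational metrics: voltage violation minutes, total deviation (a proxy for losses), and number of switching operations.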

In [None]:
# Training configuration
N_EPISODES = 500
LOG_INTERVAL = 50

# Tracking metrics
episode_rewards = []
episode_violations = []
episode_losses = []

env = MultiDeviceVVOEnv(feeder_load, feeder_pv, n_hours=24)

for ep in range(N_EPISODES):
    state = env.reset()
    total_reward = 0
    total_violations = 0
    ep_losses = []

    for t in range(24):
        # Select action with epsilon-greedy policy
        action = agent.select_action(state)

        # Execute in environment
        next_state, reward, done, info = env.step(action)

        # Store transition in replay buffer
        agent.replay_buffer.push(
            state, action, reward, next_state, float(done)
        )

        # Train on a random batch
        loss = agent.train_on_batch()
        if loss is not None:
            ep_losses.append(loss)

        total_reward += reward
        total_violations += info["violations"]
        state = next_state

    # Update target network periodically
    if (ep + 1) % agent.target_update_freq == 0:
        agent.update_target_network()

    # Decay exploration
    agent.decay_epsilon()

    # Track metrics
    episode_rewards.append(total_reward)
    episode_violations.append(total_violations)
    episode_losses.append(np.mean(ep_losses) if ep_losses else 0)

    if (ep + 1) % LOG_INTERVAL == 0:
        avg_reward = np.mean(episode_rewards[-LOG_INTERVAL:])
        avg_viols = np.mean(episode_violations[-LOG_INTERVAL:])
        print(f"Episode {ep+1:>4}/{N_EPISODES}  "
              f"Avg Reward: {avg_reward:>7.1f}  "
              f"Avg Violations: {avg_viols:>5.1f}  "
              f"Epsilon: {agent.epsilon:.3f}  "
              f"Loss: {episode_losses[-1]:.4f}")

print(f"\nTraining complete. Buffer size: {len(agent.replay_buffer):,}")

## Test Generalization on High-Variability Days

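A critical question for any ML-based controller: does it work on conditions it has rarely or never seen? Cloud transients cause rapid swings in solar generation, creating voltage fluctuations that stress VVO controllers. We evaluate the trained DQN on the days with the highest PV variability in the dataset. Because training episodes start at random offsets in the same dataset, some of these days may overlap with training data; a held-out split would give a stricter test.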

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Reward curve
ax = axes[0]
ax.plot(episode_rewards, alpha=0.3, color="#5FCCDB")
ax.plot(pd.Series(episode_rewards).rolling(20).mean(),
       color="#1C4855", linewidth=2, label="20-episode avg")
ax.set_xlabel("Episode")
ax.set_ylabel("Total Episode Reward")
ax.set_title("DQN Training: Reward")
ax.legend()

# Violation curve
ax = axes[1]
ax.plot(episode_violations, alpha=0.3, color="#fc8181")
ax.plot(pd.Series(episode_violations).rolling(20).mean(),
       color="#c53030", linewidth=2, label="20-episode avg")
ax.set_xlabel("Episode")
ax.set_ylabel("Total Voltage Violations")
ax.set_title("DQN Training: Violations")
ax.legend()

# Loss curve
ax = axes[2]
ax.plot(episode_losses, alpha=0.3, color="#fbd38d")
ax.plot(pd.Series(episode_losses).rolling(20).mean(),
       color="#d69e2e", linewidth=2, label="20-episode avg")
ax.set_xlabel("Episode")
ax.set_ylabel("Mean MSE Loss")
ax.set_title("DQN Training: Loss")
ax.legend()

plt.suptitle("DQN Training Progress for Multi-Device VVO", fontsize=14)
plt.tight_layout()
plt.show()

## Model Persistence and Hyperparameter Justification

Save the trained DQN weights so you can deploy the agent or resume training later without retraining from scratch.
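
Resuming training usually needs more than the network weights. The sketch below saves a checkpoint dictionary with the optimizer state and exploration rate alongside the weights. The small stand-in network, the dictionary keys, and the file name are assumptions for illustration; in this guide the objects would be `agent.q_network` and `agent.optimizer`.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Small stand-in network (assumption); substitute the trained Q-network
net = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 27))
optimizer = optim.Adam(net.parameters(), lr=1e-3)

checkpoint = {
    "q_network": net.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epsilon": 0.08,   # illustrative value
    "episode": 500,
}
path = os.path.join(tempfile.mkdtemp(), "vvo_dqn_checkpoint.pt")  # hypothetical name
torch.save(checkpoint, path)

# Restore into freshly constructed objects with the same architecture
restored_net = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 27))
restored_opt = optim.Adam(restored_net.parameters(), lr=1e-3)
loaded = torch.load(path, map_location="cpu")
restored_net.load_state_dict(loaded["q_network"])
restored_opt.load_state_dict(loaded["optimizer"])
restored_net.eval()  # switch to inference mode for deployment
```

For deployment alone, the weights are sufficient. The optimizer state and epsilon matter only when training continues from the checkpoint.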

In [None]:
def evaluate_dqn(agent, env, n_days=30):
    """Run trained DQN agent for n_days and collect metrics."""
    all_violations = 0
    all_deviation = 0.0
    all_switches = 0
    all_rewards = 0.0
    hourly_voltages = []

    for day in range(n_days):
        state = env.reset(day_offset=day * 24)
        for t in range(24):
            with torch.no_grad():
                state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
                action = agent.q_network(state_t).argmax(dim=1).item()
            state, reward, done, info = env.step(action)
            all_violations += info["violations"]
            all_deviation += info["mean_deviation"]
            all_switches += info["n_switches"]
            all_rewards += reward
            hourly_voltages.append(info["voltages"].mean())

    return {
        "violation_minutes": all_violations * 60,  # each hour step = 60 min
        "total_deviation": all_deviation,
        "switching_ops": all_switches,
        "total_reward": all_rewards,
        "hourly_voltages": hourly_voltages,
    }

def evaluate_rule_based(env, n_days=30):
    """Run simple rule-based controller from Guide 06."""
    all_violations = 0
    all_deviation = 0.0
    all_switches = 0
    all_rewards = 0.0
    hourly_voltages = []

    for day in range(n_days):
        state = env.reset(day_offset=day * 24)
        for t in range(24):
            mean_v = state[:N_BUSES].mean()
            # Rule: raise if low, lower if high, hold otherwise
            if mean_v < 0.97:
                action = 26  # raise all: (+1, +1, +1)
            elif mean_v > 1.03:
                action = 0   # lower all: (-1, -1, -1)
            else:
                action = 13  # hold all: (0, 0, 0)
            state, reward, done, info = env.step(action)
            all_violations += info["violations"]
            all_deviation += info["mean_deviation"]
            all_switches += info["n_switches"]
            all_rewards += reward
            hourly_voltages.append(info["voltages"].mean())

    return {
        "violation_minutes": all_violations * 60,
        "total_deviation": all_deviation,
        "switching_ops": all_switches,
        "total_reward": all_rewards,
        "hourly_voltages": hourly_voltages,
    }

# Run evaluations
env_eval = MultiDeviceVVOEnv(feeder_load, feeder_pv, n_hours=24)

dqn_metrics = evaluate_dqn(agent, env_eval, n_days=30)
rule_metrics = evaluate_rule_based(env_eval, n_days=30)

# Display comparison table
comparison = pd.DataFrame({
    "Metric": ["Violation Minutes", "Total Deviation (loss proxy)",
              "Switching Operations", "Total Reward"],
    "Rule-Based": [
        f"{rule_metrics['violation_minutes']:,.0f}",
        f"{rule_metrics['total_deviation']:.3f}",
        f"{rule_metrics['switching_ops']}",
        f"{rule_metrics['total_reward']:.1f}",
    ],
    "Q-Learning (Guide 06)": [
        "~840", "~4.2", "~95", "~620"  # approximate values from Guide 06 (exact numbers depend on Q-learning training run)
    ],
    "DQN (This Guide)": [
        f"{dqn_metrics['violation_minutes']:,.0f}",
        f"{dqn_metrics['total_deviation']:.3f}",
        f"{dqn_metrics['switching_ops']}",
        f"{dqn_metrics['total_reward']:.1f}",
    ],
})
print("30-Day Evaluation Comparison")
print("=" * 70)
print(comparison.to_string(index=False))

## What You Built and Next Steps

In [None]:
# Plot a sample day comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

hours = range(24)

# Rule-based
ax = axes[0]
ax.plot(hours, rule_metrics["hourly_voltages"][:24], "o-", color="#2D6A7A",
       markersize=5, label="Mean bus voltage")
ax.axhspan(0.95, 1.05, alpha=0.1, color="green")
ax.axhline(1.0, color="gray", linestyle=":", alpha=0.5)
ax.set_title("Rule-Based Controller")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Mean Voltage (p.u.)")
ax.legend()

# DQN
ax = axes[1]
ax.plot(hours, dqn_metrics["hourly_voltages"][:24], "o-", color="#5FCCDB",
       markersize=5, label="Mean bus voltage")
ax.axhspan(0.95, 1.05, alpha=0.1, color="green")
ax.axhline(1.0, color="gray", linestyle=":", alpha=0.5)
ax.set_title("DQN Controller")
ax.set_xlabel("Hour of Day")
ax.legend()

plt.suptitle("VVO Controller Comparison: Day 1", fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Find the most variable PV days (cloud transients)
feeder_pv["date"] = pd.to_datetime(feeder_pv["timestamp"]).dt.date
daily_pv_std = feeder_pv.groupby("date")["clear_sky_factor"].std()
high_var_days = daily_pv_std.nlargest(10)

print("Top 10 highest PV variability days (cloud transients):")
print(high_var_days)

# Evaluate DQN on these challenging days
hard_dqn_violations = []
hard_rule_violations = []

for day_date in high_var_days.index:
    # Find the day offset in the time series
    day_mask = feeder_pv["date"] == day_date
    if day_mask.sum() < 24:
        continue
    day_start = feeder_pv[day_mask].index[0]

    # DQN evaluation
    state = env_eval.reset(day_offset=day_start)
    day_viols_dqn = 0
    for t in range(24):
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
            action = agent.q_network(state_t).argmax(dim=1).item()
        state, _, _, info = env_eval.step(action)
        day_viols_dqn += info["violations"]
    hard_dqn_violations.append(day_viols_dqn)

    # Rule-based evaluation
    state = env_eval.reset(day_offset=day_start)
    day_viols_rule = 0
    for t in range(24):
        mean_v = state[:N_BUSES].mean()
        if mean_v < 0.97:
            action = 26
        elif mean_v > 1.03:
            action = 0
        else:
            action = 13
        state, _, _, info = env_eval.step(action)
        day_viols_rule += info["violations"]
    hard_rule_violations.append(day_viols_rule)

# Compare on hard days
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(hard_dqn_violations))
width = 0.35
ax.bar(x - width/2, hard_rule_violations, width,
     label="Rule-Based", color="#2D6A7A")
ax.bar(x + width/2, hard_dqn_violations, width,
     label="DQN", color="#5FCCDB")
ax.set_xlabel("High-Variability Day (ranked by PV std)")
ax.set_ylabel("Voltage Violations (bus-hours)")
ax.set_title("Generalization Test: DQN vs Rule-Based on Unseen Cloud Transient Days")
ax.legend()
ax.set_xticks(x)
ax.set_xticklabels([f"Day {i+1}" for i in x])
plt.tight_layout()
plt.show()

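rule_avg = np.mean(hard_rule_violations)
dqn_avg = np.mean(hard_dqn_violations)
print(f"\nHigh-variability day results:")
print(f"  Rule-based avg violations: {rule_avg:.1f} bus-hours/day")
print(f"  DQN avg violations:        {dqn_avg:.1f} bus-hours/day")
# Guard against division by zero when the rule-based controller has no violations
if rule_avg > 0:
    print(f"  DQN reduction:             {(1 - dqn_avg / rule_avg) * 100:.0f}%")
else:
    print("  DQN reduction:             n/a (rule-based controller had no violations)")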

In [None]:
# Save trained DQN weights
torch.save(agent.q_network.state_dict(), "vvo_dqn.pt")

# Load: agent.q_network.load_state_dict(torch.load("vvo_dqn.pt"))