# Tutorial 3: Neural Models and Classical Policies

**WSmart+ Route Tutorial Series**

This tutorial covers the neural and classical routing policies available in WSmart+ Route. You'll learn:

1. The **ConstructivePolicy** architecture (encoder + decoder)
2. **AttentionModelPolicy** - the primary neural routing model
3. Other neural models: **DeepDecoder**, **PointerNetwork**
4. **Classical policies**: ALNS and HGS
5. **Comparing** neural vs classical approaches

**Previous**: [02_environments.ipynb](02_environments.ipynb) | **Next**: [04_training_with_lightning.ipynb](04_training_with_lightning.ipynb)

In [None]:
import os
import sys
import warnings

warnings.filterwarnings("ignore")

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import matplotlib.pyplot as plt
import numpy as np
import torch

torch.manual_seed(42)
np.random.seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---
## 1. Policy Architecture Overview

WSmart+ Route uses a **constructive** approach to solving routing problems. A policy builds a solution step-by-step:

1. **Encoder**: Process all node features into embeddings (runs once)
2. **Decoder**: At each step, select the next node to visit based on current state and embeddings
3. **Action Selection**: Sample or greedily pick from the probability distribution

```
Input Instance → [Encoder] → Node Embeddings → [Decoder (loop)] → Complete Tour
                                                    ↑
                                              Current State
                                              Action Mask
```
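
The loop in the diagram can be sketched without any neural network. The block below is a toy illustration in plain Python, not the WSmart+ Route implementation: `masked_softmax` and `construct_tour` are hypothetical helpers, the "encoder" is just a pairwise distance matrix computed once, and the "decoder" scores nodes by negative distance instead of attention. It is only meant to show how the action mask and the two selection strategies fit together.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax over feasible entries only; infeasible entries get probability 0."""
    feasible = [l for l, ok in zip(logits, mask) if ok]
    peak = max(feasible)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def construct_tour(coords, strategy="greedy", rng=None):
    """Build a tour from node 0 by repeatedly picking one unvisited node."""
    rng = rng or random.Random(0)
    n = len(coords)
    # "Encoder" stand-in: computed once per instance (here, pairwise distances).
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    tour, visited, log_likelihood = [0], [True] + [False] * (n - 1), 0.0
    while len(tour) < n:
        current = tour[-1]
        # "Decoder" stand-in: score every node given the current state (closer is better).
        logits = [-dist[current][j] for j in range(n)]
        mask = [not v for v in visited]  # action mask: only unvisited nodes are feasible
        probs = masked_softmax(logits, mask)
        if strategy == "greedy":
            nxt = max(range(n), key=lambda j: probs[j])
        else:  # "sampling": draw from the masked distribution
            nxt = rng.choices(range(n), weights=probs, k=1)[0]
        log_likelihood += math.log(probs[nxt])
        tour.append(nxt)
        visited[nxt] = True
    return tour, log_likelihood


rng = random.Random(42)
coords = [(rng.random(), rng.random()) for _ in range(8)]
greedy_tour, greedy_ll = construct_tour(coords, "greedy")
sampled_tour, sampled_ll = construct_tour(coords, "sampling", rng=random.Random(1))
print(greedy_tour, round(greedy_ll, 3))
print(sampled_tour, round(sampled_ll, 3))
```

Greedy decoding is deterministic for a given instance, while sampling can return a different tour on each call, which is what makes drawing several samples and keeping the best one worthwhile.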

In [None]:
from logic.src.envs import get_env
from logic.src.models.policies import AttentionModelPolicy

# Create environment and policy
env = get_env("vrpp", num_loc=20)
policy = AttentionModelPolicy(
    env_name="vrpp",
    embed_dim=64,           # Embedding dimension
    n_encode_layers=2,      # Number of encoder layers
    n_decode_layers=2,      # Number of decoder layers
    n_heads=4,              # Attention heads
    normalization="instance",
    activation="gelu",
)

# Model summary
total_params = sum(p.numel() for p in policy.parameters())
trainable_params = sum(p.numel() for p in policy.parameters() if p.requires_grad)
print("AttentionModelPolicy:")
print(f"  Total parameters:     {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print("  Embedding dim:   64")
print("  Encoder layers:  2")
print("  Attention heads: 4")

In [None]:
# Generate test instances and run forward pass
td = env.generator(batch_size=8)
td = env.reset(td)

# Greedy decoding (deterministic - always picks highest probability node)
with torch.no_grad():
    out_greedy = policy(td.clone(), env, strategy="greedy", return_actions=True)

print("Greedy decoding output:")
print(f"  Keys: {list(out_greedy.keys())}")
print(f"  Reward shape: {out_greedy['reward'].shape}")
print(f"  Mean reward: {out_greedy['reward'].mean():.4f}")
print(f"  Actions shape: {out_greedy['actions'].shape}")
print(f"  Log-likelihood shape: {out_greedy['log_likelihood'].shape}")

In [None]:
# Sampling decoding (stochastic - samples from probability distribution)
with torch.no_grad():
    out_sampling = policy(td.clone(), env, strategy="sampling", return_actions=True)

print("Decoding Strategy Comparison (8 instances):")
print(f"  Greedy   - Mean reward: {out_greedy['reward'].mean():.4f}")
print(f"  Sampling - Mean reward: {out_sampling['reward'].mean():.4f}")
print()

# Multiple samples to find better solutions
rewards_multi = []
n_samples_list = [1, 4, 8, 16, 32]
for n in n_samples_list:
    sample_rewards = []
    for _ in range(n):
        with torch.no_grad():
            out = policy(td.clone(), env, strategy="sampling", return_actions=True)
        sample_rewards.append(out["reward"])
    # Take best reward across samples for each instance
    stacked = torch.stack(sample_rewards, dim=0)
    best_rewards = stacked.max(dim=0).values
    rewards_multi.append(best_rewards.mean().item())

print("Multi-sample improvement:")
for n, r in zip(n_samples_list, rewards_multi):
    print(f"  {n:3d} samples -> Mean best reward: {r:.4f}")

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(n_samples_list, rewards_multi, "o-", linewidth=2, markersize=8, color="steelblue")
ax.axhline(y=out_greedy["reward"].mean().item(), color="red", linestyle="--",
           label=f"Greedy baseline: {out_greedy['reward'].mean():.4f}")
ax.set_xlabel("Number of Samples")
ax.set_ylabel("Mean Best Reward")
ax.set_title("Reward vs Number of Samples (untrained model)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 2. Other Neural Policies

WSmart+ Route provides several neural architectures beyond the standard Attention Model.

In [None]:
from logic.src.models.policies import (
    AttentionModelPolicy,
    DeepDecoderPolicy,
    PointerNetworkPolicy,
)

# Create different policies with same embedding dimension
models = {}

models["AM (Attention Model)"] = AttentionModelPolicy(
    env_name="vrpp", embed_dim=64, n_encode_layers=2, n_decode_layers=2, n_heads=4
)

models["DDAM (Deep Decoder)"] = DeepDecoderPolicy(
    env_name="vrpp", embed_dim=64, n_encode_layers=2, n_decode_layers=4, n_heads=4
)

models["Pointer Network"] = PointerNetworkPolicy(
    env_name="vrpp", embed_dim=64, n_heads=4
)

# Compare parameter counts
print("Neural Policy Comparison:")
print(f"{'Model':<25} {'Parameters':>12}")
print("-" * 40)
for name, model in models.items():
    params = sum(p.numel() for p in model.parameters())
    print(f"{name:<25} {params:>12,}")

In [None]:
# Benchmark all models on same instances
td_bench = env.generator(batch_size=32)

print("\nPerformance Comparison (untrained, 32 instances, greedy):")
print(f"{'Model':<25} {'Mean Reward':>12} {'Std':>8}")
print("-" * 50)

results = {}
for name, model in models.items():
    with torch.no_grad():
        out = model(env.reset(td_bench.clone()), env, strategy="greedy", return_actions=True)
    results[name] = out["reward"]
    print(f"{name:<25} {out['reward'].mean():>12.4f} {out['reward'].std():>8.4f}")

In [None]:
fig, ax = plt.subplots(figsize=(9, 5))
names = list(results.keys())
means = [results[n].mean().item() for n in names]
stds = [results[n].std().item() for n in names]

bars = ax.bar(range(len(names)), means, yerr=stds, capsize=5,
              color=["steelblue", "coral", "seagreen"], alpha=0.8, edgecolor="black")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=15, ha="right")
ax.set_ylabel("Mean Reward (greedy)")
ax.set_title("Neural Policy Comparison (Untrained)")
ax.grid(True, alpha=0.3, axis="y")

for bar, mean in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f"{mean:.3f}", ha="center", va="bottom", fontsize=10)

plt.tight_layout()
plt.show()

---
## 3. Classical Policies

Classical policies use optimization algorithms rather than neural networks. They require no training and typically produce high-quality solutions, but they take longer per instance than a neural forward pass.

| Policy | Type | Description |
|--------|------|-------------|
| `ALNSPolicy` | Metaheuristic | Adaptive Large Neighborhood Search - destroy and repair operators |
| `HGSPolicy` | Genetic Algorithm | Hybrid Genetic Search - evolutionary approach with local search |
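
To make "destroy and repair" concrete, here is a minimal sketch of a large neighborhood search loop in plain Python. It is not the `ALNSPolicy` implementation. It works on a simple closed tour rather than a VRPP instance, uses one random-removal destroy operator and one cheapest-insertion repair operator, and accepts only improvements. The adaptive operator weighting that gives ALNS its name is deliberately left out. All function names are hypothetical.

```python
import math
import random


def tour_length(tour, coords):
    """Length of the closed tour visiting coords in the given order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def destroy(tour, n_remove, rng):
    """Destroy operator: remove n_remove randomly chosen nodes from the tour."""
    removed = rng.sample(tour, n_remove)
    kept = [node for node in tour if node not in removed]
    return kept, removed


def repair(partial, removed, coords):
    """Repair operator: reinsert each removed node at its cheapest position."""
    tour = list(partial)
    for node in removed:
        best_pos = min(
            range(len(tour) + 1),
            key=lambda p: tour_length(tour[:p] + [node] + tour[p:], coords),
        )
        tour.insert(best_pos, node)
    return tour


def large_neighborhood_search(coords, iterations=200, n_remove=3, seed=0):
    """Repeat destroy + repair, keeping a candidate only if it shortens the tour."""
    rng = random.Random(seed)
    best = list(range(len(coords)))  # initial solution: visit nodes in index order
    best_len = tour_length(best, coords)
    for _ in range(iterations):
        partial, removed = destroy(best, n_remove, rng)
        candidate = repair(partial, removed, coords)
        cand_len = tour_length(candidate, coords)
        if cand_len < best_len:  # plain descent: accept improvements only
            best, best_len = candidate, cand_len
    return best, best_len


rng = random.Random(42)
coords = [(rng.random(), rng.random()) for _ in range(10)]
initial_len = tour_length(list(range(10)), coords)
best, best_len = large_neighborhood_search(coords)
print(f"initial: {initial_len:.3f}  after LNS: {best_len:.3f}")
```

Each iteration re-evaluates many candidate insertions, which hints at why search-based policies cost more time per instance than a single neural forward pass.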

In [None]:
from logic.src.models.policies.classical.alns import ALNSPolicy

# Create ALNS policy
alns = ALNSPolicy(env_name="vrpp")

# Solve instances
td_classical = env.generator(batch_size=8)
td_classical_reset = env.reset(td_classical.clone())

with torch.no_grad():
    out_alns = alns(td_classical_reset.clone(), env, strategy="greedy", return_actions=True)

print("ALNS Policy Results (8 instances):")
print(f"  Mean reward: {out_alns['reward'].mean():.4f}")
print(f"  Std reward:  {out_alns['reward'].std():.4f}")

In [None]:
from logic.src.models.policies.classical.hgs import HGSPolicy

# Create HGS policy
hgs = HGSPolicy(env_name="vrpp")

with torch.no_grad():
    out_hgs = hgs(td_classical_reset.clone(), env, strategy="greedy", return_actions=True)

print("HGS Policy Results (8 instances):")
print(f"  Mean reward: {out_hgs['reward'].mean():.4f}")
print(f"  Std reward:  {out_hgs['reward'].std():.4f}")

---
## 4. Neural vs Classical Comparison

Comparing untrained neural models against classical optimization algorithms shows the gap that training must close.
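
One way to put a number on that gap is a relative difference in mean reward. The helper below is a sketch under an assumed definition (gap as a fraction of the absolute reference mean); `relative_gap` is a hypothetical name, not part of WSmart+ Route, and the rewards in the example are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def relative_gap(method_rewards, reference_rewards):
    """Gap of the method's mean reward to the reference mean, as a fraction of |reference|."""
    ref = mean(reference_rewards)
    if ref == 0:
        raise ValueError("reference mean reward is zero; relative gap is undefined")
    return (ref - mean(method_rewards)) / abs(ref)


# Made-up numbers purely for illustration; not results from this notebook.
neural = [1.0, 2.0, 3.0]
classical = [3.0, 4.0, 5.0]
print(f"gap: {relative_gap(neural, classical):.1%}")  # → gap: 50.0%
```

A positive value means the method trails the reference. After the comparison cell has run, the same helper could be applied to the reward tensors via `.tolist()`, for example `relative_gap(all_results["AM (Attention Model)"].tolist(), all_results["HGS"].tolist())`.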

In [None]:
# Comprehensive comparison
all_results = {}

# Neural models (untrained)
for name, model in models.items():
    with torch.no_grad():
        out = model(env.reset(td_classical.clone()), env, strategy="greedy", return_actions=True)
    all_results[name] = out["reward"]

# Classical models
all_results["ALNS"] = out_alns["reward"]
all_results["HGS"] = out_hgs["reward"]

print("Full Comparison (8 instances, greedy decoding):")
print(f"{'Method':<25} {'Mean Reward':>12} {'Std':>8}")
print("-" * 50)
for name, rewards in all_results.items():
    marker = " *" if name in ["ALNS", "HGS"] else ""
    print(f"{name:<25} {rewards.mean():>12.4f} {rewards.std():>8.4f}{marker}")
print("\n* = Classical (no training required)")

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

names = list(all_results.keys())
means = [all_results[n].mean().item() for n in names]
stds = [all_results[n].std().item() for n in names]
colors = ["steelblue", "coral", "seagreen", "orange", "purple"]

bars = ax.bar(range(len(names)), means, yerr=stds, capsize=4,
              color=colors[:len(names)], alpha=0.8, edgecolor="black")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=20, ha="right")
ax.set_ylabel("Mean Reward")
ax.set_title("Neural (Untrained) vs Classical Policies")
ax.grid(True, alpha=0.3, axis="y")

# Add value labels
for bar, mean in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f"{mean:.3f}", ha="center", va="bottom", fontsize=9)

plt.tight_layout()
plt.show()

---
## 5. Model Internals

Let's peek inside the Attention Model to understand its components.

In [None]:
# Inspect the AM policy structure
am = models["AM (Attention Model)"]

print("AttentionModelPolicy Components:")
print("=" * 50)
for name, module in am.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"\n  {name} ({type(module).__name__}):")
    print(f"    Parameters: {n_params:,}")
    for sub_name, sub_module in module.named_children():
        sub_params = sum(p.numel() for p in sub_module.parameters())
        print(f"      {sub_name}: {type(sub_module).__name__} ({sub_params:,} params)")

In [None]:
# Visualize parameter distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, model) in zip(axes, list(models.items())[:3]):
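    # Flatten all weights into one CPU tensor so the histogram also works if a model lives on the GPU
    all_params = torch.cat([p.detach().flatten().cpu() for p in model.parameters()])
    ax.hist(all_params.numpy(), bins=50, alpha=0.7, color="steelblue", edgecolor="white")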
    ax.set_title(f"{name}\n({all_params.numel():,} params)")
    ax.set_xlabel("Parameter Value")
    ax.set_ylabel("Count")
    ax.axvline(x=0, color="red", linestyle="--", alpha=0.5)

plt.suptitle("Parameter Value Distributions (Before Training)", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

---
## Summary

In this tutorial, you learned:

- **ConstructivePolicy** architecture: encoder processes all nodes once, decoder selects nodes step-by-step
- **AttentionModelPolicy** is the primary neural model using multi-head attention
- **DeepDecoderPolicy** uses more decoder layers for richer step-by-step reasoning
- **PointerNetworkPolicy** uses a classic RNN-based pointing mechanism
- **ALNS** and **HGS** are classical optimization policies that need no training
- **Untrained** neural models are expected to perform worse than classical methods; RL training (next tutorial) is what closes this gap
- **Multi-sample** decoding can improve sampling-based solutions

### Key Insight

Neural models start with random weights and produce poor solutions. Through RL training (Tutorial 4), they can learn to match or exceed classical methods while being much faster at inference time.

### Next Steps

Continue to **[Tutorial 4: Training with PyTorch Lightning](04_training_with_lightning.ipynb)** to train these neural models using reinforcement learning.