### GNN Types
There are several common GNN types: GAT, GCN, GraphSAGE, GIN, and MPNN.

* GCN (Graph Convolutional Network): performs convolution on graphs by approximating spectral filtering, enabling nodes to aggregate information from their neighbours.

* GraphSAGE: Samples neighbours and learns aggregation functions
for scalability.

* GAT (Graph Attention Network): Assigns different weights to
neighbours via attention.

* GIN (Graph Isomorphism Network): Maximizes discriminative
power over graph structures.

* MPNN (Message Passing Neural Network): General framework for
molecules and physics with customizable message/update functions.

For our Heterogeneous Warehouse Routing problem (with Orders, Operators, and explicit edge attributes like Travel Time), the GAT (Graph Attention Network), specifically GATv2, is the most suitable choice.

#### Why GAT / GATv2?
* Edge Feature Integration: Our problem relies heavily on Edge Attributes ($T_{ij}$ = Travel Time). Standard GCNs define weights based only on graph topology (degree). GAT layers that accept edge features (such as ```GATv2Conv``` with edge_dim set) compute attention scores $\alpha_{ij}$ from both node features and edge features.  
Logic example: "Operator A should attend strongly to Order B because the edge time $T_{AB}$ is small (fast), not just because it is a neighbor."

* Anisotropic Aggregation: GCN and GraphSAGE (with mean aggregation) weight neighbors with fixed coefficients that do not depend on node or edge features (isotropic). GAT is anisotropic: it learns to prioritize specific important neighbors. In routing, picking the single best next task is crucial; averaging all neighbors washes out the signal.

* GATv2 Improvement (```GATv2Conv```): Standard GAT has a known "static attention" limitation: the ranking of neighbors by attention score is the same for every query node. GATv2 computes dynamic attention, where the ranking can depend on the query node, making it strictly more expressive.
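
To make the isotropic/anisotropic contrast concrete, here is a minimal sketch in plain Python. The neighbour signals, travel times, and the hand-set logits (`-10 * T`) are illustrative assumptions standing in for a learned attention function; they are not taken from the actual model.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_aggregate(values):
    """Isotropic aggregation: every neighbour gets weight 1/N."""
    return sum(values) / len(values)

def attention_aggregate(values, logits):
    """Anisotropic aggregation: neighbours are weighted by softmax(logits)."""
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values)), weights

# Hypothetical operator with four candidate orders.
neighbour_signal = [1.0, 0.0, 0.0, 0.0]  # toy 1-D "embedding" per order
travel_time = [0.10, 0.90, 0.80, 0.95]   # normalized T_ij

# Assumption: hand-set logits; a shorter edge gets a higher logit.
logits = [-10.0 * t for t in travel_time]

iso = mean_aggregate(neighbour_signal)
aniso, weights = attention_aggregate(neighbour_signal, logits)
print(iso, aniso, weights)
```

The mean dilutes the signal of the one nearby order across all four neighbours, while the attention-weighted sum keeps it almost intact.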

#### What about other GNN types?
- GCN (Graph Convolutional Network)
    * Static Weights: GCN edge weights are fixed by the graph structure (e.g., $1/\sqrt{deg_i deg_j}$). It cannot naturally handle "This edge is 5 minutes" vs "This edge is 50 minutes" without awkward hacks.

    * Oversmoothing: GCNs tend to make node embeddings very similar (smoothing) quickly. For routing, we need distinct embeddings to differentiate "Order A is good" from "Order B is bad."

- GraphSAGE
    * Scalability Focus: GraphSAGE is designed for massive graphs (millions of nodes) where we must sample neighbors. Our sub-problems are small (50-100 nodes).

    * Aggregators: Standard GraphSAGE uses Mean or Max pooling. While effective, it lacks the fine-grained, pair-wise interaction (Attention) needed to solve the precise "matching" problem between a specific operator and a specific order.

- GIN (Graph Isomorphism Network)
    * Graph Classification: GIN is designed to distinguish entire graph structures (e.g., "Is this molecule toxic?"). It is less optimized for Node/Edge Level tasks like picking the next edge in a sequence.

    * Sum Aggregation: It relies on Sum aggregation to capture structure, which can be unstable for optimization tasks where magnitude (e.g., total distance) needs to be minimized rather than just counted.

- MPNN (Message Passing Neural Network)
    * Too Generic: MPNN is a framework, not a specific layer. We could build a custom MPNN that handles edge features perfectly.

    * Development Cost: Writing a raw MPNN requires defining custom $\phi$ and $\rho$ functions. GATv2 (available in PyG) essentially implements a highly optimized, attention-based MPNN out of the box.


### GAT vs GATv2
* Original GAT (GATConv) - Static Attention:
    $$\alpha_{ij} \propto \exp \left( \text{LeakyReLU} \left( \mathbf{a}^T [W h_i \parallel W h_j] \right) \right)$$

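    The attention vector $\mathbf{a}$ is applied directly to the concatenated projections, and the LeakyReLU comes last. The score therefore splits into $\mathbf{a}_1^T W h_i + \mathbf{a}_2^T W h_j$ passed through a monotonic function, so the ordering over neighbors $j$ does not depend on $i$.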

    Flaw: The attention scores for node $j$ are effectively "ranked" globally, regardless of who the query node $i$ is. This limits expressiveness (Operator A and Operator B must generally agree on which orders are "good").

* GATv2 (GATv2Conv) - Dynamic Attention:
    $$\alpha_{ij} \propto \exp \left( \mathbf{a}^T \text{LeakyReLU} \left( W [h_i \parallel h_j] \right) \right)$$

    The LeakyReLU is applied before the final dot product with the attention vector $\mathbf{a}$.

    Fix: This small change allows the attention function to be a "universal approximator." Operator A can now strongly prefer Order 1, while Operator B strongly prefers Order 2, purely based on their unique interactions.
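
The difference between the two scoring functions can be checked on a toy case. The sketch below implements both unnormalized scores in plain Python with 1-D node features; the parameter values (`a_gat`, `W_gat`, `a_v2`, `W_v2`) are hand-picked assumptions for illustration, not trained weights.

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def gat_score(a, W, h_i, h_j):
    """GAT: LeakyReLU(a^T [W h_i || W h_j])."""
    projected = matvec(W, h_i) + matvec(W, h_j)  # list concatenation
    return leaky_relu(dot(a, projected))

def gatv2_score(a, W, h_i, h_j):
    """GATv2: a^T LeakyReLU(W [h_i || h_j])."""
    hidden = matvec(W, h_i + h_j)  # W acts on the concatenated pair
    return dot(a, [leaky_relu(z) for z in hidden])

def best_key(score_fn, a, W, query, keys):
    """Index of the highest-scoring key (softmax preserves this order)."""
    return max(range(len(keys)), key=lambda j: score_fn(a, W, query, keys[j]))

queries = [[1.0], [-1.0]]  # two hypothetical operators
keys = [[1.0], [-1.0]]     # two hypothetical orders

# Assumed, hand-picked parameters.
a_gat, W_gat = [0.5, 1.0], [[1.0]]
a_v2, W_v2 = [1.0, 1.0], [[1.0, 1.0], [-1.0, -1.0]]

gat_best = [best_key(gat_score, a_gat, W_gat, q, keys) for q in queries]
gatv2_best = [best_key(gatv2_score, a_v2, W_v2, q, keys) for q in queries]
print(gat_best, gatv2_best)
```

With these parameters both operators pick the same order under GAT, while under GATv2 each operator picks a different one. For GAT this holds for any choice of parameters, because the query term only shifts all scores by the same amount before a monotonic function.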


### Model Overview

* Input: A Heterogeneous Graph object (data) containing:

    - data['order'].x: Features for orders.

    - data['operator'].x: Features for operators.

    - data['order', 'to', 'order'].edge_attr: Distance/Time between orders.

    - data['operator', 'to', 'order'].edge_attr: Distance/Time from ops to orders.

* Global Context Fusion: The u vector (Alpha, Beta, H_fixed) is concatenated to the final node embeddings, ensuring that the final decision heads know the "rules of the game" (e.g., how expensive delay is vs. distance).

* Encoder: Distinct MLPs to embed initial features of orders and operators into a common hidden dimension. It uses HeteroConv message passing with ```GATv2Conv``` to update both Operator and Order embeddings by exchanging messages, allowing operators to "see" orders and orders to "see" other orders.

    Note: the encoder explicitly handles edge_attr (Processing Time & Travel Time) by passing edge_dim=1 to the GATv2Conv layers.


* Decoders (Triple Heads):
    - Activation Head: Predicts which operators to use ($P(y^{act})$).
    - Assignment Head: Predicts valid operator $\to$ order edges ($P(y^{assign})$).
    - Sequence Head: Predicts valid order $\to$ order edges ($P(y^{seq})$).



### Key Architectural Decisions

* Heterogeneous Graph (HeteroConv):  
    Our problem has two distinct entities: Orders and Operators. Using a homogeneous graph (treating them all as generic "nodes") would force the network to "re-learn" which nodes are which.

    HeteroConv allows us to define specific message-passing rules. For example, ('operator', 'to', 'order') messages explicitly represent "Resource Feasibility," while ('order', 'to', 'order') messages represent "Sequence/Travel Logic."

* Edge Features (edge_attr):  
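    Crucially, the GATv2Conv layers are initialized with edge_dim set to the width of edge_attr (edge_dim=1 when the scalar time is passed directly, as in the Model Overview; edge_dim=hidden_dim only if the edge features are first embedded to the hidden size). This allows the network to use the Travel Time ($T_{jk}$ or $T_{ij}$) as a core part of the attention mechanism. The network can learn to "pay less attention" (lower weight) to orders that are physically far away, mimicking the logic of minimizing travel time (the term weighted by $\alpha$).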

* Dual Heads (TO BE EXPLAINED FURTHER):

    - Activation Head: Directly maps the final operator embedding to a probability. Since the embedding has aggregated information from nearby Orders (via the order -> operator layers), the operator "knows" if there is high demand nearby, allowing it to accurately predict its own activation status ($y_i$).

    - Assignment Head: Instead of just dot-producting node embeddings, we concatenate [Op_Emb, Order_Emb, Travel_Time]. This ensures the final decision explicitly accounts for the static distance ($T_{ij}$), which is vital for the Scheduling Loss ($L_{Scheduling}$).

* Scaling: This architecture works on batches of graphs, regardless of graph size. For example, for the "10,000 orders" scaling plan (Cluster -> Solve), we simply feed this model sub-graphs (clusters) of 50-100 orders; the architecture remains exactly the same.
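
The Assignment Head's choice to concatenate [Op_Emb, Order_Emb, Travel_Time] instead of using a dot product can be illustrated with a small sketch. A single linear layer stands in for the head's MLP, and the embeddings, `weights`, and `bias` are made-up values for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_score(op_emb, order_emb):
    """Baseline decoder: sees only the two embeddings."""
    return sigmoid(dot(op_emb, order_emb))

def concat_score(op_emb, order_emb, travel_time, weights, bias):
    """Concatenation decoder: scores [Op_Emb, Order_Emb, Travel_Time].
    Assumption: one linear layer stands in for the real MLP."""
    features = op_emb + order_emb + [travel_time]
    return sigmoid(dot(weights, features) + bias)

op_emb = [0.5, -0.2]
order_emb = [0.3, 0.8]
# Hypothetical weights; the last one penalizes long travel times.
weights = [0.4, 0.1, 0.6, 0.2, -3.0]
bias = 0.0

near = concat_score(op_emb, order_emb, 0.1, weights, bias)
far = concat_score(op_emb, order_emb, 0.9, weights, bias)
baseline = dot_product_score(op_emb, order_emb)
print(near, far, baseline)
```

For identical embeddings, the dot-product score cannot change with travel time, whereas the concatenation score drops as $T_{ij}$ grows.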

##### How to Train
1. Data: Load a batch of (Graph, Labels) from our MIP dataset.
2. Forward: pred_act, pred_assign = model(graph.x_dict, ...)
3. Loss: loss = weighted_loss(pred_act, batch.y_activation, pred_assign, batch.y_assignment, alpha, beta)
4. Backward: loss.backward()
5. Update: optimizer.step(), then optimizer.zero_grad() before the next batch
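
The body of weighted_loss is not defined in these notes. As a sketch under stated assumptions, one plausible form is a weighted sum of two binary cross-entropy terms, with $\beta$ scaling the activation term and $\alpha$ scaling the assignment term. Both the pairing of weights to terms and the use of BCE are assumptions; the real loss may differ.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def weighted_loss(pred_act, y_act, pred_assign, y_assign, alpha, beta):
    """ASSUMED form: beta weights activation, alpha weights assignment."""
    return (beta * binary_cross_entropy(pred_act, y_act)
            + alpha * binary_cross_entropy(pred_assign, y_assign))

# Toy predictions for 2 operators and 3 operator->order edges.
good = weighted_loss([0.9, 0.1], [1, 0], [0.8, 0.2, 0.7], [1, 0, 1], 1.0, 1.0)
bad = weighted_loss([0.1, 0.9], [1, 0], [0.2, 0.8, 0.3], [1, 0, 1], 1.0, 1.0)
print(good, bad)
```

In the real training loop the same computation would operate on tensors so that loss.backward() can propagate gradients.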


### Message Passing
Let $G = (V, E)$ be a graph, where $V$ is its node set and $E$ its edge set. Let $N_{u}$ be the neighborhood of node $u \in V$. Additionally, let $x_{u}$ be the features of node $u \in V$, and $e_{uv}$ the features of edge $(u,v) \in E$.

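Then the general form of message passing between nodes can be written, in its standard formulation, as:

$$h_u = \phi \left( x_u, \bigoplus_{v \in N_u} \psi(x_u, x_v, e_{uv}) \right)$$

where $\psi$ is the message function, $\bigoplus$ is a permutation-invariant aggregator (e.g., sum, mean, or max), and $\phi$ is the update function. Illustrations are available in [GATs](https://www.baeldung.com/cs/graph-attention-networks).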
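
A minimal sketch of one message-passing round over the notation above ($x_u$, $e_{uv}$, $N_u$), using scalar features and sum aggregation. The function name, the toy graph, and the particular message and update functions are illustrative assumptions.

```python
def message_passing_step(x, edges, edge_attr, message_fn, update_fn):
    """One round: h_u = update(x_u, sum over v in N_u of message(x_u, x_v, e_uv)).

    x:         dict node -> scalar feature
    edges:     list of (v, u) pairs, meaning v sends a message to u
    edge_attr: list of scalar edge features, aligned with edges
    """
    aggregated = {u: 0.0 for u in x}
    for (v, u), e_uv in zip(edges, edge_attr):
        aggregated[u] += message_fn(x[u], x[v], e_uv)
    return {u: update_fn(x[u], aggregated[u]) for u in x}

# Toy graph: two orders send messages to one operator.
x = {"op": 1.0, "o1": 2.0, "o2": 4.0}
edges = [("o1", "op"), ("o2", "op")]
edge_attr = [0.5, 0.25]

h = message_passing_step(
    x, edges, edge_attr,
    message_fn=lambda x_u, x_v, e: x_v * e,  # message scaled by the edge feature
    update_fn=lambda x_u, m: x_u + m,        # residual-style update
)
print(h)
```

Nodes without incoming edges keep their own feature, since their aggregated message is zero.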


### Conditioned Message passing 
To handle $H_{fixed}$ (Capacity), $\alpha$ (Makespan weight), and $\beta$ (Activation weight) explicitly, we need to force the network to mathematically respect these parameters. The best approach is to use Conditioned Message Passing (also called Modulated Attention) and Capacity Masking.

#### The architectural strategy to handle these three variables explicitly:

1. Handling Global Objectives ($\alpha, \beta$): The "Objective Lens"

    Instead of just feeding $\alpha$ and $\beta$ as inputs, use them to modulate the edge weights before the message passing starts. Think of this as putting on "tinted glasses" that change how the network sees distance.  

    #### Mechanism: Dynamic Edge Scaling

    In a standard GNN, the edge feature is just $T_{ij}$ (Travel Time). In our Explicit architecture, we transform the edge features based on $\alpha$.

    * Logic: If $\alpha$ (time penalty) is high, long edges should effectively act as "walls" (very low attention scores).

    * Implementation: Inside the Edge Model ($\phi^e$) of the Meta-Layer:

    $$e_{ij}' = \text{MLP}_{\text{edge}}(e_{ij} \cdot (1 + w_1 \cdot \alpha))$$
    Or, more explicitly in the Attention Mechanism:
    $$\text{Attention}_{ij} = \text{Softmax} \left( \frac{h_i^T W h_j - (\gamma \cdot \alpha \cdot T_{ij})}{\sqrt{d}} \right)$$

    where $T_{ij}$ is the raw travel time and $\gamma$ is a scaling coefficient. The term $- (\gamma \cdot \alpha \cdot T_{ij})$ explicitly subtracts from the attention score in proportion to the makespan weight.  
    Result: When $\alpha$ is high, attention on distant nodes is strongly suppressed, pushing the GNN to pick closer neighbors.  

2. Handling Activation Cost ($\beta$): The "Activation Gate"

    $\beta$ dictates the penalty for "waking up" a new operator. We should use this to gate the Activation Head of the network.

    #### Mechanism: Bias Injection

    The final output layer for the Activation Policy ($P_{activate}$) usually looks like Sigmoid(Wx + b). We can explicitly inject $\beta$ into the bias term.

    $$P_{\text{activate}}(i) = \sigma \left( \text{MLP}(h_i) - (\lambda \cdot \beta \cdot I_{i, \text{s\_new}}) \right)$$

    Where $I_{i, \text{s\_new}}$ is 1 if the operator is currently idle, 0 if already active.

    Result: As $\beta$ increases, the pre-sigmoid score is reduced by $\lambda \cdot \beta$, which pushes the output probability downward. The operator's embedding $\mathbf{h}_i$ must be extremely strong (i.e., "I am perfectly positioned for these orders") to overcome the negative $\beta$ bias.  

3. Handling Capacity ($H_{fixed}$): The "Fuel Gauge"

    $H_{fixed}$ is a hard constraint for the Operator. In a constructive (step-by-step) GNN, simply knowing the static $H_{fixed}$ isn't enough; the model needs to know the Utilization Ratio.  

    #### Mechanism A: Dynamic Feature Engineering (Input Level)

    At every step of the GNN (during the constructive rollout), update the Operator Node features to include:

    $$h_{op} = \left[ \dots, H_{\text{fixed}}, H_{\text{current\_load}}, \frac{H_{\text{current\_load}}}{H_{\text{fixed}}} \right]$$

    The Ratio ($\frac{H_{load}}{H_{fixed}}$): This is the most critical feature. It normalizes the capacity. The GNN learns generic rules like "If Ratio > 0.9, stop assigning."

    #### Mechanism B: Hard Constraint Masking (Output Level)

    While the GNN learns soft rules, we must enforce hard rules during inference (and training, if using simple Greedy rollout).

    When calculating the Assignment Scores (Output Layer), apply a Validity Mask:

    $$\text{Score}(i,j) = 
    \begin{cases} 
    \text{GNN\_Output}(i,j) & \text{if } (H_{\text{current\_load}}(i) + \text{Size}(j)) \leq H_{\text{fixed}}(i) \\
    -\infty & \text{otherwise}
    \end{cases}$$

    Result: It becomes impossible for the GNN to assign an order that violates $H_{fixed}$, regardless of what the neural weights say.
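
The three mechanisms above can be sketched together in plain Python. The function names, the default values of `gamma` and `lam`, and the toy inputs are assumptions for illustration; in the model these operations would run on tensors inside the attention layer and the decoder heads.

```python
import math

def softmax(logits):
    m = max(logits)  # requires at least one finite logit
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def penalised_attention(similarity, travel_time, alpha, gamma=1.0, d=1):
    """Mechanism 1: softmax over j of (sim_ij - gamma*alpha*T_ij) / sqrt(d)."""
    logits = [(s - gamma * alpha * t) / math.sqrt(d)
              for s, t in zip(similarity, travel_time)]
    return softmax(logits)

def activation_probability(mlp_output, beta, is_idle, lam=1.0):
    """Mechanism 2: sigmoid(MLP(h_i) - lam*beta*I), with I = 1 for an idle operator."""
    return sigmoid(mlp_output - lam * beta * (1.0 if is_idle else 0.0))

def masked_scores(scores, current_load, order_sizes, capacity):
    """Mechanism 3B: -inf for orders that would exceed H_fixed."""
    return [s if current_load + size <= capacity else -math.inf
            for s, size in zip(scores, order_sizes)]

# Mechanism 1: equal similarity, one near and one far order.
att_low = penalised_attention([1.0, 1.0], [0.1, 0.9], alpha=0.1)
att_high = penalised_attention([1.0, 1.0], [0.1, 0.9], alpha=10.0)

# Mechanism 2: the same embedding score under different beta values.
p_free = activation_probability(1.0, beta=0.0, is_idle=True)
p_costly = activation_probability(1.0, beta=3.0, is_idle=True)
p_active = activation_probability(1.0, beta=3.0, is_idle=False)

# Mechanism 3B: the best raw score belongs to an order that does not fit.
masked = masked_scores([2.0, 0.5, 1.0], current_load=8,
                       order_sizes=[3, 1, 2], capacity=10)
assign_probs = softmax(masked)
print(att_low, att_high, p_free, p_costly, p_active, assign_probs)
```

Note that the mask needs at least one feasible order per operator; if every score is $-\infty$, the softmax is undefined and the caller has to handle that case separately.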

### Data Loading and Preprocessing

#### Normalization
We normalize travel and processing times (dividing by the max value to put them in the range [0, 1]) for Neural Network Stability, not because of the physics of the warehouse problem.  
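
A minimal sketch of this max-scaling step. The helper names are hypothetical, and the sketch assumes that the scaling constant is computed once from the training data and then reused, so that training and inference see the same scale.

```python
def fit_max_scale(train_times):
    """Return the scaling constant: the largest time seen in the training data."""
    scale = max(train_times)
    if scale <= 0:
        raise ValueError("times must contain at least one positive value")
    return scale

def normalise(times, scale):
    """Divide by the stored constant; training-range values land in [0, 1]."""
    return [t / scale for t in times]

train_times = [300.0, 600.0, 1200.0]  # minutes
scale = fit_max_scale(train_times)
normalised = normalise(train_times, scale)
print(scale, normalised)
```

Values larger than the training maximum would map above 1, so the constant should be chosen from data that covers the expected range.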

##### Raw values (e.g., 300 minutes, 1200 minutes) might break GNN training for these reasons:
1. The "Vanishing Gradient" in Attention (Softmax)  

    Our architecture uses Graph Attention (GAT). The core equation for attention coefficients is:
    $$\alpha_{ij} = \text{Softmax} \left( \text{LeakyReLU} \left( \mathbf{a}^T [W h_i \parallel W h_j \parallel \text{Time}_{ij}] \right) \right)$$

    * The Softmax function involves exponentiation ($e^x$).

    * If we feed raw minutes (e.g., $Time = 500$) into the network, the dot product inside the Softmax can become huge (e.g., 500).

    * $e^{500}$ is astronomically large (Infinity in float32).

    * Result: The Softmax "saturates." One neighbor gets probability 1.0, all others get 0.0. The gradients become zero. The network stops learning.

    By normalizing to [0, 1], the inputs stay in a "safe zone" (e.g., -5 to +5 after weights), keeping gradients flowing.

2. Feature Dominance  
    Our node features (Priority, coordinates) are normalized to [0, 1] by ```MinMaxScaler```. If edge features (Time) are raw (e.g., 0 to 1000), the Time feature can be up to 1000x larger than the Priority feature.  
    As a result, during backpropagation the network will focus only on the massive Time values and treat Priority/Location as noise (effectively zero). Normalizing ensures all features contribute on a comparable scale during the initial learning phase.
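
Both effects can be reproduced numerically. In the sketch below the times are used directly as softmax logits and the features are combined with equal unit weights; both are simplifying assumptions standing in for the learned projection, and the numbers are toy values.

```python
import math

def softmax(logits):
    m = max(logits)  # stable softmax: no overflow, but it can still saturate
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_grad_diag(probs):
    """Diagonal of the softmax Jacobian: p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in probs]

def contribution_share(features):
    """Share of each feature in a sum with equal unit weights."""
    total = sum(abs(f) for f in features)
    return [abs(f) / total for f in features]

# 1. Softmax saturation: raw minutes vs. max-normalized times.
raw_times = [300.0, 500.0, 1200.0]
norm_times = [t / max(raw_times) for t in raw_times]

p_raw = softmax(raw_times)
p_norm = softmax(norm_times)
grad_raw = softmax_grad_diag(p_raw)
grad_norm = softmax_grad_diag(p_norm)

# 2. Feature dominance: [priority, time] with raw vs. normalized time.
share_raw = contribution_share([0.8, 1000.0])
share_norm = contribution_share([0.8, 1000.0 / 1000.0])
print(p_raw, p_norm, share_raw, share_norm)
```

With raw minutes, one neighbour takes essentially all the probability mass and every gradient entry collapses toward zero; after normalization the distribution stays spread out and the two features contribute on a similar scale.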



### References
* https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book.pdf Graph Representation Learning
* https://arxiv.org/abs/1710.10903 Graph Attention Networks
* https://arxiv.org/abs/2105.14491 How Attentive are Graph Attention Networks?