### Model Overview

* Input: A Heterogeneous Graph object (data) containing:

    - data['order'].x: Features for orders.

    - data['operator'].x: Features for operators.

    - data['order', 'to', 'order'].edge_attr: Distance/Time between orders.

    - data['operator', 'to', 'order'].edge_attr: Distance/Time from ops to orders.

* Encoder: Distinct MLPs to embed initial features of Orders and Operators into a common hidden dimension.

* Processor (Message Passing): Multi-layer Heterogeneous Graph Attention (HeteroGAT) or HeteroGraphConv. This allows Operators to "see" Orders and Orders to "see" other Orders.

* Decoders (Dual Heads):

    - Activation Head: Predicts probability $P(activate)$ for each operator node.

    - Assignment/Scheduling Head: Predicts a score $P(assign)$ for every (Operator, Order) edge.


### Key Architectural Decisions

* Heterogeneous Graph (HeteroConv):  
    Our problem has two distinct entities: Orders and Operators. Using a homogeneous graph (treating them all as generic "nodes") would force the network to "re-learn" which nodes are which.

    HeteroConv allows us to define specific message-passing rules. For example, ('operator', 'to', 'order') messages explicitly represent "Resource Feasibility," while ('order', 'to', 'order') messages represent "Sequence/Travel Logic."

* Edge Features (edge_attr):  
    The GATv2Conv layers are initialized with edge_dim=hidden_dim, so the travel time ($T_{jk}$ between orders, or $T_{ij}$ from an operator to an order) enters the attention computation directly. The network can learn to "pay less attention" (assign a lower weight) to orders that are physically far away, which mirrors the travel-time term of the objective (weighted by $\alpha$).

* Dual Heads (TO BE EXPLAINED FURTHER):

    - Activation Head: Directly maps the final operator embedding to a probability. Because the embedding has aggregated information from nearby Orders (via order -> operator message passing, which assumes the graph also carries reverse edges from orders to operators), the operator "knows" whether there is high demand nearby and can predict its own activation status ($y_i$).

    - Assignment Head: Instead of just taking a dot product of node embeddings, we concatenate [Op_Emb, Order_Emb, Travel_Time]. This ensures the final decision explicitly accounts for the static distance ($T_{ij}$), which is vital for the Scheduling Loss ($L_{Scheduling}$).

* Scaling: This architecture works on batches of graphs, regardless of graph size. For example, under the "10,000 orders" scaling plan (Cluster -> Solve), we simply feed this model sub-graphs (clusters) of 50-100 orders; the architecture remains exactly the same.
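
The Assignment Head described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual code: the function name `assignment_score`, the weight vector `w`, and the bias `b` are hypothetical, and a single linear layer followed by a sigmoid stands in for whatever MLP the real head uses.

```python
import math

def assignment_score(op_emb, order_emb, travel_time, w, b=0.0):
    """Score one (operator, order) edge from [op_emb, order_emb, travel_time]."""
    # Concatenate both node embeddings with the raw travel time T_ij, so the
    # distance reaches the head directly and not only via message passing.
    features = list(op_emb) + list(order_emb) + [travel_time]
    if len(features) != len(w):
        raise ValueError("weight vector must match the concatenated feature size")
    logit = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> P(assign)

# Toy example: 2-dim embeddings and an assumed negative weight on travel time.
w = [0.5, -0.2, 0.3, 0.1, -1.0]
near = assignment_score([1.0, 0.0], [0.5, 0.5], travel_time=0.2, w=w)
far = assignment_score([1.0, 0.0], [0.5, 0.5], travel_time=3.0, w=w)
```

With these toy weights the nearby order receives the higher score, because the travel time is an explicit input to the head rather than something it has to recover from the embeddings.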

##### How to Train
1. Data: Load a batch of (Graph, Labels) from our MIP dataset.
2. Forward: pred_act, pred_assign = model(graph.x_dict, ...)
3. Loss: loss = weighted_loss(pred_act, batch.y_activation, pred_assign, batch.y_assignment, alpha, beta)
4. Backward: loss.backward()
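
The `weighted_loss` call in step 3 is not defined in these notes. One possible shape, shown here only as a hedged sketch, is a weighted sum of two binary cross-entropy terms. The assumption that `alpha` weights the assignment term and `beta` weights the activation term is ours, as is the helper `bce`; the real loss may be defined differently.

```python
import math

def bce(preds, targets, eps=1e-7):
    """Mean binary cross-entropy over paired probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(preds)

def weighted_loss(pred_act, y_act, pred_assign, y_assign, alpha, beta):
    # ASSUMPTION: alpha weights the assignment (scheduling) term and
    # beta weights the activation term; the actual loss may differ.
    return alpha * bce(pred_assign, y_assign) + beta * bce(pred_act, y_act)

# Toy batch: 2 operators, 3 (operator, order) edges.
loss = weighted_loss([0.9, 0.1], [1, 0], [0.8, 0.3, 0.6], [1, 0, 1],
                     alpha=1.0, beta=0.5)
```

The argument order matches the call in step 3, so the sketch can be swapped for the real loss without changing the training loop.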


### Conditioned Message Passing
To handle $H_{fixed}$ (Capacity), $\alpha$ (Makespan weight), and $\beta$ (Activation weight) explicitly, we need to force the network to mathematically respect these parameters. The best approach is to use Conditioned Message Passing (also called Modulated Attention) and Capacity Masking.

#### The architectural strategy to handle these three variables explicitly:

1. Handling Global Objectives ($\alpha, \beta$): The "Objective Lens"

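    Instead of just feeding $\alpha$ and $\beta$ as inputs, use the objective weights to modulate the computation itself. For $\alpha$, this means modulating the edge weights before message passing starts ($\beta$ is handled by the activation gate in item 2). Think of this as putting on "tinted glasses" that change how the network sees distance.  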

    #### Mechanism: Dynamic Edge Scaling

    In a standard GNN, the edge feature is just $T_{ij}$ (Travel Time). In our Explicit architecture, we transform the edge features based on $\alpha$.

    * Logic: If $\alpha$ (time penalty) is high, long edges should effectively act as "walls" (very low attention scores).

    * Implementation: Inside the Edge Model ($\phi^e$) of the Meta-Layer:

    $$e_{ij}' = \text{MLP}_{\text{edge}}(e_{ij} \cdot (1 + w_1 \cdot \alpha))$$

    Or, more explicitly, in the attention mechanism:

    $$\text{Attention}_{ij} = \text{Softmax}_j \left( \frac{h_i^T W h_j - (\gamma \cdot \alpha \cdot T_{ij})}{\sqrt{d}} \right)$$

    Here $T_{ij}$ is the raw travel time, $w_1$ and $\gamma$ are scaling coefficients, and the term $- (\gamma \cdot \alpha \cdot T_{ij})$ lowers the attention score in proportion to the makespan weight.  
    Result: When $\alpha$ is high, attention on distant nodes is strongly suppressed, which pushes the GNN toward closer neighbors.  
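
The distance-penalized attention formula above can be checked with a few lines of plain Python. This is a sketch for a single node $i$, assuming $W$ is the identity and using a hypothetical function name `penalized_attention`; in the real model this would be a tensor operation inside the attention layer.

```python
import math

def penalized_attention(q, keys, travel_times, alpha, gamma=1.0):
    """Softmax over (q . k_j - gamma * alpha * T_ij) / sqrt(d) for one node i."""
    d = len(q)
    scores = [
        (sum(a * b for a, b in zip(q, k)) - gamma * alpha * t) / math.sqrt(d)
        for k, t in zip(keys, travel_times)
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]  # identical content, so only distance differs
times = [1.0, 5.0]               # the second neighbor is farther away
low = penalized_attention(q, keys, times, alpha=0.1)
high = penalized_attention(q, keys, times, alpha=2.0)
```

Because both neighbors have identical embeddings, any difference in attention comes from the travel-time penalty alone: the far neighbor's share shrinks as `alpha` grows, but it never reaches exactly zero.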

2. Handling Activation Cost ($\beta$): The "Activation Gate"

    $\beta$ dictates the penalty for "waking up" a new operator. We should use this to gate the Activation Head of the network.

    #### Mechanism: Bias Injection

    The final output layer for the Activation Policy ($P_{activate}$) usually looks like Sigmoid(Wx + b). We can explicitly inject $\beta$ into the bias term.

    $$P_{\text{activate}}(i) = \sigma \left( \text{MLP}(h_i) - (\lambda \cdot \beta \cdot I_{i, \text{s\_new}}) \right)$$

    Where $I_{i, \text{s\_new}}$ is 1 if the operator is currently idle and 0 if it is already active, and $\lambda$ is a scaling coefficient.

    Result: As $\beta$ increases, the pre-sigmoid score of an idle operator is shifted downward, which lowers its output probability. The operator's embedding $\mathbf{h}_i$ must be extremely strong (i.e., "I am perfectly positioned for these orders") to overcome the negative $\beta$ bias.  
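
The bias injection above reduces to a one-line change in the activation head. The sketch below assumes the MLP output is already available as a scalar `logit`; the function name `activation_probability` and the default `lam=1.0` (standing in for $\lambda$) are hypothetical.

```python
import math

def activation_probability(logit, beta, is_idle, lam=1.0):
    """sigma(logit - lam * beta * I), where I = 1 for an idle operator."""
    indicator = 1.0 if is_idle else 0.0
    return 1.0 / (1.0 + math.exp(-(logit - lam * beta * indicator)))

# Same MLP output, different activation costs.
p_cheap = activation_probability(1.0, beta=0.5, is_idle=True)
p_costly = activation_probability(1.0, beta=3.0, is_idle=True)
p_active = activation_probability(1.0, beta=3.0, is_idle=False)
```

An already-active operator is unaffected by $\beta$, while an idle one needs a larger logit to reach the same probability as $\beta$ grows.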

3. Handling Capacity ($H_{fixed}$): The "Fuel Gauge"

    $H_{fixed}$ is a hard constraint for the Operator. In a constructive (step-by-step) GNN, simply knowing the static $H_{fixed}$ isn't enough; the model needs to know the Utilization Ratio.  

    #### Mechanism A: Dynamic Feature Engineering (Input Level)

    At every step of the GNN (during the constructive rollout), update the Operator Node features to include:

    $$h_{op} = \left[ \dots, H_{\text{fixed}}, H_{\text{current\_load}}, \frac{H_{\text{current\_load}}}{H_{\text{fixed}}} \right]$$

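    The Ratio ($\frac{H_{\text{current\_load}}}{H_{\text{fixed}}}$): This is the most critical feature. It normalizes the load by the capacity, so operators with different capacities become comparable. The GNN can then learn generic rules like "If Ratio > 0.9, stop assigning."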

    #### Mechanism B: Hard Constraint Masking (Output Level)

    While the GNN learns soft rules, we must enforce hard rules during inference (and training, if using simple Greedy rollout).

    When calculating the Assignment Scores (Output Layer), apply a Validity Mask:

    $$\text{Score}(i,j) = 
    \begin{cases} 
    \text{GNN\_Output}(i,j) & \text{if } (H_{\text{current\_load}}(i) + \text{Size}(j)) \leq H_{\text{fixed}}(i) \\
    -\infty & \text{otherwise}
    \end{cases}$$

    Result: It becomes impossible for the model to assign an order that violates $H_{fixed}$, regardless of what the neural weights say.
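
Both capacity mechanisms can be sketched in plain Python. The helper names `operator_features` and `mask_scores` are hypothetical, the scores are toy values, and $H_{\text{fixed}} > 0$ is assumed; in the real model the same logic would be applied to tensors at each rollout step.

```python
import math

def operator_features(base, h_fixed, h_load):
    """Mechanism A: append [H_fixed, H_current_load, load ratio] to the features."""
    return list(base) + [h_fixed, h_load, h_load / h_fixed]

def mask_scores(scores, h_load, h_fixed, sizes):
    """Mechanism B: set Score(i, j) to -inf when order j would exceed capacity."""
    return [
        [s if h_load[i] + sizes[j] <= h_fixed[i] else -math.inf
         for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]

scores = [[0.9, 0.4], [0.2, 0.7]]  # raw assignment scores, operators x orders
masked = mask_scores(scores, h_load=[8.0, 2.0], h_fixed=[10.0, 10.0],
                     sizes=[3.0, 1.0])
feats = operator_features([0.1], h_fixed=10.0, h_load=8.0)
```

Operator 0 has the highest raw score for order 0, but its load of 8 plus the order size of 3 exceeds the capacity of 10, so that entry becomes $-\infty$ and any softmax or argmax over the masked scores ignores it.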