## Multi-criteria Objective Function

### MIP Problem Definition
* Sets: $J$ (Orders/Missions), $I_{max}$ (Max number of operators), $U$ (Order types).
Note that $J$ can be extended to $J' = J \cup \{0\}$ to include the initial virtual Base node.
* Variables: $x_{ij}$ (Assignment), $S_{j}$ (Start Time), $C_{j}$ (Completion Time), $y_{i}$ (Operator Activation), $S_{i, first}$ (Operator Start Time), $C_{i, last}$ (Operator Completion Time), $Z$ (Makespan).  
Sequencing variable: $z_{ikj}$ (operator i goes from order k to order j).
* Parameters:
  - $P_{ij}$ (Processing time) => time for operator i to load and unload pallet j.
  - $T_{jk}$ (Travel time from j to k) => calculated for every possible pair of orders j and k.
  - $Q_{iu}$ (Skill score) => score expressing that forklift i fits the pallet dimensions of order type u and is enabled to handle it.
  - $O_{ju}$ (Order type) => encodes the pallet/mission type of each order.
  - $H_{fixed}$ (Work hours per operator/machine)
  - $M$ (Big number)

*Note that all units should be standardized (shift time, the operator speed that affects the processing time, ...): distances are measured in meters, times in minutes, and speeds in meters/minute.  
*Note that minimizing the unused fork space would open another optimization question by adding another criterion to our objective function. As a simplification, we ignore this condition.
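The unit convention above can be illustrated with a minimal sketch. It assumes straight-line distances between hypothetical order positions (a real warehouse would use a path-based distance), and the helper names `to_m_per_min` and `travel_time_minutes` are made up for this example.

```python
import math

def to_m_per_min(speed_km_per_h):
    """Convert km/h to meters/minute so that all speeds share one unit."""
    return speed_km_per_h * 1000.0 / 60.0

def travel_time_minutes(a, b, speed_m_per_min):
    """Travel time in minutes between two (x, y) points given in meters.

    Assumption: straight-line (Euclidean) distance, no obstacles.
    """
    if speed_m_per_min <= 0:
        raise ValueError("speed must be positive")
    return math.dist(a, b) / speed_m_per_min

# Hypothetical data: Base node 0 at [0, 0] plus two order positions (meters).
nodes = {0: (0.0, 0.0), 1: (30.0, 40.0), 2: (60.0, 80.0)}
speed = to_m_per_min(6.0)  # 6 km/h expressed in meters/minute

# T[(j, k)] plays the role of the travel-time parameter T_jk (minutes).
T = {(j, k): travel_time_minutes(nodes[j], nodes[k], speed)
     for j in nodes for k in nodes if j != k}
```

Keeping the conversion in one helper makes it harder to mix hours and minutes when the processing times are added later.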

#### Constraints and Objective function

1. Each order j is assigned to a single operator i
$$\sum_{i \in I_{max}} x_{i j} = 1 \quad \forall j \in J$$

2. The type of order j should be handled by an operator i that has skill score >= 1 for such order
 $$\sum_{u \in U} O_{j u} \cdot Q_{i, u} \ge x_{i j} \quad \forall i \in I_{max}, j \in J$$

3. The linking between order k and order j over the same operator i
 $$\sum_{k \in J \cup \{\text{Base}\}} z_{i k j} = x_{i j} \quad \forall i \in I_{max}, j \in J$$
(The operator i must travel to j if and only if i is assigned to j.)

4. Flow conservation: for any order j, the number of flow-outs (j->k) should be equal to the number of flow-ins (l->j) of the same operator i
$$\sum_{k \in J \cup \{\text{Base}\}} z_{i j k} = \sum_{l \in J \cup \{\text{Base}\}} z_{i l j} \quad \forall i \in I_{max}, j \in J$$
(Including both directions towards the base)

5. Time sequencing (the start time of order k is after the completion time of the previous linked order j + the travel time to reach order k)
$$S_k \ge C_j + T_{j k} - M \cdot (1 - z_{i j k}) \quad \forall i \in I_{max}, j, k \in J, j \ne k$$
(M is a big number that activates the constraint when order j is linked to order k, i.e. $z_{i j k} = 1$; otherwise, the right-hand side becomes a very large negative number and the constraint is trivially satisfied.)  

6. The completion time of order j is the time to start it + the processing time w.r.t. the assigned operator i
$$C_j = S_j + \sum_{i \in I_{max}} x_{i j} \cdot P_{i j} \quad \forall j \in J$$

7. Common MTZ (Miller-Tucker-Zemlin) constraints to prevent isolated loops/subtours
    * Ordering constraint:
      $$u_{i j} - u_{i k} + |J| \cdot z_{i j k} \leq |J| - 1 \quad \forall i, j \neq k$$
      It forces strictly increasing order indices along used arcs, thereby eliminating subtours. If operator i goes from mission j to mission k, the rank of k must be at least 1 greater than the rank of j. This is the ordering effect: we can only move "forward" in the sequence along a used arc.
    * Assignment constraint:
      $$u_{i j} \ge z_{i 0 j}$$
      It prevents “ghost” ordering values on missions that an operator does not actually visit, which keeps the MTZ logic correct and the relaxation tighter.
    * Anchoring constraint:
      $$u_{i j} \geq 1 \cdot z_{i 0 j} \quad \forall i, j$$
      It ensures that the first mission after the base has a positive rank, anchoring the MTZ ordering scale and helping rule out degenerate assignments where a mission appears to be visited but has rank 0.

(Where $u_{i j}$ and $u_{i k}$ are helper variables representing the sequence positions of j and k on the route of operator i,   
and $2 \le u_{i j} \le |J|+1$.)

8. An order j can be assigned to operator i only if operator i is activated $(y_{i}=1)$.
$$x_{i j} \le y_{i} \quad \forall i \in I_{max}, j \in J$$

9. The time elapsed between an operator's first assigned task and his last assigned task must not exceed the fixed shift length $H_{fixed}$.
$$C_{i,last} - S_{i,first} \le H_{fixed} \cdot y_{i} \quad \forall i \in I_{max}$$

    - $C_{i,last}$: The completion time of last task per operator i.

    - $S_{i,first}$: The start time of first task per operator i.

    - Note that: $P_{i,0}=0$, $S_{0}=0, C_{0}=0$

    - (This requires additional constraints to define $C_{i,last}$ and $S_{i,first}$ based on the $z_{ijk}$ sequencing variables and a virtual Base node):
      * $S_{i,first}=S_{k}$ $\text{if } z_{i,0,k}=1$
      * $C_{i,last}=C_{j}+T_{j,0}$ $\text{if } z_{i,j,0}=1$

10. Each activated operator ($y_{i}=1$) makes exactly one trip that ends at the Base (k=0), and an unactivated operator makes none.
$$\sum_{j \in J} z_{ij0} = y_{i} \quad \forall i \in I_{max}$$

11. Symmetry-breaking rule: to reduce the number of interchangeable solutions/schedules w.r.t. the set of operators, we enforce an ordering on operator activation, so that operator i can be activated (and given a mission schedule) only if the previous operator i-1 is already activated.
$$y_{i} \leq y_{i-1} \quad \forall i \in I_{max}, i \neq i_{1}$$

12. Objective function (multi-objective): minimize the overall time to complete all orders (Makespan) AND minimize the fixed cost of labor (the number of operators used).
$$\text{Minimize } \; \alpha \cdot Z + \beta \cdot \sum_{i \in I_{max}} y_{i}$$

    - $Z$: The Makespan (maximum completion time, weighted by α).

    - $\sum_{i\in I_{max}}y_{i}$: The total number of operators used (weighted by β).

    - $\alpha, \beta$: Weighting factors we choose to define the trade-off. (α high = minimize time; β high = minimize staff).

    - $I_{max}$: The set of all available operators we could potentially use (including those that end up unactivated).

*The Base order/mission is at position [0, 0], from which all operators should start the day.  
*The model has O(operators * missions²) sequencing variables $z_{ikj}$.
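To make the constraints concrete, the following sketch checks a hand-written candidate schedule and evaluates the weighted objective. It is not a MIP solver. It assumes that each route is given as an ordered list of orders, that skills are stored as sets of order types instead of the $Q$ and $O$ matrices, that constraint 5 holds with equality (no waiting), and that the makespan is the largest order completion time. All data values and the name `evaluate` are hypothetical.

```python
def evaluate(routes, P, T, O, Q, H_fixed, alpha, beta):
    """Validate a candidate schedule; return (objective, makespan, operators used).

    routes: {operator: [order, ...]} in visiting order; Base (node 0) is implied.
    P[i][j]: processing time, T[(j, k)]: travel time,
    O[j]: type of order j, Q[i]: set of types operator i can handle.
    """
    seen, makespan, used = set(), 0.0, 0
    for i, route in routes.items():
        if not route:
            continue                          # y_i = 0: operator not activated
        used += 1
        prev, clock, first_start = 0, 0.0, None   # S_0 = C_0 = 0 at the Base
        for j in route:
            if j in seen:                     # constraint 1: single assignment
                raise ValueError(f"order {j} assigned twice")
            if O[j] not in Q[i]:              # constraint 2: skill match
                raise ValueError(f"operator {i} cannot handle order {j}")
            seen.add(j)
            start = clock + T[(prev, j)]      # constraint 5, taken with equality
            clock = start + P[i][j]           # constraint 6: C_j = S_j + P_ij
            if first_start is None:
                first_start = start           # S_{i,first}
            makespan = max(makespan, clock)
            prev = j
        last = clock + T[(prev, 0)]           # C_{i,last} includes the return trip
        if last - first_start > H_fixed:      # constraint 9: shift length
            raise ValueError(f"operator {i} exceeds the shift length")
    if seen != set(O):
        raise ValueError("some orders are unassigned")
    return alpha * makespan + beta * used, makespan, used

# Hypothetical instance: 3 orders, 2 operators, times in minutes.
T = {(0, 1): 2, (1, 0): 2, (0, 2): 3, (2, 0): 3, (0, 3): 4, (3, 0): 4,
     (1, 2): 1, (2, 1): 1, (1, 3): 5, (3, 1): 5, (2, 3): 2, (3, 2): 2}
P = {"A": {1: 5, 2: 5, 3: 5}, "B": {1: 6, 2: 6, 3: 6}}
O = {1: "std", 2: "std", 3: "long"}
Q = {"A": {"std"}, "B": {"std", "long"}}
routes = {"A": [1, 2], "B": [3]}

objective, makespan, used = evaluate(routes, P, T, O, Q, H_fixed=60,
                                     alpha=1, beta=10)
```

A checker of this kind can also be used to validate schedules produced by a learned model before they are accepted.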

### The trade-off between Makespan and Operator activation
* With more active operators: Missions are split across many routes, each route is shorter, so the makespan decreases. But $\sum_i y_i$ increases (more operators used).

* With fewer active operators: Missions are concentrated on few routes, and these routes get close to the shift capacity $H_{fixed}$, so makespan increases. But $\sum_i y_i$ decreases (lower labor / resource cost).
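The trade-off can be seen on a toy example. The sketch below is a simplification, not the MIP: it ignores travel times and $H_{fixed}$, splits hypothetical mission durations over k operators with a greedy longest-processing-time rule, and picks the k that minimizes $\alpha \cdot Z + \beta \cdot k$.

```python
import heapq

def lpt_makespan(durations, k):
    """Greedy longest-processing-time split of missions over k operators."""
    loads = [0.0] * k
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # Always give the next-longest mission to the least-loaded operator.
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

def best_operator_count(durations, i_max, alpha, beta):
    """Return the operator count k in 1..i_max with the lowest weighted cost."""
    scored = [(alpha * lpt_makespan(durations, k) + beta * k, k)
              for k in range(1, i_max + 1)]
    return min(scored)[1]

durations = [30, 30, 20, 20, 10, 10]   # hypothetical mission durations (minutes)

k_time_focused = best_operator_count(durations, i_max=4, alpha=1, beta=1)
k_balanced = best_operator_count(durations, i_max=4, alpha=1, beta=25)
k_staff_focused = best_operator_count(durations, i_max=4, alpha=0.1, beta=10)
```

With these numbers, a low β favors activating all four operators, while a high β pushes the choice toward a single operator.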

### GNN Problem Definition

The graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ must capture all orders, all potential resources, and all spatial relationships necessary for the GNN to make an informed, multi-objective decision.

* Nodes ($\mathcal{V}$)
  - Order Nodes $J$ (up to 10,000): Initial Features ($h_v$) => Location coordinates $(x_j, y_j)$, Required Skill ID, Processing Time ($P_{ij}$).
  - Operator Nodes $I_{max}$ (Pool): Initial Features ($h_v$) => Current Location, Operator Skill Set, Fixed Shift Capacity ($H_{fixed}$), Status (Idle/Busy).

* Edges ($\mathcal{E}$)
  - Order-Order ($\boldsymbol{e_{jk}}$) featured with "Travel Time $T_{jk}$": Represents the static distance map of the warehouse (ex. calculated with A*).
  - Operator-Order ($\boldsymbol{e_{ij}}$) featured with current "Travel Time $T_{i, j}$": Represents the dynamic accessibility of $\text{Order}_j$ by $\text{Operator}_i$ from their current position.
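As a rough sketch of this graph definition, the nodes and edges can be held in plain dictionaries before being converted to the tensors of a GNN library. The function name `build_graph`, the field names, and the straight-line travel time are assumptions made for this example (the text above suggests a path-based distance such as A*).

```python
import math

def build_graph(orders, operators, h_fixed, speed_m_per_min):
    """Return (nodes, edges) with the features listed above.

    Edge values are travel times in minutes (assumed straight-line distance).
    """
    nodes, edges = {}, {}
    for j, o in orders.items():
        nodes[("order", j)] = {"pos": o["pos"], "skill_id": o["skill_id"],
                               "proc_time": o["proc_time"]}
    for i, op in operators.items():
        nodes[("operator", i)] = {"pos": op["pos"], "skills": set(op["skills"]),
                                  "h_fixed": h_fixed, "busy": op["busy"]}
    # Order-Order edges: static travel time T_jk for every ordered pair.
    for j, a in orders.items():
        for k, b in orders.items():
            if j != k:
                edges[(("order", j), ("order", k))] = (
                    math.dist(a["pos"], b["pos"]) / speed_m_per_min)
    # Operator-Order edges: travel time from the operator's current position.
    for i, op in operators.items():
        for j, o in orders.items():
            edges[(("operator", i), ("order", j))] = (
                math.dist(op["pos"], o["pos"]) / speed_m_per_min)
    return nodes, edges

# Hypothetical data (meters, minutes).
orders = {1: {"pos": (30.0, 40.0), "skill_id": 0, "proc_time": 5.0},
          2: {"pos": (60.0, 80.0), "skill_id": 1, "proc_time": 4.0}}
operators = {"A": {"pos": (0.0, 0.0), "skills": [0, 1], "busy": False}}

nodes, edges = build_graph(orders, operators, h_fixed=480.0,
                           speed_m_per_min=100.0)
```

Note that the Order-Order edge count grows quadratically with the number of orders, which is one reason the full 10,000-order graph is decomposed later in this document.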



### GNN Architecture
* Message Passing Layers: These layers iteratively compute a contextual node embedding for every order and operator by aggregating information from their neighbors. This allows the GNN to understand complex trade-offs, such as: "This order is close to Operator A, but Operator B is already assigned nearby tasks, making Operator B's route more efficient."

  * Dual Processing (Hidden Layers): The GNN's deep Message Passing layers are trained to understand the complex relationship between scheduling efficiency and labor cost.
    - Layer 1: Assignment Policy ($P_{assign}(i, j)$): This requires a set of weights ($W_{assign}$) and biases ($b_{assign}$) to map the GNN's final node embeddings to the assignment scores. This is a large layer, as it covers all $I_{max} \times J$ potential assignments.
    - Layer 2: Activation Policy ($P_{activate}(i)$): This requires a separate set of weights ($W_{activate}$) and biases ($b_{activate}$) to map the operator node embeddings to the activation scores ($y_i$). This layer must learn a distinct function specific to resource capacity.

    During this phase, the network learns features relevant to both:
    - Resource Features: Embeddings learn if an operator is "cost-effective" to activate (reflecting the $\beta$ penalty).
    - Scheduling Features: Embeddings learn the optimal sequence and travel time (reflecting the $\alpha$ penalty).

* Single Output Layer (Assignment Policy $\pi$):
 The final layer outputs only the Assignment Score ($P_{assign}(i, j)$) for every possible $\text{Operator}_i$ to $\text{Order}_j$ pair. This score reflects the GNN's learned opinion on whether that assignment is globally optimal.
 Operators with no associated assignments (missions) are considered "unactivated".

 * Final Weighted Loss ($\mathcal{L}_{\text{Total}}$):  
 The final loss value, which is minimized during training, is the weighted combination of these two aggregated errors:$$\mathcal{L}_{\text{Total}} = \left( \alpha \cdot \mathcal{L}_{\text{Scheduling}} \right) + \left( \beta \cdot \mathcal{L}_{\text{Activation}} \right)$$

    - Total Activation Loss: The total activation error is the sum of local errors across all potential operator nodes:$$\mathcal{L}_{\text{Activation}} = \sum_{i \in I_{\text{max}}} \text{BCE}(\hat{y}_i \text{ vs } y_i^*)$$
    - Total Scheduling Loss: The total scheduling error is the sum of local errors across all potential sequencing flows (edges):$$\mathcal{L}_{\text{Scheduling}} = \sum_{i \in I_{\text{max}}} \sum_{j, k \in J,  j \ne k} \text{BCE}(\hat{z}_{ijk} \text{ vs } z_{ijk}^*)$$

    This process ensures that every decision point in the system, whether it's the strategic choice to activate an operator ($\mathcal{L}_{\text{Activation}}$) or the tactical choice of sequence ($\mathcal{L}_{\text{Scheduling}}$), contributes to the single error signal used to update the GNN's parameters.
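The weighted loss can be written out in a few lines. This is a plain-Python sketch for illustration only: a real training loop would use the loss functions of a deep-learning framework, and the dictionary layout of the predictions (`y_hat`, `z_hat`) and expert labels (`y_star`, `z_star`) is an assumption.

```python
import math

def bce(p_hat, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(p_hat, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def total_loss(y_hat, y_star, z_hat, z_star, alpha, beta):
    """alpha * L_Scheduling + beta * L_Activation, summed over all labels."""
    l_activation = sum(bce(y_hat[i], y_star[i]) for i in y_star)
    l_scheduling = sum(bce(z_hat[e], z_star[e]) for e in z_star)
    return alpha * l_scheduling + beta * l_activation

# Hypothetical predictions for one operator "A" and two sequencing arcs.
y_hat, y_star = {"A": 0.9}, {"A": 1}
z_hat = {("A", 1, 2): 0.8, ("A", 2, 1): 0.3}
z_star = {("A", 1, 2): 1, ("A", 2, 1): 0}

loss = total_loss(y_hat, y_star, z_hat, z_star, alpha=1.0, beta=0.5)
```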

### Imitation Learning
The solution above is a basic form of imitation learning based on standard supervised learning. It's called "Offline Behavioral Cloning" because, once trained on solved MIP schedules, the model is used directly at test time. In contrast, "Interactive IL (DAgger)" adjusts the model online: whenever the GNN reaches an infeasible state, the MIP side is called inside the GNN training loop to label the new states the GNN visits (note that the MIP side needs to be adjusted to accept partially solved (fixed-variable) problems).

If we observe the "drifting" problem (e.g. the GNN starts well but finishes with terrible routes), we can use a simplified DAgger approach:

- Run the trained (BC) GNN on 100 new batches.
- Save the states where it struggled or produced infeasible/bad results.
- Ask the MIP to solve only those specific difficult states.
- Add this data to our training set and fine-tune. (This is "Offline DAgger" and avoids the live-loop cost).
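The four steps above can be sketched as one function. Every callable here (`policy`, `expert`, `is_bad`, `fine_tune`) is a hypothetical placeholder for the trained GNN, the MIP solver, a feasibility/quality check, and the fine-tuning routine. The toy stand-ins at the bottom only exist to make the sketch runnable.

```python
def offline_dagger_round(policy, expert, is_bad, new_states, dataset, fine_tune):
    """One round of the simplified ("offline") DAgger procedure.

    policy(state) -> prediction of the trained model
    expert(state) -> expert label (here: the MIP solution)
    is_bad(state, prediction) -> True if the model struggled on this state
    fine_tune(dataset) -> updates the model on the aggregated dataset
    """
    # Steps 1-2: run the model and keep only the states where it struggled.
    hard_states = [s for s in new_states if is_bad(s, policy(s))]
    # Step 3: ask the expert to label only those difficult states.
    dataset.extend((s, expert(s)) for s in hard_states)
    # Step 4: fine-tune on the aggregated data (skipped if nothing was added).
    if hard_states:
        fine_tune(dataset)
    return hard_states

# Toy stand-ins so the sketch runs; none of these reflect a real model.
calls = []
dataset = [("seed-state", "seed-label")]
hard = offline_dagger_round(
    policy=lambda s: s % 3,
    expert=lambda s: f"mip-label-{s}",
    is_bad=lambda s, pred: pred == 0,
    new_states=range(6),
    dataset=dataset,
    fine_tune=lambda data: calls.append(len(data)),
)
```

Because the expert is called only on the collected hard states, the number of MIP solves per round is bounded by the number of failures rather than by the number of visited states.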

### Divide-and-Conquer Decomposition
We cannot simply feed a 10,000-node graph into a standard Attention Model or GNN because the memory and compute complexity (often $O(N^2)$ for attention) will explode.

"Cluster-First, Route-Second with GNN"
1. Global Decomposition (Clustering)

    Instead of one massive routing problem, we break the 10,000 missions into manageable clusters.

    - Method: Use a fast spatial clustering algorithm (like K-Means, constrained K-Means, or a space-filling curve sweep) to group missions into clusters of ~50-100 missions.

    - Role: This handles the "Assignment" at a coarse level. It assigns a group of nearby missions to a group of operators (or a single operator if the cluster fits one vehicle).

    - Why: Traditional clustering algorithms handle 10,000 points instantly.

2. The GNN Solver (The "Local Expert")

    Our GNN is trained only on small problems (e.g., 50-100 missions). We do not train it on 10,000 nodes.

    - Training: We train the GNN (using the 2-stage approach discussed before) on random batches of 50-100 missions using our MIP expert.

    - Inference: We apply this trained GNN to each of the ~100-200 clusters generated in Step 1 independently.

    - Benefit: The GNN inference is parallelizable. We can solve all 200 sub-problems simultaneously on a GPU.

3. Global Refinement (Iterative Improvement)

    Simply gluing cluster solutions together might miss optimization opportunities at the boundaries (e.g., a mission in Cluster A is actually closer to Cluster B's route).

    - Method: Use a Large Neighborhood Search (LNS) guided by our GNN.

        * Destroy: Select a subset of missions (e.g., a random region of 50 missions) from the current full solution and remove them.

        * Repair: Feed these 50 missions + relevant operators into our Routing GNN to generate a new, optimized local schedule.

        * Accept: If the new solution is better (lower makespan/cost), keep it.

    - Repeat: Do this for 1,000+ iterations. The GNN replaces the expensive "Greedy Insertion" or "MIP Repair" usually used in LNS.
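The destroy/repair/accept loop of Step 3 can be sketched as follows. To keep the example self-contained, the repair step is a simple greedy insertion standing in for the Routing GNN, and the cost is a makespan proxy that ignores travel times. Both are assumptions of this sketch, as are the function names and the toy data.

```python
import random

def cost(routes, duration):
    """Makespan proxy: largest total duration on any route (travel ignored)."""
    return max(sum(duration[m] for m in r) for r in routes)

def greedy_repair(routes, removed, duration):
    """Stand-in for the GNN repair: put each mission on the lightest route."""
    for m in sorted(removed, key=lambda m: -duration[m]):
        lightest = min(routes, key=lambda r: sum(duration[x] for x in r))
        lightest.append(m)
    return routes

def lns(routes, duration, iterations=200, destroy_size=3, seed=0):
    rng = random.Random(seed)
    best = [list(r) for r in routes]
    best_cost = cost(best, duration)
    for _ in range(iterations):
        cand = [list(r) for r in best]
        assigned = [m for r in cand for m in r]
        # Destroy: remove a random subset of missions from the solution.
        removed = set(rng.sample(assigned, min(destroy_size, len(assigned))))
        cand = [[m for m in r if m not in removed] for r in cand]
        # Repair: reinsert the removed missions.
        cand = greedy_repair(cand, removed, duration)
        # Accept: keep the candidate only if it improves the cost.
        cand_cost = cost(cand, duration)
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
    return best, best_cost

# Hypothetical start: all 8 missions on one route, two routes empty.
duration = dict(enumerate([9, 8, 7, 6, 5, 4, 3, 2]))
start = [[0, 1, 2, 3, 4, 5, 6, 7], [], []]
best, best_cost = lns(start, duration)
```

Swapping `greedy_repair` for a call to the trained model is the only change needed to obtain the GNN-guided variant described above.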


### How will the GNN respect $H_{fixed}$?
The GNN must take $H_{fixed}$ into account, but it does so differently than the MIP. While the MIP uses $H_{fixed}$ as a "hard" mathematical wall, the GNN treats it as a learned spatial capacity.

To ensure the GNN respects this limit, we must provide $H_{fixed}$ to the model in two specific ways (as an input feature and through the expert labels it studies):  

1. $H_{fixed}$ as a "Global Feature"

    The GNN needs to know the "budget" of time available. If we only give it order locations, it doesn't know if the operators have 4 hours or 8 hours to complete them.

    * Implementation: We should include $H_{fixed}$ as a Global Attribute in our graph.

    * In libraries like PyTorch Geometric, this is often handled by a global feature vector u that is concatenated to the node embeddings during the update phase.

    * Why: This allows the GNN to understand the "density" of the problem. If $H_{fixed}$ is small, the GNN learns that it needs to activate more operators ($y_i$) to cover the same number of orders.

2. $H_{fixed}$ via "Expert Mimicry" (The Labels)

    The most powerful way the GNN "takes into account" the shift limit is by observing the MIP's behavior.

    * When the MIP solver generates the training data, it strictly obeys the constraint: $C_{i,last} - S_{i,first} \le H_{fixed}$.

    * If a cluster of orders is too far away to be completed within $H_{fixed}$, the MIP expert will either:

        - Assign those orders to a different, closer operator.

        - Activate an additional operator to share the load.

    * The GNN learns the pattern: By looking at thousands of examples where the MIP "refused" to add more orders to a route that was nearing the $H_{fixed}$ limit, the GNN develops a learned heuristic. It begins to "feel" when a route is getting too long.
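The first option can be sketched in its simplest form: appending a normalized $H_{fixed}$ to every node's feature vector at the input. This differs from concatenating a global vector inside each update step, and the reference shift length `h_ref` of 480 minutes is an assumed normalization constant, not a value from the model above.

```python
def add_global_feature(node_features, h_fixed, h_ref=480.0):
    """Append the normalized shift length to each node's feature vector.

    node_features: {node_id: [float, ...]}
    h_ref: assumed reference shift length in minutes, used only for scaling.
    """
    g = h_fixed / h_ref
    return {v: list(feats) + [g] for v, feats in node_features.items()}

# Hypothetical node features: (x, y) coordinates only.
features = {"order_1": [30.0, 40.0], "operator_A": [0.0, 0.0]}
features_4h = add_global_feature(features, h_fixed=240.0)
features_8h = add_global_feature(features, h_fixed=480.0)
```

The same graph then yields different inputs for a 4-hour and an 8-hour shift, which is what lets the model condition its activation decisions on the available time budget.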

### Should we generate the whole dataset for GNN with the same $\alpha$ and $\beta$?
The decision depends entirely on whether we want our final GNN to be a "Specialist" (good at one specific trade-off) or a "Generalist" (able to adjust priorities at runtime).

1. Option A: Fixed Weights (The "Specialist")

    We fix $\alpha$ and $\beta$ (e.g., $\alpha=1, \beta=1$) for all training data generation.

    - Pros: Training is easier because the expert policy is consistent: for a given state, the expert always optimizes the same goal, so the GNN learns this single strategy faster.  
    It also has a higher peak performance: the model specializes in exactly that balance.

    - Cons: It's rigid. If business priorities change (e.g., "It's peak season, forget operator costs, just minimize makespan!"), our model is useless: we must re-generate the data and re-train from scratch.

    - Best For: Production systems where the cost function is strict and never changes.

2. Option B: Variable Weights (The "Generalist")
    We randomly sample $\alpha$ and $\beta$ for each batch during data generation (e.g., one batch uses 90/10, another uses 10/90), and we input these weights into the GNN.

    - Pros: It gives runtime control. At inference time, we can input $(\alpha=1.0, \beta=0.0)$ to get a fast schedule, or $(\alpha=0.0, \beta=1.0)$ to get a cheap schedule, using the same model.

    - Cons: It's harder to train, because the GNN has to learn a mapping that depends on both the graph state and the weights. We must include $[\alpha, \beta]$ as part of the Global Context features in our GNN input. If we don't feed the weights to the GNN, it will receive conflicting signals (same state, different expert actions) and fail to converge.

    - Steps:
        * Data Gen: Pick random α,β for each instance. Solve MIP.

        * GNN Input: Add α and β to the Global Graph Attributes vector.

        * Training: Now the GNN sees: (Graph + [High Alpha]) -> (Label: 10 Operators).

        * Result: The GNN learns the relationship: "When the input Alpha is high, I should activate more operators."
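The data-generation steps for Option B can be sketched as below. The `solve_mip` callable is a hypothetical placeholder for the MIP expert, and drawing α uniformly with β = 1 - α is only one possible sampling convention, chosen here as an assumption.

```python
import random

def generate_dataset(instances, solve_mip, seed=0):
    """Sample one (alpha, beta) pair per instance and store it with the label."""
    rng = random.Random(seed)
    data = []
    for inst in instances:
        alpha = rng.random()
        beta = 1.0 - alpha                     # assumed: weights sum to 1
        label = solve_mip(inst, alpha, beta)   # expert solution for these weights
        data.append({
            "graph": inst,
            "global": [alpha, beta],           # goes into the global attributes
            "label": label,
        })
    return data

# Toy stand-in for the expert: more operators when makespan matters more.
def fake_solver(instance, alpha, beta):
    return {"operators_used": 3 if alpha > beta else 1}

dataset = generate_dataset(["instance-0", "instance-1", "instance-2"], fake_solver)
```

Storing the weights next to each label keeps the pairing between a sampled trade-off and its expert solution explicit, which avoids the conflicting-signal problem described above.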



#### References:
* https://jmlr.org/papers/volume24/21-0449/21-0449.pdf
* https://www.sciencedirect.com/science/article/pii/S0305054825001303
* https://proceedings.neurips.cc/paper/2020/file/d1e946f4e67db4b362ad23818a6fb78a-Paper.pdf
* https://imitation.readthedocs.io/en/latest/algorithms/dagger.html

### MIP Constraint Relaxation

MIP constraint relaxation means relaxing the integrality requirements (treating integer variables as continuous), or sometimes other specific constraints, to simplify a complex MIP problem into a solvable Linear Program (LP).
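
As a worked illustration for the binary variables of the model above ($Z^{*}_{LP}$ and $Z^{*}_{MIP}$ are notation introduced here for the optimal objective values of the relaxed and the original problem):

$$x_{ij},\ y_{i},\ z_{ikj} \in \{0, 1\} \quad \longrightarrow \quad 0 \le x_{ij},\ y_{i},\ z_{ikj} \le 1$$

Every integer-feasible solution is also feasible for the relaxation, and the objective is minimized, so the relaxed optimum is a lower bound on the true optimum: $Z^{*}_{LP} \le Z^{*}_{MIP}$.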