### New Alternatives:
A monolithic MIP solver handles at most about 50 to 100 orders, which makes it impractical even for generating training data, because it yields too few training examples. As alternatives, we could consider:
- Decomposition technique that guarantees optimality (column generation):  
    At the top level, the problem has the structure of a Set Covering Problem (or Set Partitioning), which is the standard master problem in column generation.  
    The primary difficulty in Vehicle Routing Problems (VRP) is the massive number of possible routes (the complicating variables). A forklift route that services 10 orders can be sequenced in 10! = 3,628,800 ways. With 100 orders, the number of possible routes becomes astronomical, and no monolithic solver can enumerate them.

    * Column Generation Solution: The algorithm generates only the few most valuable routes (columns) needed to cover all orders optimally, using the following decomposition:

        - Master Problem: Decides which routes (columns) to select to cover all orders while minimizing the number of operators (fleet size).

        - Subproblem (Pricing Problem): Finds the single most profitable route (the new column with the most negative reduced cost) to add to the Master Problem's choice set. This subproblem is a Shortest Path Problem with Resource Constraints (SPPRC).

    * Handling the Objectives:

        - Minimize Operators: This is the natural objective of the Master Problem (a Set Covering formulation).

        - Minimize Makespan: This objective can be handled as a constraint, by adding a time limit (resource constraint) to the Subproblem (the SPPRC) so that only routes below a maximum duration are generated.

- MIP Column Generation + GNN:  
    We might use the decomposed MIP (CG) model to generate a larger set of training examples for the GNN, and then use the trained GNN to solve large-scale problems.  
    Note that the decomposed MIP (CG) model can also be used directly on problems with n.missions < 500.
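
For reference, the decomposition described above can be written compactly. This is a minimal sketch under stated assumptions, not the project's actual model: $R$ is the set of feasible routes whose duration stays below a makespan limit $T_{max}$, $a_{ir} = 1$ if route $r$ covers order $i$, $\lambda_r$ selects route $r$, each selected route uses one operator, and $\alpha_i$ is the dual value of the covering constraint of order $i$.

$$
\begin{aligned}
\text{(Master)}\quad & \min \sum_{r \in R} \lambda_r \\
\text{s.t.}\quad & \sum_{r \in R} a_{ir}\,\lambda_r \ge 1 \quad \forall i \in \text{Orders}, \\
& \lambda_r \in \{0, 1\} \quad \forall r \in R.
\end{aligned}
$$

$$
\text{(Pricing)}\quad \min_{r \in R} \ \bar{c}_r = 1 - \sum_{i} a_{ir}\,\alpha_i
$$

Column generation works on the LP relaxation ($\lambda_r \ge 0$) restricted to the routes generated so far, and adds a route whenever the pricing problem finds $\bar{c}_r < 0$.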

### What about MIP "Benders Decomposition"?
It is the mirror image of CG: CG adds variables (columns) to the Master, while Benders adds constraints (cuts). The Master Problem makes a high-level plan (e.g., "Assign Order set A to Forklift 1"). The Subproblem checks whether that plan is actually feasible or efficient.

If the Master's plan is bad, the Subproblem returns a Cut (Constraint).

- The Subproblem effectively says: "Our plan for Forklift 1 forces it to travel 50 km, but we estimated 10 km. Add a new constraint: if Set A is assigned to Forklift 1, then Cost >= 50."

- The Master re-solves with this new knowledge. It now "knows" the true cost of that assignment and will avoid it or account for it correctly in the next iteration.
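
The cut in this example can be sketched as a combinatorial (logic-based) Benders cut. The notation is an assumption for illustration only: $x_{ik} = 1$ if order $i$ is assigned to forklift $k$, $\theta_k \ge 0$ is the Master's estimate of forklift $k$'s travel cost, $S$ is the order set just evaluated by the Subproblem (Set A above), and $C^{*}(S)$ is the true cost it found (50 km).

$$
\theta_k \;\ge\; C^{*}(S)\left(1 - \sum_{i \in S} \bigl(1 - x_{ik}\bigr)\right)
$$

If every order of $S$ is assigned to forklift $k$, the right-hand side equals $C^{*}(S)$; if at least one is missing, it drops to zero or below and the cut is inactive. The cut is valid as long as adding more orders to a route never reduces its cost.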

The "Awareness" Exchange

Optimality is preserved because the Master and the Subproblem communicate in a loop until no new information can improve the solution.

Master/Subproblem Split: If we try to use Benders:

- Master Problem: Decide which orders go to which operator (Assignment).

- Subproblem: Sequence those assigned orders for each operator to find the makespan (This is a complex TSP/VRP).

The Problem: The Subproblem is still a difficult integer problem (a TSP), which defeats the purpose of Benders. Classical Benders works best when the Subproblem is a clean, solvable Linear Program (LP), because its cuts are derived from LP duality.

### What about "2-stage MIP decomposition"?
This approach introduces a mathematical flaw, referred to here as the decoupling penalty.

Here is a detailed explanation of why the Monolithic approach is superior for our specific multi-objective problem.

#### The Flaw of Decomposition: "Blind" Decisions

In a decomposed approach, we solve the problem in two isolated stages:

- Stage 1 (Activation MIP): Calculates the minimum number of operators needed based on Processing Time Capacity ($P_{ij}$).

- Stage 2 (Scheduling MIP/GNN): Tries to route those operators to minimize Makespan (Z).

#### The Problem: Stage 1 is "blind" to the routing reality.
It assumes that travel time is negligible or uses an average value, so it might select 5 operators because the processing hours technically fit.

However, when Stage 2 runs, it discovers that due to the spatial distribution of orders, 5 operators cannot physically visit all locations within the shift limit ($H_{fixed}$) without massive delays.

* Result: We get an infeasible or highly inefficient schedule, because the shift limit ($H_{fixed}$) is violated once the hidden travel costs are counted.
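
The gap between the two stages can be illustrated with simple arithmetic. The sketch below uses made-up numbers (40 orders, 50 minutes of processing each, a 480-minute shift, 15 minutes of travel per order), and the function names are hypothetical. It only compares capacity-based lower bounds and does not route anything.

```python
import math

def stage1_operators(processing_min, shift_min):
    """Stage 1 view: operators needed if only processing time counts."""
    return math.ceil(sum(processing_min) / shift_min)

def operators_lower_bound_with_travel(processing_min, travel_min_per_order, shift_min):
    """Lower bound once an assumed average travel time per order is added."""
    total = sum(processing_min) + travel_min_per_order * len(processing_min)
    return math.ceil(total / shift_min)

# Hypothetical data: 40 orders of 50 minutes each, 8-hour shift (480 min).
processing = [50] * 40
shift = 480

blind = stage1_operators(processing, shift)                       # ceil(2000 / 480)
aware = operators_lower_bound_with_travel(processing, 15, shift)  # ceil(2600 / 480)

print(f"Stage 1 (no travel): {blind} operators")
print(f"With 15 min travel per order: at least {aware} operators")
```

With these numbers Stage 1 activates 5 operators, while the travel-aware bound already requires at least 6, so any Stage 2 schedule built on 5 operators must exceed the shift limit.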

### Dual Variables instability in CG
Dual instability, or dual oscillation, refers to a critical computational bottleneck that often plagues the standard Column Generation (CG) algorithm, especially when applied to large-scale set partitioning problems like the Vehicle Routing Problem (VRP).

This phenomenon significantly slows the convergence of the algorithm and can force it to perform far more iterations than necessary.

Here is an explanation of the cause, the effect, and the advanced strategies used to combat this instability:

##### The Cause: Primal Degeneracy and Multiple Dual Optima

The oscillation of dual variables is rooted in the mathematical properties of the Master Problem's Linear Programming (LP) relaxation:

- Primal Degeneracy: In large-scale VRP formulations, the Restricted Master Problem (RMP) often exhibits primal degeneracy. This means that the optimal solution to the RMP (the values of the route variables, $\lambda_r$) can be represented by multiple different sets of basic variables in the simplex tableau, yet they all yield the exact same objective value.

- Multiple Dual Optima: A degenerate primal solution implies that there exists an entire set of optimal dual solutions (the dual variables, $\alpha_i$, $\pi_v$, etc.). The standard simplex method, used to solve the RMP, may jump between these different optimal dual solutions from one iteration to the next.

This jumping causes the dual variables to oscillate wildly rather than smoothly converging toward a single stable value, a situation described as dual instability.  
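
A two-order example illustrates the mechanism. Assume, purely for illustration, that the RMP currently holds a single route $\lambda_1$ of cost 1 that covers both order A and order B:

$$
\begin{aligned}
\text{(RMP)}\quad & \min \lambda_1 \quad \text{s.t.}\quad \lambda_1 \ge 1 \ (\text{order } A), \quad \lambda_1 \ge 1 \ (\text{order } B), \quad \lambda_1 \ge 0 \\
\text{(Dual)}\quad & \max \alpha_A + \alpha_B \quad \text{s.t.}\quad \alpha_A + \alpha_B \le 1, \quad \alpha_A, \alpha_B \ge 0
\end{aligned}
$$

The primal optimum $\lambda_1 = 1$ is degenerate (both covering constraints are tight), and every point with $\alpha_A + \alpha_B = 1$ is dual optimal. A simplex solver returns a vertex, either $(1, 0)$ or $(0, 1)$, so from one iteration to the next the pricing problem may see order A as worth everything and order B as worth nothing, or the reverse.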

##### The Effect: Slow Convergence (Tailing Off)

The wild oscillation of the dual variables has two severe computational consequences:

- Tailing-Off Effect: The total objective function value of the RMP improves rapidly at the beginning of the CG process but then slows down drastically, often requiring hundreds or thousands of iterations to achieve a tiny marginal improvement or prove final optimality. This is known as the "tailing-off" effect.

- Generation of "Extreme Columns": When the dual variables are oscillating, they are far from the true, stable optimal dual solution. This means that the Pricing Problem (Subproblem) often generates "extreme columns" (routes) that have a negative reduced cost in the current iteration but are not part of the ultimate optimal solution. These useless columns must be carried along in the RMP, consuming memory and computation time without contributing to the final result.

The instability fundamentally prevents the algorithm from efficiently proving that no more profitable routes exist (i.e., that the minimum reduced cost over all routes satisfies $\min_r \bar{c}_r \ge 0$).

##### The Solution: Dual Stabilization Techniques

To overcome dual instability, advanced stabilization techniques are employed to guide the dual variables towards a stable "center," or a consensus value, rather than allowing them to jump randomly:

1. Defining a Stability Center: The goal is to generate columns using dual variables that are close to the estimated optimal dual values, known as the stability center.

2. Boxstep Methods: These techniques constrain the dual variables ($\alpha_i$, $\pi_v$) at each iteration to lie within a "box" centered around the previous, successful dual solution. This restriction prevents the wild swings, forcing the next dual solution to remain close to the last one.

3. Stabilized Column Generation (SCG): More sophisticated methods, often based on Augmented Lagrangian or Proximal Point theory, explicitly introduce a stabilizing term into the dual objective function. This term penalizes dual solutions that are far from the stability center, mathematically forcing a smoother path toward convergence. This typically requires introducing additional bounded variables into the Restricted Master Problem (RMP) to manage the penalty.
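
As a sketch of the third technique, one common way to write a stabilized RMP adds bounded surplus and slack variables to each covering constraint. The notation is an assumption for illustration: $c_r$ is the cost of route $r$, $a_{ir} = 1$ if route $r$ covers order $i$, $y_i^-, y_i^+$ are the added variables, $[\delta_i^-, \delta_i^+]$ is the box around the stability center for the dual $\alpha_i$, and $\varepsilon_i^-, \varepsilon_i^+$ are the upper bounds on the added variables.

$$
\begin{aligned}
\min\ & \sum_{r} c_r \lambda_r - \sum_i \delta_i^- y_i^- + \sum_i \delta_i^+ y_i^+ \\
\text{s.t.}\ & \sum_{r} a_{ir} \lambda_r - y_i^- + y_i^+ \ge 1 \quad \forall i, \\
& 0 \le y_i^- \le \varepsilon_i^-, \quad 0 \le y_i^+ \le \varepsilon_i^+, \quad \lambda_r \ge 0.
\end{aligned}
$$

In the dual, each $\alpha_i$ may leave the box $[\delta_i^-, \delta_i^+]$ only at a penalty proportional to $\varepsilon_i^{\pm}$. Removing the upper bounds on $y_i^{\pm}$ turns the box into a hard constraint, which corresponds to the Boxstep idea.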

### How does MIP decomposition guarantee optimality?
A natural objection is that, after splitting the problem into a master (which minimizes the number of operators to activate) and a subproblem (which minimizes the makespan of a sequence of orders), the subproblem just finds "the shortest route" and ignores the fleet size, so the approach may not be optimal.

In fact, decomposing a monolithic MIP through column generation can still reach optimality. It follows an iterative process that stops only when the subproblem proves that the "Reduced Cost" is non-negative for every column in the entire solution space; if the process stops earlier, it produces only a bound, not the optimum. At that point the LP relaxation of the master is solved to optimality, and integer optimality additionally requires branching (Branch-and-Price).

The Master Problem sends Dual Values (Shadow Prices) to the Subproblem, and the exchange works as follows:

  * If Order A is currently uncovered and "expensive" to cover in the Master, the Master assigns it a high "reward" (dual value).

  * The Subproblem changes its objective. It no longer looks for the shortest route; it looks for the most profitable route (Reduced Cost).

  * Formula: The Subproblem minimizes $\text{Cost}_{route} - \sum_{i \in route} \text{Reward}_{i}$.

  * Result: The Subproblem effectively says, "I found a weird, long route, but it covers these 3 difficult orders we are struggling with." This perfectly aligns the local search with the global goal.
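
The reduced-cost test above can be sketched in a few lines. The dual values and routes below are hypothetical, and each route is assumed to cost 1 in the Master objective (one operator per route).

```python
def reduced_cost(route_cost, orders_on_route, duals):
    """Reduced cost = route cost minus the dual rewards of the covered orders."""
    return route_cost - sum(duals[order] for order in orders_on_route)

# Hypothetical duals returned by the Master (one reward per order).
duals = {"A": 0.6, "B": 0.3, "C": 0.5}

# Each route uses one operator, so its cost in the Master objective is 1.
short_route = reduced_cost(1.0, ["B"], duals)       # covers one cheap order
long_route = reduced_cost(1.0, ["A", "C"], duals)   # covers two "difficult" orders

for name, rc in [("short_route", short_route), ("long_route", long_route)]:
    verdict = "add to Master" if rc < -1e-9 else "skip"
    print(f"{name}: reduced cost {rc:+.2f} -> {verdict}")
```

The longer route wins because the rewards of orders A and C outweigh its cost, even though a plain shortest-route criterion would prefer the short one.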

### Even with an optimal CG decomposition, large-scale VRP (10,000 orders) remains a challenge
The VRP is an NP-hard problem, and the number of potential routes grows exponentially with the number of customers. Even for instances with only 15 customers, the number of possible routes is staggering. A problem with 10,000 orders presents two major computational barriers:

  * Exact Solution Intractability: Exact methods like Branch-and-Price (B&P), which rely on CG, are typically limited to instances with fewer than 1,000 nodes (customers), often solving instances with 50 to a few hundred nodes to proven optimality.

  * Resource-Constrained Elementary Shortest Path Problem (RCESPP): The Subproblem (the pricing problem) in CG is the RCESPP, the elementary variant of the SPPRC, which is itself NP-hard. The time required to solve it increases rapidly as constraints (like time windows, makespan, and capacity) are added. Solving RCESPPs over 10,000 customers repeatedly, across hundreds of iterations, to obtain an exact solution is computationally prohibitive.

Note: Global and local optimality can be reached for a moderate set of orders (fewer than 1,000); beyond that, it depends on how well CG scales with the number of nodes.
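
The growth in the number of routes is easy to check numerically. The sketch below counts sequences only and ignores feasibility (time windows, capacity); the function names are illustrative.

```python
import math

def sequences_of_all(n):
    """Number of ways to sequence one route that visits all n orders."""
    return math.factorial(n)

def ordered_routes(n):
    """Number of non-empty ordered routes over any subset of n orders."""
    return sum(math.perm(n, k) for k in range(1, n + 1))

for n in (5, 10, 15):
    print(f"n={n:>2}: {sequences_of_all(n):,} full sequences, "
          f"{ordered_routes(n):,} routes over all subsets")
```

For 15 orders the number of full sequences alone already exceeds one trillion, which is why CG generates routes on demand instead of enumerating them.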

#### CG Scalability: The Decomposition Principle

The scalability of CG comes from breaking a massive, "monolithic" problem into two interacting parts:

  * The Restricted Master Problem (RMP): A version of the original problem containing only a small subset of possible variables (columns).

  * The Pricing Subproblem: A separate, smaller optimization problem that "searches" for the next best variable to add to the Master Problem.
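
The interaction between the two parts can be sketched as a generic loop. This is only a control-flow skeleton: `solve_rmp` and `price` are hypothetical callbacks standing in for a real LP solve and a real pricing solver, and the stubs at the bottom exist only to exercise the loop.

```python
from typing import Callable, Dict, List, Optional, Tuple

Column = Tuple[str, ...]   # a route, written as the orders it covers
Duals = Dict[str, float]   # one dual value per order

def column_generation(
    initial_columns: List[Column],
    solve_rmp: Callable[[List[Column]], Tuple[float, Duals]],
    price: Callable[[Duals], Optional[Column]],
    max_iterations: int = 1000,
) -> Tuple[float, List[Column]]:
    """Alternate between the RMP and the pricing subproblem.

    solve_rmp(columns) -> (objective, duals)  # hypothetical LP solve
    price(duals) -> a new column, or None when no column has negative reduced cost
    """
    columns = list(initial_columns)
    objective, duals = solve_rmp(columns)
    for _ in range(max_iterations):
        new_column = price(duals)
        if new_column is None:       # pricing found nothing: the LP relaxation is solved
            break
        columns.append(new_column)   # only columns returned by pricing are stored
        objective, duals = solve_rmp(columns)
    return objective, columns

# Stub callbacks, only to exercise the control flow (not a real LP or SPPRC).
pending = [("A", "B"), ("C",)]

def fake_rmp(cols: List[Column]) -> Tuple[float, Duals]:
    return float(len(cols)), {}

def fake_price(duals: Duals) -> Optional[Column]:
    return pending.pop(0) if pending else None

objective, columns = column_generation([("A",)], fake_rmp, fake_price)
print(objective, columns)
```

Keeping the two solvers behind callbacks mirrors the decomposition itself: the loop never needs to know how many columns exist in total, only whether pricing can still produce one.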

Key Scalability Drivers:

  - Memory Efficiency: We do not need to load all variables into memory. For example, in a Cutting Stock Problem, there are billions of ways to cut a roll of paper. CG only generates the dozen or so "patterns" that actually improve the current solution.

  - Parallelization: The Pricing Subproblem can often be decomposed further. If we have 100 different vehicles to schedule, we can solve 100 independent pricing problems in parallel on different CPU cores.

  - Tight Lower Bounds: When used within a Branch-and-Price framework, the Dantzig-Wolfe decomposition typically provides much "tighter" (more accurate) LP relaxations than the original compact formulation. This prunes the search tree significantly, preventing the "exponential explosion" of nodes.

#### Then why not use Column Generation instead of a trained GNN?

The trained GNN could be recommended for these reasons:

1. Inference Speed (The Real World Constraint):

    * Column Generation: It is an iterative algorithm. For a new day of orders, running a full Branch-and-Price (CG + Branching) solver can take minutes to hours.

    * GNN: Once trained, it produces a solution in ***seconds*** (it is size-invariant: the same trained weights are reused for any instance size). In a live warehouse environment, speed is often more critical than proving mathematical optimality.

2. Complexity of Makespan (Z):

    * Column Generation is fantastic for minimizing the Sum of costs (e.g., Total Distance).

    * Minimizing the Makespan ($Z = \max_i C_i$) is notoriously difficult in Column Generation because the objective is no longer a sum of independent column costs: the cost of one column depends on the duration of the other columns (to determine the max). It requires complex techniques like "Binary Search on Makespan," which make it even slower.
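
The "Binary Search on Makespan" idea can be sketched as follows. The oracle `operators_needed` is a hypothetical stand-in: in a real setting each call would be a full CG (or Branch-and-Price) solve with the makespan limit imposed as a resource constraint, which is exactly what makes the approach slow. The toy oracle below is only a crude capacity bound.

```python
import math
from typing import Callable

def smallest_feasible_makespan(
    operators_needed: Callable[[int], int],
    max_operators: int,
    low: int,
    high: int,
) -> int:
    """Binary search for the smallest makespan limit T in [low, high] such that
    operators_needed(T) <= max_operators. Assumes operators_needed is
    non-increasing in T and that T = high is feasible."""
    while low < high:
        mid = (low + high) // 2
        if operators_needed(mid) <= max_operators:
            high = mid       # feasible: try a tighter limit
        else:
            low = mid + 1    # infeasible: relax the limit
    return low

# Stand-in oracle (an assumption): a capacity bound instead of a full CG solve.
durations = [50] * 40   # hypothetical order durations in minutes

def toy_oracle(limit: int) -> int:
    return math.ceil(sum(durations) / limit)

best = smallest_feasible_makespan(
    toy_oracle, max_operators=5, low=max(durations), high=sum(durations)
)
print(f"Smallest makespan limit with 5 operators: {best} minutes")
```

Each probe of the search costs one complete solve, so the total running time is roughly the cost of one CG run multiplied by the logarithm of the makespan range.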