<a href="https://colab.research.google.com/github/PATELOM925/Data-Mining/blob/main/Copy_of_Idea_%5BMichael%5D_GNN_for_Structured_Problem_Representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup
Split into two cells, as Colab needs to restart after installing requirements from Starjob repo.
- Downloads the Starjob codebase from GitHub: https://github.com/starjob42/Starjob.
- Installs pip requirements
    - First, requirements from the Starjob repo.
    - Then, extra requirements so that the workflow works on Google Colab.
- Downloads the Starjob dataset from HuggingFace and moves it to the folder expected by the fine-tuning script (`train_llama_3.py`).

In [1]:
!git clone https://github.com/starjob42/Starjob.git

# seems to conflict with requirements of unsloth on Colab
#!pip install -r Starjob/requirements.txt

Cloning into 'Starjob'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 40 (delta 14), reused 8 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (40/40), 15.08 KiB | 3.02 MiB/s, done.
Resolving deltas: 100% (14/14), done.
Downloading starjob130k.json (1.9 GB)
Error downloading object: starjob130k.json (c9ec14e): Smudge error: Error downloading starjob130k.json (c9ec14ef703a5aa4cc360fe82d8f806fbe03cbee5846a8ff34c503baf6290dd1): batch response: This repository exceeded its LFS budget. The account responsible for the budget should increase it to restore access.

Errors logged to /content/Starjob/.git/lfs/logs/20251218T010640.991243838.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: starjob130k.json: smudge filter lfs failed
You can inspect what was checked out with 'git status'
and retry with 'git restore --so
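
The clone above fails at checkout because the repository's Git LFS budget is exhausted, so `starjob130k.json` cannot be fetched. Since the dataset is downloaded from HuggingFace in a later cell anyway, one possible workaround is to skip LFS downloads during the clone. A minimal sketch, assuming Git LFS's `GIT_LFS_SKIP_SMUDGE` environment variable; the helper names `lfs_skip_env` and `clone_without_lfs` are hypothetical:

```python
import os
import subprocess

def lfs_skip_env(base_env=None):
    """Return a copy of the environment that tells Git LFS to leave pointer files in place."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return env

def clone_without_lfs(url, dest):
    """Clone `url` into `dest` without downloading LFS objects (hypothetical helper)."""
    return subprocess.run(["git", "clone", url, dest], env=lfs_skip_env(), check=True)

# Usage (not executed here):
# clone_without_lfs("https://github.com/starjob42/Starjob.git", "Starjob")
```

With this setting, the large JSON file stays a small pointer file in the working tree, which is fine as long as the data comes from HuggingFace.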

In [2]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [3]:
# Manually including packages from `requirements.txt`, since the versions conflict with unsloth
!pip install wandb datasets torch_geometric

!mkdir data/
!hf download mideavalwisard/Starjob --repo-type dataset --local-dir data/

Collecting torch_geometric
  Downloading torch_geometric-2.7.0-py3-none-any.whl.metadata (63 kB)
Downloading torch_geometric-2.7.0-py3-none-any.whl (1.3 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.7.0
Fetching 3 files:   0% 0/3 [00:00<?, ?it/s]Downloading 'starjob_130k.json' to 'data/.cache/huggingface/download/vLHkfhQ2OA1O6yx8HJW6OpwpvF0=.c9ec14ef703a5aa4cc360fe82d8f806fbe03cbee5846a8ff34c503baf6290dd1.incomplete'
Downloading 'README.md' to 'data/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.7b2062d30f6825b8a7b92c5861bd0306ace41478

# Fine-tuning
Manually pasting the content of `train_llama_3.py` below, splitting the cells at key points. Any other changes are highlighted.

## Initial Configuration

- Modifying to use static variables instead of `argparse` (which doesn't work with Colab).
- Also changed `dtype` to `float16`, since the Tesla T4 does not support `bfloat16`.
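
The precision could also be chosen at runtime instead of being hard-coded. A minimal sketch, assuming that native `bfloat16` support starts at CUDA compute capability 8.0 (the T4 reports 7.5); `pick_dtype_name` is a hypothetical helper and is not part of the original script:

```python
def pick_dtype_name(cc_major: int) -> str:
    """Use bfloat16 only on GPUs with compute capability 8.0 or newer (assumed threshold)."""
    return "bfloat16" if cc_major >= 8 else "float16"

try:
    import torch
    # get_device_capability() returns a (major, minor) tuple for the current GPU
    cc_major = torch.cuda.get_device_capability()[0] if torch.cuda.is_available() else 0
except ImportError:
    cc_major = 0  # no PyTorch available: fall back to the float16 choice

dtype = pick_dtype_name(cc_major)
print(dtype)
```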

In [4]:
# =========================
# Imports
# =========================

# When GPU available (i.e., full workflow)
import argparse
import torch
import wandb
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset

import re
import numpy as np
from typing import Tuple, List, Dict

import networkx as nx
from collections import defaultdict

from torch_geometric.data import Data
from torch_geometric.utils import from_networkx
from tqdm import tqdm

from trl import SFTTrainer
from transformers import TrainingArguments

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [5]:
# When GPU unavailable (i.e., dataset work only)
# import torch
# from datasets import load_dataset

# import re
# import numpy as np
# from typing import Tuple, List, Dict

# import networkx as nx
# from collections import defaultdict

# from torch_geometric.data import Data
# from torch_geometric.utils import from_networkx
# from tqdm import tqdm

In [6]:
# =========================
# Arguments
# =========================
# NOTE: Original script uses argparse, which doesn't seem to work with Colab
# Converting to plain variables instead

# Model and data parameters
max_seq_length = 50000
dtype = 'float16'           # NOTE: Changed from "bfloat16", which is not supported by T4 GPU
load_in_4bit = True

# LoRA hyperparameters
lora_r = 64
lora_alpha = 64
lora_dropout = 0.0
bias = 'none'

# Additional configurations
use_gradient_checkpointing = 'unsloth'
random_state = 42
use_rslora = False
loftq_config = None

# Training hyperparameters
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 5
num_train_epochs = 2
learning_rate = 2e-4
logging_steps = 1
optim = 'adamw_8bit'
weight_decay = 0.01
lr_scheduler_type = 'linear'
seed = 42
save_total_limit = 50
save_step = 200
per_device_eval_batch_size = 2
train_lm_head = False
train_embed_tokens = False

# Output directory
output_dir = None
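
With gradient accumulation, the number of examples contributing to each optimizer update is the per-device batch size multiplied by the accumulation steps (and by the number of GPUs). A quick worked check using the values above, assuming a single GPU as on this Colab runtime:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU Colab runtime

# Examples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(f"Effective batch size: {effective_batch_size}")
```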

In [7]:
# =========================
# Generate Output Directory Name
# =========================

# Create an output directory name based on hyperparameters
if output_dir is None:
    dir_out = f"output_alpha{lora_alpha}_r{lora_r}_train_lm_head{train_lm_head}_train_embed_tok_{train_embed_tokens}_seq{max_seq_length}_b{per_device_train_batch_size}_ep{num_train_epochs}"
else:
    dir_out = output_dir

# =========================
# Initialize WandB
# =========================

# Initialize Weights & Biases for experiment tracking
wandb.init(
    project="llama3-jssp-clean",  # Change the project name if needed
    name=dir_out,
)

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

 3


wandb: You chose "Don't visualize my results"


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


## Initialising model

In [8]:
# =========================
# Load Model and Tokenizer
# =========================

# Set dtype
dtype = torch.bfloat16 if dtype == 'bfloat16' else torch.float16

# Load the pre-trained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

target_modules =[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ]
if train_lm_head:
    target_modules.append('lm_head')
if train_embed_tokens:
    target_modules.append('embed_tokens')

# Configure the model with PEFT (Parameter-Efficient Fine-Tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_r,
    target_modules=target_modules,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias=bias,
    use_gradient_checkpointing=use_gradient_checkpointing,
    random_state=random_state,
    use_rslora=use_rslora,
    loftq_config=loftq_config,
)

==((====))==  Unsloth 2025.12.6: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth 2025.12.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## Preparing dataset
- Moved the `alpaca_prompt` template string and `formatting_prompts_func` function into this area, for cleanliness.
- Added extra comments for clarity.
- Incorporating the idea from Om's notebook, i.e., removing old-style Starjob instances: https://colab.research.google.com/drive/1bZjfnDFPrsp3CBXNa2nXr0AJtKXLDlgw?usp=sharing

In [9]:
# =========================
# Load and Prepare Dataset
# =========================

# load the data from the ./data folder
dataset = load_dataset('./data/', split="train")

# Remove old-style instances
dataset_new_style = dataset.filter(lambda x: x['matrix'] is not None)
print(f"length of og dataset: {len(dataset)} \n length of new-style dataset: {len(dataset_new_style)}")

# Obtain train/test splits
# split_dataset = dataset.train_test_split(test_size=0.02, seed=seed)
# split_dataset = dataset_new_style.train_test_split(test_size=0.02, seed=seed)

# Colab-friendly processing: use only 30% of the new-style dataset
# dataset_new_style_colab = dataset_new_style.shuffle(seed=seed).select(range(int(len(dataset_new_style)*0.3)))
# print(f"length of colab friendly dataset: {len(dataset_new_style_colab)}")
# split_dataset = dataset_new_style_colab.train_test_split(test_size=0.02, seed=seed)

# More Colab-friendly processing: use only 20% of the new-style dataset
dataset_new_style_colab = dataset_new_style.shuffle(seed=seed).select(range(int(len(dataset_new_style)*0.2)))
print(f"length of colab friendly dataset: {len(dataset_new_style_colab)}")
split_dataset = dataset_new_style_colab.train_test_split(test_size=0.06, seed=seed)

# Define the Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
# EOS_TOKEN = "END"       # Only while GPU unavailable

# Define function to create prompts with this template, using columns from the dataset
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}


# Map the train/test splits through the above function, to create prompts for the LLM
train_dataset = split_dataset['train'].map(formatting_prompts_func, batched=True)
eval_dataset = split_dataset['test'].map(formatting_prompts_func, batched=True)

print(f"Number of train examples: {len(train_dataset)} \n\n Number of test examples: {len(eval_dataset)}")

Generating train split: 0 examples [00:00, ? examples/s]

Filter:   0%|          | 0/129720 [00:00<?, ? examples/s]

length of og dataset: 129720 
 length of new-style dataset: 9720
length of colab friendly dataset: 1944


Map:   0%|          | 0/1827 [00:00<?, ? examples/s]

Map:   0%|          | 0/117 [00:00<?, ? examples/s]

Number of train examples: 1827 

 Number of test examples: 117
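
The sizes reported above follow from the subsampling and split fractions. A worked check, assuming the split rounds the test size up and the train size down (this rounding convention is an assumption, but it reproduces the counts printed above):

```python
import math

n_new_style = 9720
n_colab = int(n_new_style * 0.2)             # 20% subsample
n_test = math.ceil(n_colab * 0.06)           # assumed: test size rounded up
n_train = math.floor(n_colab * (1 - 0.06))   # assumed: train size rounded down

print(n_colab, n_train, n_test)
```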


### Issue with dataset
NOTE: There seems to be a significant issue with this script. As shown below, the "text" column of the old-style instances (i.e., those from the original "LLMs can Schedule" paper) contains _no_ instruction or input, only a response. This means the LLM is being trained on prompts that carry no problem description.


#### Screenshot


![link text](https://drive.google.com/uc?id=1EteiynGyp7X5KDPh0kqnmDJWbBg4aYaF)

#### Code

In [10]:
# Print out one of each for comparison
# New-style: Has a "matrix" column
# NOTE: The search starts at index 1 (index 0 is never checked), so in this run
# the sample printed below is train_dataset[1]
new_sample = None
for i in range(1, len(train_dataset)):
    if train_dataset[i]['matrix'] is not None:
        new_sample = train_dataset[i]
        # print(new_sample)         # All columns
        print(new_sample["text"]) # Only the "text" column (i.e., the one used by the LLM)
        break

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Optimize schedule for 30 Jobs (denoted as J) across 15 Machines (denoted as M) to minimize makespan. The makespan is the completion time of the last operation in the schedule. Each M can process only one J at a time, and once started, J cannot be interrupted.



### Input:
J0:
M13:91 M1:68 M11:106 M0:67 M14:119 M10:50 M3:44 M5:28 M6:122 M9:122 M7:12 M4:129 M2:162 M8:147 M12:105 
J1:
M8:96 M2:50 M3:167 M0:161 M10:75 M6:126 M4:42 M13:86 M9:135 M1:9 M11:96 M14:106 M12:111 M7:130 M5:50 
J2:
M14:166 M12:70 M13:67 M10:33 M8:8 M2:153 M0:86 M7:77 M1:71 M5:80 M11:34 M6:169 M3:66 M9:52 M4:30 
J3:
M6:115 M4:13 M9:49 M8:155 M11:43 M10:101 M2:23 M14:23 M12:15 M3:131 M0:160 M1:150 M7:9 M13:49 M5:109 
J4:
M7:6 M6:16 M13:153 M11:103 M14:59 M8:137 M0:79 M2:169 M12:75 M10:66 M9:11 M4:22 M1:162 M5:129 M3:160 
J5:
M11:118 M4:75 M9:1

In [None]:
# # Old-style: No "matrix" column
# old_sample = None
# i = 0
# while True:
#     i += 1
#     if train_dataset[i]['matrix'] is None:
#         old_sample = train_dataset[i]
#         # print(old_sample)         # All columns
#         print(old_sample["text"]) # Only the "text" column (i.e., the one used by the LLM)
#         break
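
To quantify how many formatted examples are affected, the Alpaca sections of each "text" value can be checked for empty content. A minimal sketch, assuming the template from the dataset-preparation cell; `empty_sections` is a hypothetical helper and the sample strings are made up for illustration:

```python
import re

# Matches the three Alpaca sections produced by the prompt template
SECTION_RE = re.compile(
    r"### Instruction:\n(?P<instruction>.*?)\n\n### Input:\n(?P<input>.*?)\n\n### Response:\n(?P<response>.*)",
    re.DOTALL,
)

def empty_sections(text: str) -> list:
    """Return the names of Alpaca sections that are missing or blank in `text`."""
    match = SECTION_RE.search(text)
    if match is None:
        return ["instruction", "input", "response"]
    return [name for name, value in match.groupdict().items() if not value.strip()]

good = "### Instruction:\nOptimize schedule\n\n### Input:\nJ0:\nM0:5\n\n### Response:\nMakespan: 5"
bad = "### Instruction:\n\n\n### Input:\n\n\n### Response:\nMakespan: 5"
print(empty_sections(good))
print(empty_sections(bad))
```

Mapping such a check over the "text" column would give a count of affected rows rather than relying on a single printed sample.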

## GNN Encoding of Instances

In [11]:
# =========================
# Convert Dataset to GNN Encoding
# =========================
# NOTE: Assuming that we only want to work with "new"-style instances (pending decision)

# Function to parse JSSP instances (from Starjob paper) into structured format.
def parse_jssp_instance(instance: Dict) -> Dict:
    """Returns:
    {
        num_jobs: number of jobs in problem
        num_machines: number of machines in problem
        num_operations: number of operations in problem
        job_operations: dictionary of job -> operations, where operations is a list of machine-duration pairs
    }
    """
    # Number of jobs and machines are already stored in the instance, just need to parse operations
    job_operations = {}
    lines = instance['input'].strip().split('\n')
    # Assumes lines alternate: a job header ('J<id>:') followed by that job's 'M<id>:<duration>' pairs
    # Stepping by 2 avoids indexing past the end if a trailing header has no operations line
    for i in range(0, len(lines) - 1, 2):
        job_id = int(re.search(r'J(\d+):', lines[i]).group(1))
        operations = re.findall(r'M(\d+):(\d+)', lines[i + 1])
        job_operations[job_id] = [(int(m), int(d)) for m, d in operations]

    # Also extract number of operations
    # NOTE: We assume that each job has the same number of operations, so we just multiply
    # the number of operations in the first job by the number of jobs
    num_operations = instance['num_jobs'] * len(next(iter(job_operations.values())))

    # Return everything as structured dictionary
    return {
        'num_jobs': instance['num_jobs'],
        'num_machines': instance['num_machines'],
        'num_operations': num_operations,
        'job_operations': job_operations,
    }

# If we need to use old-style, we can use a different parsing function
def parse_jssp_instance_with_old(nl_description: str) -> Dict:
    """
    Parse natural language JSSP description (from Starjob paper) into dictionary format.

    Returns:
    {
        num_jobs: number of jobs in problem
        num_machines: number of machines in problem
        num_operations: number of operations in problem
        job_operations: dictionary of job -> operations, where operations is a list of machine-duration pairs
    }
    """
    lines = nl_description.strip().split('\n')

    # Extract number of jobs and machines
    problem_line = lines[0]
    num_jobs = int(re.search(r'(\d+)\s+Jobs', problem_line).group(1))
    num_machines = int(re.search(r'(\d+)\s+Machines', problem_line).group(1))

    # Parse operations
    job_operations = {}
    for line in lines:
        if line.startswith('J'):
            job_id = int(re.search(r'J(\d+):', line).group(1))
            operations = re.findall(r'M(\d+):(\d+)', line)
            job_operations[job_id] = [(int(m), int(d)) for m, d in operations]

    # Extract number of operations
    # NOTE: We assume that each job has the same number of operations, so we just multiply
    # the number of operations in the first job by the number of jobs
    num_operations = num_jobs * len(next(iter(job_operations.values())))

    return {
        'num_jobs': num_jobs,
        'num_machines': num_machines,
        'num_operations': num_operations,
        'job_operations': job_operations,
    }
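
To illustrate the input format that both parsers rely on, here is a self-contained sketch of the same regular expressions applied to a made-up 2-job, 2-machine instance (the instance text is hypothetical, not taken from the dataset):

```python
import re

sample_input = "J0:\nM0:3 M1:2 \nJ1:\nM1:2 M0:4 "

lines = sample_input.strip().split("\n")
job_operations = {}
# Lines alternate: job header, then that job's machine:duration pairs
for header, ops_line in zip(lines[0::2], lines[1::2]):
    job_id = int(re.search(r"J(\d+):", header).group(1))
    job_operations[job_id] = [(int(m), int(d)) for m, d in re.findall(r"M(\d+):(\d+)", ops_line)]

print(job_operations)
```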

In [12]:
# Function to build disjunctive graph representation of the parsed (structured) JSSP dictionary
# Uses networkx.DiGraph to store the directed graph, for better interpretability
def build_disjunctive_graph(parsed_problem: Dict) -> Tuple[nx.DiGraph, Dict]:
    """
    Return graph with structure:
    - Nodes: (job_id, operation_id) tuples + source and sink nodes (i.e., len = num_operations + 2)
    - Directed (conjunctive) edges: Job precedence constraints
    - Undirected (disjunctive) edges: Machine conflict constraints, stored as a pair of opposite directed edges
    - Node attributes: node_type, job_id, operation_id, machine_id, processing_time
    """

    # Initialise the graph
    G = nx.DiGraph()

    # Extract the operations for this JSSP instance
    job_ops = parsed_problem['job_operations']

    # Add source and sink nodes
    source = ('source', 0)
    sink = ('sink', 0)
    G.add_node(source, node_type='source', processing_time=0)
    G.add_node(sink, node_type='sink', processing_time=0)

    # Initialise reference dictionaries (per-node operation info and per-machine operation lists)
    node_to_operation = {}  # node -> (job_id, op_id, machine_id, proc_time) [NOT NEEDED?]
    # operation_to_node = {}  # (job_id, op_id) -> node [NOT NEEDED?]
    operations_by_machine = defaultdict(list)   # machine_id -> [nodes]

    # Iterate over each job to create nodes of each operation
    for job_id, operations in job_ops.items():
        # Then iterate over each operation of that job
        for op_id, (machine_id, proc_time) in enumerate(operations):
            # Each node stores the current job and operation
            node = (job_id, op_id)

            # Add the node, with relevant info for this operation
            G.add_node(node,
                      node_type='operation',
                      job_id=job_id,
                      operation_id=op_id,
                      machine_id=machine_id,
                      processing_time=proc_time)

            # Add same info to our reference dictionaries
            node_to_operation[node] = (job_id, op_id, machine_id, proc_time)
            # operation_to_node[(job_id, op_id)] = node
            operations_by_machine[machine_id].append(node)

    # Iterate over each job to add conjunctive edges (precedence constraints within jobs)
    for job_id, operations in job_ops.items():
        # First edge: Source → first operation of this job
        # first_op = operation_to_node[(job_id, 0)]
        first_op = (job_id, 0)
        G.add_edge(source, first_op, edge_type='conjunctive', weight=0)

        # Main edges: Operations in sequence within this job
        for op_id in range(len(operations) - 1):
            # from_node = operation_to_node[(job_id, op_id)]
            # to_node = operation_to_node[(job_id, op_id + 1)]
            from_node = (job_id, op_id)
            to_node = (job_id, op_id + 1)
            proc_time = operations[op_id][1]
            G.add_edge(from_node, to_node, edge_type='conjunctive', weight=proc_time)

        # Last edge: Last operation of this job -> sink
        # last_op = operation_to_node[(job_id, len(operations) - 1)]
        last_op = (job_id, len(operations) - 1)
        last_proc_time = operations[-1][1]
        G.add_edge(last_op, sink, edge_type='conjunctive', weight=last_proc_time)

    # Iterate over each machine to add disjunctive edges (machine constraints)
    # These will be undirected initially; orientation and weighting happens during scheduling
    for machine_id, ops_on_machine in operations_by_machine.items():
        # Operations sharing a machine form a complete graph (clique): every pair can conflict
        for i in range(len(ops_on_machine)):
            for j in range(i + 1, len(ops_on_machine)):
                op1, op2 = ops_on_machine[i], ops_on_machine[j]
                # Add as undirected edge (will be oriented during solution)
                G.add_edge(op1, op2, edge_type='disjunctive_undirected', machine_id=machine_id)
                G.add_edge(op2, op1, edge_type='disjunctive_undirected', machine_id=machine_id)

    # Return initialised graph, with mappings of each node to its associated operation info
    return G, node_to_operation
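
The disjunctive edges above are left unoriented. A schedule corresponds to choosing a processing order on each machine, and the makespan is then the longest path through the resulting directed graph. The notebook does not implement this step, so the following is only an illustrative sketch on a made-up 2x2 instance, using the standard-library `graphlib` module; `makespan` and its argument layout are hypothetical:

```python
from graphlib import TopologicalSorter

def makespan(job_ops, machine_order):
    """Longest-path makespan for a fixed order of operations on each machine.

    job_ops: {job_id: [(machine_id, duration), ...]}
    machine_order: {machine_id: [(job_id, op_id), ...]} in processing order
    """
    duration = {}
    preds = {}
    for job_id, ops in job_ops.items():
        for op_id, (_, dur) in enumerate(ops):
            node = (job_id, op_id)
            duration[node] = dur
            # Conjunctive edge: previous operation of the same job
            preds[node] = [(job_id, op_id - 1)] if op_id > 0 else []
    for sequence in machine_order.values():
        # Oriented disjunctive edges: consecutive operations on the same machine
        for earlier, later in zip(sequence, sequence[1:]):
            preds[later].append(earlier)

    finish = {}
    # static_order() yields predecessors first and raises CycleError for infeasible orders
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[node]), default=0)
        finish[node] = start + duration[node]
    return max(finish.values())

job_ops = {0: [(0, 3), (1, 2)], 1: [(1, 2), (0, 4)]}
machine_order = {0: [(0, 0), (1, 1)], 1: [(1, 0), (0, 1)]}
print(makespan(job_ops, machine_order))  # prints 7
```

Only consecutive pairs on each machine need an edge here, because the remaining orderings follow by transitivity.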

In [13]:
# Function to convert networkx.DiGraph to a PyTorch Geometric Data object
def graph_to_pygdata(G: nx.DiGraph,
                     node_to_operation: Dict) -> Data:
    """
    Node features (concatenated in this order):
    - machine_id (one-hot encoded)
    - job_id (one-hot encoded)
    - operation_index (normalized)
    - processing_time (normalized)
    Source and sink nodes get an all-zero feature vector; node_type is not encoded separately.

    Edge features:
    - edge_type (conjunctive = 1 / disjunctive = 0)
    - weight (normalized processing time for conjunctive edges, 0 for disjunctive edges)
    """

    # NOTE: Not currently using node_to_operation, as all the op info is stored in the node

    # Extract all operation nodes (exclude source/sink initially)
    operation_nodes = [n for n in G.nodes() if G.nodes[n]['node_type'] == 'operation']
    num_jobs = max([G.nodes[n]['job_id'] for n in operation_nodes]) + 1
    num_machines = max([G.nodes[n]['machine_id'] for n in operation_nodes]) + 1
    max_ops_per_job = max([G.nodes[n]['operation_id'] for n in operation_nodes]) + 1

    # print(operation_nodes)
    # print(num_jobs)
    # print(num_machines)
    # print(max_ops_per_job)

    # Create node index mapping (including source/sink nodes)
    # NOTE: Reordering the list to put source at start and sink at end, but this may not be needed
    # If so, can just use `node_idx = {node: i for i, node in enumerate(list(G))}`
    node_list = [list(G)[0]] + operation_nodes + [list(G)[1]]
    node_idx = {node: i for i, node in enumerate(node_list)}

    # print(node_list)
    # print(node_idx)

    # Iterate over nodes to create node feature matrix
    num_nodes = len(node_list)
    node_features = []
    for node in node_list:
        if G.nodes[node]['node_type'] == 'source':
            # Source node: all zeros
            features = [0] * (num_machines + num_jobs + 2)
            node_features.append(features)
        elif G.nodes[node]['node_type'] == 'sink':
            # Sink node: all zeros
            features = [0] * (num_machines + num_jobs + 2)
            node_features.append(features)
        else:
            # Operation node
            machine_id = G.nodes[node]['machine_id']
            job_id = G.nodes[node]['job_id']
            op_idx = G.nodes[node]['operation_id']
            proc_time = G.nodes[node]['processing_time']

            # TODO: Study/comment everything from this point
            # One-hot machine encoding
            machine_one_hot = [1 if i == machine_id else 0 for i in range(num_machines)]

            # One-hot job encoding
            job_one_hot = [1 if i == job_id else 0 for i in range(num_jobs)]

            # Normalized operation index and processing time
            op_idx_norm = op_idx / max(1, max_ops_per_job - 1)
            proc_time_norm = proc_time / 500.0  # Normalize by max typical processing time

            features = machine_one_hot + job_one_hot + [op_idx_norm, proc_time_norm]
            node_features.append(features)

    x = torch.tensor(node_features, dtype=torch.float32)

    # Create edge list
    edge_index = [[], []]
    edge_attr = []

    for u, v, data in G.edges(data=True):
        u_idx = node_idx[u]
        v_idx = node_idx[v]
        edge_index[0].append(u_idx)
        edge_index[1].append(v_idx)

        # Edge type: conjunctive=1, disjunctive=0
        edge_type = 1 if data['edge_type'] == 'conjunctive' else 0
        weight = data.get('weight', 0) / 500.0  # Normalize weight

        edge_attr.append([edge_type, weight])

    edge_index = torch.tensor(edge_index, dtype=torch.long)
    edge_attr = torch.tensor(edge_attr, dtype=torch.float32) if edge_attr else torch.zeros((0, 2))

    # Create PyTorch Geometric Data object
    pyg_data = Data(
        x=x,
        edge_index=edge_index,
        edge_attr=edge_attr,
        num_nodes=num_nodes,
        num_jobs=num_jobs,
        num_machines=num_machines,
        node_mapping=node_idx
    )

    return pyg_data
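
As a concrete illustration of the operation-node feature layout built above (machine one-hot, job one-hot, then the two normalized scalars), here is a plain-Python sketch for a single node. `operation_features` is a hypothetical helper; the 500.0 divisor mirrors the constant used in `graph_to_pygdata`:

```python
def operation_features(machine_id, job_id, op_idx, proc_time,
                       num_machines, num_jobs, max_ops_per_job):
    """Build the feature vector for one operation node."""
    machine_one_hot = [1 if i == machine_id else 0 for i in range(num_machines)]
    job_one_hot = [1 if i == job_id else 0 for i in range(num_jobs)]
    op_idx_norm = op_idx / max(1, max_ops_per_job - 1)
    proc_time_norm = proc_time / 500.0
    return machine_one_hot + job_one_hot + [op_idx_norm, proc_time_norm]

# Operation 1 of job 0 on machine 2, lasting 250 time units, in a 2-job, 3-machine problem
features = operation_features(machine_id=2, job_id=0, op_idx=1, proc_time=250,
                              num_machines=3, num_jobs=2, max_ops_per_job=3)
print(features)
```

Because the one-hot widths depend on the instance, the feature dimension differs between problems of different sizes, which any downstream GNN would need to handle.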


In [14]:
# Use the above utility functions to convert a Starjob instance into PyG Data
from pprint import pp

# def convert_starjob_prompt_to_gnn(nl_prompt: str) -> Data:
def convert_starjob_instance_to_gnn(instance: Dict) -> Data:
    """
    Full pipeline: Instance → Disjunctive Graph → PyG Data
    """
    # Step 1: Parse
    parsed = parse_jssp_instance(instance)
    # pp(parsed, compact=True, width=100)
    # return parsed

    # Step 2: Build graph
    G, node_to_op = build_disjunctive_graph(parsed)
    # print(G)
    # pp(node_to_op)
    # return G, node_to_op

    # Step 3: Convert to PyG
    pyg_data = graph_to_pygdata(G, node_to_op)

    return pyg_data

# print(new_sample['input'])
# print(type(new_sample))
pyg_data = convert_starjob_instance_to_gnn(new_sample)

In [15]:
print(f"Nodes: {pyg_data.num_nodes}")
print(f"Edges: {pyg_data.edge_index.shape[1]}")
print(f"Node features shape: {pyg_data.x.shape}")
print(f"Edge features shape: {pyg_data.edge_attr.shape}")

Nodes: 452
Edges: 13530
Node features shape: torch.Size([452, 47])
Edge features shape: torch.Size([13530, 2])
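
These sizes can be verified by hand. Assuming every job visits every machine exactly once (true for the sample above, with J = 30 jobs and M = 15 machines), the graph has J*M + 2 nodes, J*(M + 1) conjunctive edges, and M*J*(J - 1) directed disjunctive edges, while each node carries M + J + 2 features. A short check; `expected_sizes` is a hypothetical helper:

```python
def expected_sizes(num_jobs, num_machines):
    """Expected graph sizes when every job has exactly one operation per machine."""
    nodes = num_jobs * num_machines + 2                      # operations + source + sink
    conjunctive = num_jobs * (num_machines + 1)              # source edge + job chain + sink edge
    disjunctive = num_machines * num_jobs * (num_jobs - 1)   # both directions for each pair per machine
    node_feature_dim = num_machines + num_jobs + 2
    return nodes, conjunctive + disjunctive, node_feature_dim

print(expected_sizes(30, 15))
```

The disjunctive edges dominate: 13050 of the 13530 edges in this sample come from machine conflicts.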


In [16]:
# Convert all prompts to GNN representations
train_dataset_gnn = []
for instance in tqdm(train_dataset):
    try:
        pyg_data = convert_starjob_instance_to_gnn(instance)
        pyg_data.target_solution = instance['output']  # Attach solution
        train_dataset_gnn.append(pyg_data)
    except Exception as e:
        # print(f"Failed to parse: {e}")
        continue
print(f"Created {len(train_dataset_gnn)} GNN instances")

100%|██████████| 1827/1827 [15:23<00:00,  1.98it/s]

Created 1827 GNN instances





In [17]:
torch.save(train_dataset_gnn, "train_dataset_gnn.pt")
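
When reloading this file later, note that it contains pickled PyG `Data` objects rather than plain tensors. Recent PyTorch versions default `torch.load` to `weights_only=True`, which restricts what can be unpickled, so loading may require opting out explicitly. A minimal sketch (only for files you created yourself, since unpickling can execute code); `load_gnn_dataset` is a hypothetical helper:

```python
def load_gnn_dataset(path):
    """Load a list of PyG Data objects saved with torch.save (trusted files only)."""
    import torch  # imported lazily so the helper can be defined without PyTorch installed
    return torch.load(path, weights_only=False)

# Usage (not executed here):
# train_dataset_gnn = load_gnn_dataset("train_dataset_gnn.pt")
```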

In [18]:
print(type(train_dataset_gnn))
print(len(train_dataset_gnn))
print(train_dataset_gnn[0])

<class 'list'>
1827
Data(
  x=[602, 57],
  edge_index=[2, 24040],
  edge_attr=[24040, 2],
  num_nodes=602,
  num_jobs=40,
  num_machines=15,
  node_mapping={
    (source, 0)=0,
    (0, 0)=1,
    (0, 1)=2,
    (0, 2)=3,
    (0, 3)=4,
    (0, 4)=5,
    (0, 5)=6,
    (0, 6)=7,
    (0, 7)=8,
    (0, 8)=9,
    (0, 9)=10,
    (0, 10)=11,
    (0, 11)=12,
    (0, 12)=13,
    (0, 13)=14,
    (0, 14)=15,
    (1, 0)=16,
    (1, 1)=17,
    (1, 2)=18,
    (1, 3)=19,
    (1, 4)=20,
    (1, 5)=21,
    (1, 6)=22,
    (1, 7)=23,
    (1, 8)=24,
    (1, 9)=25,
    (1, 10)=26,
    (1, 11)=27,
    (1, 12)=28,
    (1, 13)=29,
    (1, 14)=30,
    (2, 0)=31,
    (2, 1)=32,
    (2, 2)=33,
    (2, 3)=34,
    (2, 4)=35,
    (2, 5)=36,
    (2, 6)=37,
    (2, 7)=38,
    (2, 8)=39,
    (2, 9)=40,
    (2, 10)=41,
    (2, 11)=42,
    (2, 12)=43,
    (2, 13)=44,
    (2, 14)=45,
    (3, 0)=46,
    (3, 1)=47,
    (3, 2)=48,
    (3, 3)=49,
    (3, 4)=50,
    (3, 5)=51,
    (3, 6)=52,
    (3, 7)=53,
    (3, 8)=54,
    (3,

In [19]:
# Same on eval_dataset
eval_dataset_gnn = []
for instance in tqdm(eval_dataset):
    try:
        pyg_data = convert_starjob_instance_to_gnn(instance)
        pyg_data.target_solution = instance['output']
        eval_dataset_gnn.append(pyg_data)
    except Exception as e:
        continue

100%|██████████| 117/117 [00:54<00:00,  2.17it/s]


In [20]:
torch.save(eval_dataset_gnn, "eval_dataset_gnn.pt")

In [21]:
print(type(eval_dataset_gnn))
print(len(eval_dataset_gnn))
print(eval_dataset_gnn[0])

<class 'list'>
117
Data(
  x=[802, 62],
  edge_index=[2, 32040],
  edge_attr=[32040, 2],
  num_nodes=802,
  num_jobs=40,
  num_machines=20,
  node_mapping={
    (source, 0)=0,
    (0, 0)=1,
    (0, 1)=2,
    (0, 2)=3,
    (0, 3)=4,
    (0, 4)=5,
    (0, 5)=6,
    (0, 6)=7,
    (0, 7)=8,
    (0, 8)=9,
    (0, 9)=10,
    (0, 10)=11,
    (0, 11)=12,
    (0, 12)=13,
    (0, 13)=14,
    (0, 14)=15,
    (0, 15)=16,
    (0, 16)=17,
    (0, 17)=18,
    (0, 18)=19,
    (0, 19)=20,
    (1, 0)=21,
    (1, 1)=22,
    (1, 2)=23,
    (1, 3)=24,
    (1, 4)=25,
    (1, 5)=26,
    (1, 6)=27,
    (1, 7)=28,
    (1, 8)=29,
    (1, 9)=30,
    (1, 10)=31,
    (1, 11)=32,
    (1, 12)=33,
    (1, 13)=34,
    (1, 14)=35,
    (1, 15)=36,
    (1, 16)=37,
    (1, 17)=38,
    (1, 18)=39,
    (1, 19)=40,
    (2, 0)=41,
    (2, 1)=42,
    (2, 2)=43,
    (2, 3)=44,
    (2, 4)=45,
    (2, 5)=46,
    (2, 6)=47,
    (2, 7)=48,
    (2, 8)=49,
    (2, 9)=50,
    (2, 10)=51,
    (2, 11)=52,
    (2, 12)=53,
    (2, 13)=54,

In [22]:
#Save it on Drive
from google.colab import drive
drive.mount('/content/drive')   # or '/content/gdrive' in some guides


Mounted at /content/drive


In [28]:
!ls '/content/drive/MyDrive/EECS 6412 - Project/3. Solution & Final Report/GNN_DATASET'

In [31]:
torch.save(train_dataset_gnn, "/content/drive/MyDrive/EECS 6412 - Project/3. Solution & Final Report/GNN_DATASET/train_dataset_gnn.pt")
print("Task Done")

Task Done


In [30]:
torch.save(eval_dataset_gnn, "/content/drive/MyDrive/EECS 6412 - Project/3. Solution & Final Report/GNN_DATASET/eval_dataset_gnn.pt")

### TODO: Study below
PPLX generated the steps below to prepare `*_dataset_gnn` before passing it to the LLM for fine-tuning. They need to be studied in more detail for the presentation and paper.

In [32]:
def pyg_to_augmented_nl(pyg_data: Data, original_nl: str = None) -> str:
    """
    Convert PyG graph back to natural language + explicit constraint annotations.
    This adds semantic structure without losing interpretability.
    """

    num_jobs = pyg_data.num_jobs
    num_machines = pyg_data.num_machines

    # If we have original NL, use it as foundation
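    if original_nl:
        augmented = original_nl + "\n\n[GRAPH_CONSTRAINTS]\n"
    else:
        # Open the constraint block here too, so the closing tag below is always matched
        augmented = f"Optimize schedule for {num_jobs} Jobs across {num_machines} Machines.\n\n[GRAPH_CONSTRAINTS]\n"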

    # Extract edges and annotate constraint types
    precedence_edges = []
    machine_edges = []

    for edge_idx in range(pyg_data.edge_index.shape[1]):
        u = pyg_data.edge_index[0, edge_idx].item()
        v = pyg_data.edge_index[1, edge_idx].item()
        edge_type = pyg_data.edge_attr[edge_idx, 0].item()

        if edge_type == 1:  # Conjunctive
            precedence_edges.append((u, v))
        else:  # Disjunctive
            machine_edges.append((u, v))

    if precedence_edges:
        augmented += "Precedence constraints: "
        augmented += ", ".join([f"Op{u}→Op{v}" for u, v in precedence_edges[:5]])  # Show first 5
        if len(precedence_edges) > 5:
            augmented += f" ... (+{len(precedence_edges) - 5} more)"
        augmented += "\n"

    if machine_edges:
        augmented += f"Machine conflict constraints: {len(machine_edges)} conflicts\n"

    augmented += "[/GRAPH_CONSTRAINTS]\n"

    return augmented

def convert_gnn_dataset_to_text_dataset(gnn_dataset, train_dataset_orig,
                                        tokenizer, alpaca_prompt):
    """
    Convert gnn_dataset (PyG objects) to HuggingFace dataset with text field.

    Args:
        gnn_dataset: List of dicts with 'pyg_data', 'target_solution'
        train_dataset_orig: Original Hugging Face dataset (for reference)
        tokenizer: HF tokenizer
        alpaca_prompt: Alpaca template

    Returns:
        HuggingFace Dataset with 'text' field
    """
    from datasets import Dataset

    texts = []

    for idx, gnn_item in enumerate(gnn_dataset):
        pyg_data = gnn_item['pyg_data']
        solution = gnn_item['target_solution']

        # Get original NL description if available
        orig_nl = train_dataset_orig[idx]['input'] if idx < len(train_dataset_orig) else None

        # Augmented NL
        input_text = pyg_to_augmented_nl(pyg_data, orig_nl)

        instruction = "Solve the job shop scheduling problem below:"
        formatted = alpaca_prompt.format(instruction, input_text, solution) + tokenizer.eos_token
        texts.append(formatted)

    # Convert to HF Dataset
    text_dataset = Dataset.from_dict({
        'text': texts,
        'idx': list(range(len(texts)))
    })

    return text_dataset

In [35]:
# =========================
# Convert GNN Dataset to Text Format for Training
# =========================

# A KeyError occurred because the converter expects dict items, but our
# train/eval GNN datasets are plain lists, as seen above.

def build_gnn_wrapped_dataset(gnn_list, orig_dataset):
    wrapped = []
    for i, pyg_data in enumerate(gnn_list):
        wrapped.append({
            "pyg_data": pyg_data,
            "target_solution": orig_dataset[i]["output"],  # or whatever field holds the solution
        })
    return wrapped
print("Converting datasets to dict type")

train_dataset_gnn_wrapped = build_gnn_wrapped_dataset(train_dataset_gnn, train_dataset)
eval_dataset_gnn_wrapped  = build_gnn_wrapped_dataset(eval_dataset_gnn,  eval_dataset)

print("Converting GNN datasets to text format for LLM training...")

# Create augmented text datasets
train_dataset_gnn_text = convert_gnn_dataset_to_text_dataset(
    gnn_dataset=train_dataset_gnn_wrapped,
    train_dataset_orig=train_dataset,
    tokenizer=tokenizer,
    alpaca_prompt=alpaca_prompt,
)

eval_dataset_gnn_text = convert_gnn_dataset_to_text_dataset(
    gnn_dataset=eval_dataset_gnn_wrapped,
    train_dataset_orig=eval_dataset,
    tokenizer=tokenizer,
    alpaca_prompt=alpaca_prompt,
)

print(f"Training dataset: {len(train_dataset_gnn_text)} instances")
print(f"Eval dataset: {len(eval_dataset_gnn_text)} instances")

# Verify format
print("\nSample training instance:")
print(train_dataset_gnn_text[0]['text'][:500])  # Print first 500 chars

Converting datasets to dict type
Converting GNN datasets to text format for LLM training...
Training dataset: 1827 instances
Eval dataset: 117 instances

Sample training instance:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Solve the job shop scheduling problem below:

### Input:
J0:
M2:124 M14:89 M1:72 M6:79 M8:116 M0:147 M12:22 M3:30 M10:128 M11:154 M9:14 M4:50 M5:155 M7:84 M13:18 
J1:
M4:68 M1:35 M2:134 M10:178 M8:31 M7:120 M11:11 M9:57 M6:47 M5:172 M3:119 M14:174 M0:34 M12:123 M13:49 
J2:
M9:175 M8:33 M14:159 M5:135 M6:5 M7:148 M4:168 M1:
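The 500-character preview above cuts off before the `[GRAPH_CONSTRAINTS]` block, so it does not confirm that the annotations were appended. A minimal sketch for pulling that block out of a formatted example is shown here; `extract_graph_constraints` is a hypothetical helper, and the sample string is made-up stand-in data rather than a real dataset row.

```python
import re
from typing import Optional


def extract_graph_constraints(text: str) -> Optional[str]:
    """Return the text between the constraint tags, or None if they are absent."""
    match = re.search(
        r"\[GRAPH_CONSTRAINTS\]\n(.*?)\[/GRAPH_CONSTRAINTS\]",
        text,
        flags=re.DOTALL,
    )
    return match.group(1).strip() if match else None


# Made-up stand-in for one formatted training example.
sample = (
    "### Input:\nJ0:\nM0:3 M1:2\n\n"
    "[GRAPH_CONSTRAINTS]\n"
    "Precedence constraints: Op0→Op1\n"
    "Machine conflict constraints: 2 conflicts\n"
    "[/GRAPH_CONSTRAINTS]\n"
)
constraints = extract_graph_constraints(sample)
print(constraints)
```

In the notebook, the same helper could be applied to `train_dataset_gnn_text[0]['text']` to inspect the appended annotations directly.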


In [38]:
#Saving the converted datasets
import json

train_dataset_gnn_text.to_json(
    "/content/drive/MyDrive/EECS 6412 - Project/3. Solution & Final Report/GNN_DATASET/train_dataset_gnn_text.json"
)
eval_dataset_gnn_text.to_json(
    "/content/drive/MyDrive/EECS 6412 - Project/3. Solution & Final Report/GNN_DATASET/eval_dataset_gnn_text.json"
)


4706112
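To check the saved files later without rebuilding the datasets, they can be read back record by record. The sketch below assumes `Dataset.to_json` was called with its default settings, which write one JSON object per line (JSON Lines). `read_jsonl` is a hypothetical helper, and the demo writes a small stand-in file to a temporary directory instead of touching the Drive paths.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Stand-in file with the same fields as the text datasets ('text', 'idx').
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_dataset_gnn_text.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"text": "example prompt", "idx": 0}) + "\n")
        handle.write(json.dumps({"text": "another prompt", "idx": 1}) + "\n")
    records = read_jsonl(demo_path)

print(len(records), records[0]["idx"])
```

To get a Hugging Face dataset back rather than a list of dicts, `datasets.load_dataset("json", data_files=path)` is the usual route.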

## Training (with GNN)
### TODO: Save model!!
This notebook does not yet save the model after fine-tuning. Add a save step before launching the fine-tuning run, so that the result of a completed run is not lost.
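One possible shape for that save step is sketched below. It assumes the fine-tuned `model` and `tokenizer` both expose the standard Hugging Face `save_pretrained(path)` method; the helper name `save_finetuned` and the target directory are hypothetical and should be adapted to the project layout.

```python
import os


def save_finetuned(model, tokenizer, output_dir: str) -> str:
    """Save a model and its tokenizer via their `save_pretrained` methods.

    Assumption: both objects follow the Hugging Face `save_pretrained(path)`
    convention. Nothing here is specific to this notebook's model.
    """
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir


# Hypothetical usage after `trainer.train()` has finished:
# save_finetuned(model, tokenizer, os.path.join(dir_out, "final_model"))
```

Writing to a Google Drive path rather than the Colab local disk would keep the saved weights available after the runtime is recycled.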

In [None]:
# =========================
# Initialize the Trainer with GNN-augmented dataset
# =========================

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset_gnn_text,  # Using GNN-derived datasets
    eval_dataset=eval_dataset_gnn_text,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=20,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=warmup_steps,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        # fp16=True,
        bf16=is_bfloat16_supported(),
        logging_steps=logging_steps,
        optim=optim,
        weight_decay=weight_decay,
        lr_scheduler_type=lr_scheduler_type,
        seed=seed,
        output_dir=dir_out,
        report_to="wandb",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        save_total_limit=save_total_limit,
        save_steps=save_step,
        eval_strategy="steps",
        eval_steps=save_step,
        per_device_eval_batch_size=per_device_eval_batch_size,
    ),
)


In [None]:
# =========================
# Monitor GPU Memory Usage
# =========================

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# =========================
# Start Training
# =========================

trainer_stats = trainer.train()

## Training (Original)

In [None]:
# =========================
# Initialize the Trainer
# =========================

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=20,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=warmup_steps,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        # fp16=True,
        bf16=is_bfloat16_supported(),
        logging_steps=logging_steps,
        optim=optim,
        weight_decay=weight_decay,
        lr_scheduler_type=lr_scheduler_type,
        seed=seed,
        output_dir=dir_out,
        report_to="wandb",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        save_total_limit=save_total_limit,
        save_steps=save_step,
        eval_strategy="steps",
        eval_steps=save_step,
        per_device_eval_batch_size=per_device_eval_batch_size,
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/127125 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/2595 [00:00<?, ? examples/s]

In [None]:
# =========================
# Monitor GPU Memory Usage
# =========================

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# =========================
# Start Training
# =========================

trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.


GPU = Tesla T4. Max memory = 14.741 GB.
6.883 GB of memory reserved.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 127,125 | Num Epochs = 2 | Total steps = 15,892
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 167,772,160 of 8,198,033,408 (2.05% trained)


Unsloth: Will smartly offload gradients to save VRAM!
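The figures in the banner above can be cross-checked with a little arithmetic. The sketch below is one way to reproduce the logged totals, assuming the step count is the ceiling of examples over effective batch size, multiplied by the number of epochs. The trainer's internal rounding may differ in other configurations, and `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_gpus, num_epochs):
    """Return (effective batch size, total optimizer steps) under a ceiling rule."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Values taken from the training banner above.
effective_batch, total_steps = total_optimizer_steps(
    num_examples=127_125,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_gpus=1,
    num_epochs=2,
)
print(effective_batch, total_steps)  # 16 15892

# Share of parameters that are trainable, as reported in the banner.
trainable_pct = 100 * 167_772_160 / 8_198_033_408
print(f"{trainable_pct:.2f}%")  # 2.05%
```

With 15,892 steps on a single T4, the run is long enough that intermediate checkpoints (`save_steps`) matter in case the Colab session disconnects.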


Step,Training Loss,Validation Loss
