| title | openenv-dsc-co |
|---|---|
| emoji | 📦 |
| colorFrom | indigo |
| colorTo | red |
| sdk | docker |
| app_port | 7860 |
| pinned | true |
| license | apache-2.0 |
| tags | |
| short_description | 30-step supply chain rlvr env with pulp milp oracle |

dynamic supply chain combinatorial orchestration. a meta openenv-compliant rlvr/rlve environment. a 30-step multi-echelon supply chain graph verified by a deterministic pulp/cbc mixed-integer linear programming oracle. 100% api/json driven. single unprivileged docker container. hf space.
| artifact | url |
|---|---|
| live hf space (env server) | https://huggingface.co/spaces/AceofStades/dsc_co |
| hf space (training node) | https://huggingface.co/spaces/AceofStades/openenv-dsc-co-training |
| github source | https://github.com/CYCLOP5/metascaler-hack |
| trained lora adapter | https://huggingface.co/AceofStades/dsc-co-grpo-lora |
| final training curve | https://huggingface.co/AceofStades/dsc-co-grpo-lora/blob/main/training_curve.png (uploaded with adapter) |
| final metrics artifacts | [results/training_metrics.csv](results/training_metrics.csv), [results/training_metrics.json](results/training_metrics.json), [results/training_summary.json](results/training_summary.json) |
| trackio live training dashboard | https://huggingface.co/spaces/AceofStades/dsc-co-trackio (separate dashboard Space in trackio_space/) |
| blog post | BLOG.md |
| demo video | https://youtu.be/bnp1fBDqrKI (BLOG.md has the full writeup) |
start here, then jump into whichever md interests you most:
| doc | what it shows |
|---|---|
| BOOTSTRAP.md | full setup, local eval, Space deploy, and A100 training path |
| BLOG.md | submission narrative: problem, verifier, training loop, proof |
| trackio_space/README.md | live Trackio dashboard Space for reward/loss monitoring |
| docs/architecture.md | runtime architecture, trainer/env/server data flow |
| docs/reward-spec.md | reward components, dense cap, terminal verifier signal |
| docs/milp-formulation.md | exact min-cost-flow MILP solved by CBC |
| docs/curriculum.md | four procedural difficulty tiers and reproducibility |
| docs/behavior-trace.md | qualitative before/after behavior trace for non-technical judges |
| docs/anti-hacking.md | hard gates against reward/specification hacking |
Open the runnable environment Space: https://huggingface.co/spaces/AceofStades/dsc_co
Click Reset, then use Step with these exact values:
Type:
call_tool
Tool Name:
query_network
Arguments:
{"source_id": "S0", "dest_id": "W0"}
Then ship inventory on the discovered edge:
Type:
call_tool
Tool Name:
dispatch_inventory
Arguments:
{"routes":[{"src":"S0","dst":"W0","qty":20}]}
Then advance the simulator:
Type:
call_tool
Tool Name:
advance_cycle
Arguments:
{}
Equivalent raw action payload:
{"type":"call_tool","tool_name":"query_network","arguments":{"source_id":"S0","dest_id":"W0"}}

| aspect | value |
|---|---|
| action space | 3 mcp tools, strict pydantic v2 validation |
| observation | typed json, partial-observable via query_network |
| horizon | 30 discrete steps |
| reward | dense shaping (≤ 0.4) + terminal clip(opt/agent, 0, 1) |
| verifier | coin-or cbc milp, zero-variance signal |
| trainer | trl grpo + unsloth + llama-3.2-3b-instruct qlora |
| curriculum | 4 procedurally-generated tiers with ema gating |
plain english: this is a store-shelf problem. the planner has to keep retailers stocked without wasting warehouse and supplier inventory too early. a bad planner panic-ships whatever is closest, looks good for a few turns, then runs out later and pays shortage penalties. a better planner sends smaller shipments early enough for lead time and keeps replenishment moving.
llms default to step-wise greedy decisions. give a 7b instruct model a 30-step supply chain and early moves permanently truncate the viable solution space. this env measures and trains through that failure mode with a zero-variance, math-optimal reward.
reward path diagram
before and after behavior trace
caption: concrete tier-1 seed-7 replay. the reactive baseline over-ships early and ends at terminal reward 0.423; the planned verifier replay sends smaller timed batches and reaches 0.959. see docs/behavior-trace.md. the final trained run improved terminal MILP reward from 0.052 to 0.226; raw trained completions were not preserved because DSC_LOG_COMPLETIONS=0.
| measurement | value |
|---|---|
| baseline greedy gap (tier 1, n=5) | 159% |
| baseline zero-op gap (tier 1, n=5) | 448% |
| milp-replay gap (tier 1, n=5) | 7% |
| gradient headroom | ~0.55 terminal reward points |
baseline terminal reward
Final GRPO evidence run: 400 steps, 2,000 prompts, 8 generations per prompt, max_completion_length=768, Llama-3.2-3B-Instruct 4-bit QLoRA via Unsloth on an A100 Space. Raw artifacts are committed under results/ and mirrored in the LoRA repo.
| metric | first logged step | final step | best / aggregate |
|---|---|---|---|
| combined reward | 0.622 | 1.304 | max 1.365 |
| cumulative env reward | 0.505 | 0.852 | last-25 mean 0.855 |
| terminal MILP reward | 0.052 | 0.226 | max 0.241 |
| reward std | 0.387 | 0.079 | frac_reward_zero_std=0 at final |
| GRPO train loss | — | -0.049 | 400-step run |
| runtime | — | 4h 51m | 0.023 steps/sec |
final GRPO training curve
caption: the model moves from sparse/low terminal verifier reward to stable non-zero MILP terminal reward while maintaining non-zero reward variance and gradients. The high completion clipping rate reflects inefficient stopping, but terminal reward remains active because valid parsed actions reach the 30-step verifier.
reward components
caption: the final metrics CSV is preserved under results/, so this plot is reproducible without relying on transient Space logs or screenshots.
training health
caption: reward variance remains non-zero and KL stays controlled through the final run.
completion diagnostics
caption: the model often uses the full completion budget, but parsed action sequences still reach terminal verifier scoring.
For a full setup guide, including local environment checks, OpenEnv deployment, and the intended Hugging Face A100 training workflow, see BOOTSTRAP.md.
make install
make test
make eval N=10 TIERS="1 2"
make viz
make serve
then from a second shell:
python client.py reset --tier 1 --seed 7
python client.py query S0 W0
python client.py dispatch S0 W0 50
python client.py advance
python client.py tools
live hf space: https://huggingface.co/spaces/AceofStades/dsc_co
reproduce the deploy:
huggingface-cli login
openenv push -r AceofStades/dsc_co --exclude .openenvignore
-r (aka --repo-id) takes username/env-name. --exclude .openenvignore is required — the cli's default ignore is only .*, __pycache__, *.pyc, so your local env/ venv would otherwise upload (~400 mb of compiled .so + cbc binaries = 500 error from hf).
optional flags: --private, --base-image ghcr.io/meta-pytorch/openenv-base:latest, --hardware cpu-basic, --env-var KEY=VAL, --secret KEY=VAL.
manual docker alternative:
docker build -t openenv-dsc-co .
docker run --rm -p 7860:7860 openenv-dsc-co
the hf space uses port 8000 by default (openenv.yaml). the root Dockerfile is wired for port 7860 if you deploy directly as a plain docker space.
we use huggingface spaces as on-demand, high-vram gpu compute nodes for grpo training. The intended full training path is the root Dockerfile on a Hugging Face A100 Docker Space. The notebooks in notebooks/ are provided for reference and experimentation, but the final evidence run was trained through the Docker Space workflow.
workflow:
- create a new space on huggingface (select A100 gpu hardware for the final/full run, Docker sdk).
- configure space secrets:
  - `HF_TOKEN`: a write-scope token from https://huggingface.co/settings/tokens
  - `DSC_HF_REPO`: `AceofStades/dsc-co-grpo-lora` (where the trained lora will be pushed)
- push this codebase to the space:
git remote add space https://huggingface.co/spaces/<your-user>/<your-space>
git push space master:main
- the space will build the `Dockerfile`, spin up a lightweight fastapi server (`app.py`) to satisfy the space's port 7860 health check, and kick off `train.py` in the background. check the space logs to watch unsloth run!
training stack:
- `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` 4-bit qlora, r=32
- defaults: `num_generations=4`, `max_completion_length=512`, `beta=0.04`; override with `DSC_NUM_GEN`, `DSC_MAX_COMPLETION`, `DSC_BETA`
- final-run knobs: `DSC_MAX_STEPS`, `DSC_DATA_N`, `DSC_BATCH_SIZE`, `DSC_GRAD_ACCUM`, `DSC_LR`, `DSC_EPOCHS`, `DSC_TEMP`, `DSC_SAVE_STEPS`
- checkpoint recovery: set `DSC_RESUME=1` to resume the latest checkpoint in `DSC_OUT_DIR`
- vllm/fast inference is enabled only when the GPU supports it and `vllm` is installed; otherwise training falls back to the standard unsloth path
- when TRL does not pass environments into reward functions, `train.py` locally replays JSON tool actions through `DSCToolEnv` so rewards, loss, and gradients remain non-zero
- final model uploads include `training_metrics.json`, `training_metrics.csv`, and `training_curve.png` alongside the LoRA adapter
- `trackio.log({...})` streams metrics to the trackio dashboard.
- set `DSC_LOG_COMPLETIONS=1` on future runs to preserve exact trained JSON action traces for side-by-side qualitative demos.
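
the default/override pairing above can be sketched as a small config loader. this is a minimal illustration, not the code in `train.py`: the `GrpoKnobs` class and `knobs_from_env` helper are hypothetical names, and only the three defaults and their `DSC_*` override variables come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GrpoKnobs:
    """Hypothetical container for the three documented defaults."""
    num_generations: int = 4
    max_completion_length: int = 512
    beta: float = 0.04


def knobs_from_env(env: Optional[Mapping[str, str]] = None) -> GrpoKnobs:
    """Read DSC_* overrides, falling back to the documented defaults."""
    env = os.environ if env is None else env
    defaults = GrpoKnobs()
    return GrpoKnobs(
        num_generations=int(env.get("DSC_NUM_GEN", defaults.num_generations)),
        max_completion_length=int(
            env.get("DSC_MAX_COMPLETION", defaults.max_completion_length)
        ),
        beta=float(env.get("DSC_BETA", defaults.beta)),
    )


if __name__ == "__main__":
    # The final evidence run overrides two of the three defaults.
    print(knobs_from_env({"DSC_NUM_GEN": "8", "DSC_MAX_COMPLETION": "768"}))
```

passing a plain dict instead of `os.environ` keeps the loader easy to check without touching the real process environment.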
final HF Space evidence run preset:
DSC_MAX_STEPS=400
DSC_DATA_N=2000
DSC_NUM_GEN=8
DSC_MAX_COMPLETION=768
DSC_SAVE_STEPS=50
DSC_RESUME=0
DSC_DEBUG=0
DSC_LOG_COMPLETIONS=0
DSC_TRACKIO=openenv-dsc-co
DSC_TRACKIO_SPACE=AceofStades/dsc-co-trackio

reload the trained adapter anywhere with:
from unsloth import FastLanguageModel
model, tok = FastLanguageModel.from_pretrained(
    "AceofStades/dsc-co-grpo-lora",
    max_seq_length=8192, load_in_4bit=True, fast_inference=True,
)
FastLanguageModel.for_inference(model)

canonical openenv multi-mode deployment layout:
openenv-dsc-co/
├── BOOTSTRAP.md full setup, local eval, deployment, and A100 training guide
├── pyproject.toml package metadata, OpenEnv/core deps, pytest pythonpath
├── uv.lock pinned local/OpenEnv dependency resolution
├── openenv.yaml OpenEnv manifest: app server.app:app on port 8000
├── __init__.py package marker for OpenEnv packaging
├── models.py root re-export shim for OpenEnv structural checks
├── client.py CLI for reset/query/dispatch/advance/tools/health
├── app.py training Space health server that starts train.py in a thread
├── README.md main judge-facing overview, links, examples, results
├── BLOG.md short submission writeup / narrative
├── Makefile common install/test/eval/viz/serve/docker/train commands
├── Dockerfile GPU training Space image; launches app.py then train.py
├── .gitignore local cache/output ignore rules
├── .dockerignore keeps local envs, outputs, and trackio_space out of training image
├── .openenvignore keeps training-only/local files out of OpenEnv env pushes
├── requirements.txt environment/runtime deps: fastapi, openenv, pulp, matplotlib
├── requirements-train.txt cuda training deps: torch, trl, unsloth, trackio
├── server/
│ ├── __init__.py
│ ├── app.py OpenEnv/FastAPI app; browser JSON-argument adapter; fallback routes
│ ├── Dockerfile openenv-base multi-stage build for `openenv push`
│ ├── dsc_environment.py DSCEnv(MCPEnvironment) + 4-tier curriculum + 3 fastmcp tools
│ ├── models.py pydantic v2 schemas (DSCAction RootModel, strict int qty)
│ ├── solver.py pulp time-expanded min-cost flow + greedy baseline
│ └── policies.py zero_op, greedy, optimal_replay baseline rollouts
├── tests/
│ ├── __init__.py
│ ├── test_models.py strict-int qty, action envelope parsing, observation schema
│ ├── test_env.py reset shapes, anti-hack gates, valid flow, horizon termination
│ └── test_solver.py milp correctness, tier shapes, bipartite edges
├── notebooks/
│ ├── train_hf_space.ipynb reference notebook; final training used root Dockerfile on HF A100
│ └── train_kaggle.ipynb alternate experiment notebook, not the canonical final run path
├── docs/
│ ├── architecture.md
│ ├── reward-spec.md
│ ├── milp-formulation.md
│ ├── curriculum.md
│ ├── behavior-trace.md
│ └── anti-hacking.md
├── assets/
│ ├── before_after_trace.svg qualitative reactive vs planned behavior trace
│ ├── reward_path.svg reward/verifier flow diagram
│ ├── completion_diagnostics.png completion length and clipping diagnostics
│ ├── gap_hist.png baseline optimality gap histogram
│ ├── reward_components.png final run reward component curves
│ ├── training_health.png reward variance and KL diagnostics
│ ├── terminal_bars.png baseline terminal reward comparison
│ └── training_curve.png final GRPO reward/loss curve
├── results/
│ ├── training_metrics.csv final LoRA repo metrics export
│ ├── training_metrics.json final LoRA repo metrics export
│ └── training_summary.json final run summary
├── stufftodo/
│ └── demo_animation.html local animation page used for the demo video
├── trackio_space/
│ ├── README.md Trackio dashboard Space card/deploy notes
│ ├── app.py live Trackio dashboard app
│ └── requirements.txt Trackio dashboard dependency
├── train.py HF Space GRPO training, local replay fallback, artifact upload
├── eval.py deterministic baseline rollout harness -> eval.json
└── viz.py renders gap, terminal, trajectory, and training curves
deep dive: docs/architecture.md
| tool | args | semantics |
|---|---|---|
| query_network | source_id: str, dest_id: str | returns {exists, lead_time, unit_cost, capacity} |
| dispatch_inventory | routes: [{src, dst, qty}], max 8 | strict int qty ≥ 1; deducts inv, schedules shipment |
| advance_cycle | none | ticks time, processes arrivals, deducts demand, accrues costs; finalize at step 30 |

max 5 calls per cycle; advance_cycle resets the per-cycle counter.
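
a minimal sketch of how a client might build the three action envelopes and track the per-cycle call budget. the envelope shape matches the raw payload shown earlier; the helper names and the `CallBudget` class are hypothetical, and treating `advance_cycle` as always allowed (it only resets the counter) is an assumption, since the text does not say whether it counts against the limit.

```python
import json

MAX_CALLS_PER_CYCLE = 5  # "max 5 calls per cycle"
MAX_ROUTES = 8           # "routes: [...], max 8"


def call_tool(tool_name, **arguments):
    """Wrap a tool call in the call_tool action envelope."""
    return {"type": "call_tool", "tool_name": tool_name, "arguments": arguments}


def query_network(source_id, dest_id):
    return call_tool("query_network", source_id=source_id, dest_id=dest_id)


def dispatch_inventory(routes):
    if len(routes) > MAX_ROUTES:
        raise ValueError(f"at most {MAX_ROUTES} routes per dispatch")
    return call_tool("dispatch_inventory", routes=list(routes))


def advance_cycle():
    return call_tool("advance_cycle")


class CallBudget:
    """Hypothetical client-side mirror of the per-cycle call counter."""

    def __init__(self, limit=MAX_CALLS_PER_CYCLE):
        self.limit = limit
        self.used = 0

    def record(self, action):
        """Return True if the action fits the budget, False otherwise."""
        if action["tool_name"] == "advance_cycle":
            self.used = 0  # assumption: advancing always resets the counter
            return True
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


if __name__ == "__main__":
    print(json.dumps(query_network("S0", "W0"), separators=(",", ":")))
    print(json.dumps(dispatch_inventory([{"src": "S0", "dst": "W0", "qty": 20}])))
    print(json.dumps(advance_cycle()))
```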
{
"step": 0,
"network_status": "nominal" | "disrupted",
"nodes": [
{"id", "type": "supplier" | "warehouse" | "retail",
"inventory", "max_capacity", "holding_cost", "demand_forecast"}
],
"pipeline": [{"src", "dst", "qty", "arrival_step"}],
"reward": float, "done": bool,
"metadata": {"tier", "agent_cost", "optimal_cost", "terminal", "calls_this_cycle"}
}
deep dive: docs/reward-spec.md
| component | type | value | trigger | cap |
|---|---|---|---|---|
| r_schema | dense | +0.05 | valid pydantic-parsed tool call | sum dense ≤ 0.4 |
| r_valid | dense | +0.10 | dispatch with existing edge + inv sufficient | sum dense ≤ 0.4 |
| r_terminal | sparse | clip(opt/agent, 0, 1) | step == 30 | — |
| r_neg_exploit | terminal | −1.0 + done | qty ≤ 0 or float | — |
| r_phantom_edge | terminal | 0 + done | dispatch over edge not in adjacency | — |
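
a rough sketch of how the dense and terminal components in the table combine. the constants are taken from the table; the function names are hypothetical, the cap is assumed to apply to the dense sum over a whole episode, and the behavior for a non-positive agent cost is not specified in this document, so the sketch simply rejects it. see docs/reward-spec.md for the authoritative rules.

```python
DENSE_CAP = 0.4   # "sum dense ≤ 0.4"
R_SCHEMA = 0.05   # valid pydantic-parsed tool call
R_VALID = 0.10    # dispatch on an existing edge with sufficient inventory


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def dense_total(n_schema_ok, n_valid_dispatch):
    """Dense shaping reward, capped (assumed: once per episode)."""
    raw = R_SCHEMA * n_schema_ok + R_VALID * n_valid_dispatch
    return min(raw, DENSE_CAP)


def terminal_reward(optimal_cost, agent_cost):
    """clip(opt/agent, 0, 1), paid when step == 30."""
    if agent_cost <= 0:
        # Not covered by the table above; refusing is this sketch's choice.
        raise ValueError("agent_cost must be positive")
    return clip(optimal_cost / agent_cost, 0.0, 1.0)


if __name__ == "__main__":
    dense = dense_total(n_schema_ok=10, n_valid_dispatch=10)  # hits the cap
    term = terminal_reward(optimal_cost=100.0, agent_cost=259.0)
    print(dense, round(term, 3), round(dense + term, 3))
```

because the dense part saturates at 0.4, any further improvement has to come from the terminal term, which is the MILP-verified part of the signal.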
deep dive: docs/curriculum.md
| tier | suppliers | warehouses | retail | lead time | demand | disruptions |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | L=1 | static | none |
| 2 | 3 | 5 | 10 | L=1 | gaussian | none |
| 3 | 5 | 10 | 20 | L∈[1..5] | gaussian | capacity jitter |
| 4 | 7 | 14 | 28 | L∈[1..7] | seasonal | severe strikes |
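
the tier table can be written down as data, with an ema gate deciding when to move up. the tier shapes below are copied from the table; everything about the gate (the `EmaGate` class, `alpha`, `threshold`, resetting the average on promotion) is a hypothetical illustration of "ema gating", not the rule used by the environment. see docs/curriculum.md for the real one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    suppliers: int
    warehouses: int
    retail: int
    max_lead_time: int
    demand: str
    disruptions: str


# Shapes copied from the curriculum table.
TIERS = {
    1: TierSpec(1, 1, 1, 1, "static", "none"),
    2: TierSpec(3, 5, 10, 1, "gaussian", "none"),
    3: TierSpec(5, 10, 20, 5, "gaussian", "capacity jitter"),
    4: TierSpec(7, 14, 28, 7, "seasonal", "severe strikes"),
}


class EmaGate:
    """Hypothetical gate: promote when the EMA of terminal reward clears a bar."""

    def __init__(self, alpha=0.1, threshold=0.6, max_tier=4):
        self.alpha = alpha          # assumed smoothing factor
        self.threshold = threshold  # assumed promotion bar
        self.max_tier = max_tier
        self.tier = 1
        self.ema = 0.0

    def update(self, terminal_reward):
        self.ema = (1 - self.alpha) * self.ema + self.alpha * terminal_reward
        if self.ema >= self.threshold and self.tier < self.max_tier:
            self.tier += 1
            self.ema = 0.0  # assumption: restart the average on a new tier
        return self.tier


if __name__ == "__main__":
    gate = EmaGate(alpha=0.5, threshold=0.6)
    for r in (1.0, 1.0, 0.2):
        tier = gate.update(r)
        print(tier, TIERS[tier])
```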
deep dive: docs/milp-formulation.md
min Σ_{e,t} c_e · x[e,t]
+ Σ_{n,t} h_n · I[n,t]
+ Σ_{n,t} P · u[n,t]
s.t. I[n, 0] = I0_n
I[n, t+1] = I[n, t] + arrivals(n, t) − departures(n, t) − d[n, t] + u[n, t]
I[n, t] ≤ cap_n
x[e, t] ≤ cap_e
Σ_{e: src=s, t} x[e, t] ≤ sup_cap (supplier)
arrivals(n, t) = Σ_{e: dst=n, t−L_e ≥ 0} x[e, t − L_e]
solver: pulp.PULP_CBC_CMD(msg=0, timeLimit=30).
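
to make the objective and the inventory balance concrete, here is a small stdlib-only sketch that scores a fixed shipment plan on a single supplier-to-retail edge. it does not solve the MILP (CBC does that); it only evaluates transport, holding, and shortage cost for a given `x[t]`. the function name, the single-edge setup, and charging holding cost on inventory at the start of each step are assumptions made for illustration.

```python
def plan_cost(x, demand, i0_src, i0_dst, lead, unit_cost,
              hold_src, hold_dst, penalty):
    """Cost of shipping x[t] units on one edge src -> dst with lead time `lead`.

    Mirrors the formulation term by term:
      c_e * x[e,t]  +  h_n * I[n,t]  +  P * u[n,t]
    where u[n,t] is the unmet demand that keeps inventory non-negative.
    """
    inv_src, inv_dst = i0_src, i0_dst
    total = 0.0
    for t in range(len(x)):
        # Holding cost on inventory at time t (assumed timing).
        total += hold_src * inv_src + hold_dst * inv_dst
        if x[t] > inv_src:
            raise ValueError(f"step {t}: shipping more than supplier holds")
        # arrivals(n, t) = x[e, t - L_e] when t - L_e >= 0
        arrivals = x[t - lead] if t - lead >= 0 else 0
        available = inv_dst + arrivals
        unmet = max(0, demand[t] - available)
        total += unit_cost * x[t] + penalty * unmet
        # I[n,t+1] = I[n,t] + arrivals - departures - d[n,t] + u[n,t]
        inv_src = inv_src - x[t]
        inv_dst = available - demand[t] + unmet
    return total


if __name__ == "__main__":
    common = dict(demand=[0, 5, 5], i0_src=10, i0_dst=0, lead=1,
                  unit_cost=1, hold_src=0, hold_dst=1, penalty=10)
    print("ship ahead of demand:", plan_cost([5, 5, 0], **common))
    print("never ship:          ", plan_cost([0, 0, 0], **common))
```

shipping one step ahead of demand covers the lead time and avoids every shortage penalty, which is the behavior the terminal reward is meant to favor.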
deep dive: docs/anti-hacking.md
| vector | defense |
|---|---|
| negative / zero / float qty | pydantic strict=True, ge=1 + pre-mutation _is_underflow_qty check → reward −1.0, done |
| cyclic reward farming | MAX_CALLS_PER_CYCLE=5 + DENSE_CAP=0.4; holding cost > dense reward |
| phantom edge hallucination | immutable _adjacency: frozenset built at reset → dispatch off-graph ends episode |
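
a stdlib-only sketch of the first and third gates in the table. the real environment uses pydantic strict validation plus `_is_underflow_qty`; the plain-python checks here are a stand-in, and the `gate` function, the reason strings, and checking all quantities before any edge are hypothetical choices. the terminal rewards (−1.0 and 0, both ending the episode) come from the reward table.

```python
GATE_REWARD = {"neg_exploit": -1.0, "phantom_edge": 0.0}  # both end the episode


def is_strict_positive_int(qty):
    """True only for real ints >= 1 (bool, float, and str are rejected)."""
    # type(...) is int excludes bool, which subclasses int in Python.
    return type(qty) is int and qty >= 1


def gate(routes, adjacency):
    """Return the name of the first gate that fires, or None if all pass."""
    # Quantity check runs first, before any state would be mutated.
    for route in routes:
        if not is_strict_positive_int(route.get("qty")):
            return "neg_exploit"
    # Edge check against the immutable edge set built at reset.
    for route in routes:
        if (route.get("src"), route.get("dst")) not in adjacency:
            return "phantom_edge"
    return None


if __name__ == "__main__":
    adjacency = frozenset({("S0", "W0")})
    for routes in (
        [{"src": "S0", "dst": "W0", "qty": 20}],
        [{"src": "S0", "dst": "W0", "qty": 0}],
        [{"src": "S0", "dst": "W0", "qty": 2.0}],
        [{"src": "W0", "dst": "S0", "qty": 5}],
    ):
        fired = gate(routes, adjacency)
        print(routes[0]["qty"], fired, GATE_REWARD.get(fired))
```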
make eval N=20 TIERS="1 2 3"
make viz
produces:
- `eval.json` with per-rollout cost, gap, terminal
- `assets/gap_hist.png` per-tier gap histograms
- `assets/terminal_bars.png` mean terminal reward bars by policy × tier
baseline optimality gap histogram
caption: non-trained baseline policies leave large optimality gaps; optimal_replay is the MILP-derived upper-bound behavior.
baseline terminal reward by policy
caption: terminal reward has clear headroom for RL; greedy behavior is far below the MILP replay ceiling on tier 1.
baseline numbers on 5 seeds (tier 1 / tier 2):
| policy | tier 1 gap / terminal | tier 2 gap / terminal |
|---|---|---|
| zero_op | 4.48 / 0.19 | 1.01 / 0.51 |
| greedy | 1.59 / 0.39 | 0.35 / 0.75 |
| optimal_replay | 0.07 / 0.94 | 0.02 / 0.98 |
greedy ↔ optimal_replay gap is the rl learning target (~0.55 terminal reward points on tier 1).
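
the gap and terminal columns are two views of the same cost ratio. assuming gap is defined as (agent_cost − optimal_cost) / optimal_cost (the table does not spell this out), the terminal reward clip(opt/agent, 0, 1) equals 1 / (1 + gap) for a single rollout. the sketch below checks that relation against the tier-1 means; small differences are expected because the table averages per-seed values, and the mean of a ratio is not the ratio of the means.

```python
def terminal_from_gap(gap):
    """opt/agent rewritten in terms of gap = (agent - opt) / opt, for gap >= 0."""
    return 1.0 / (1.0 + gap)


# (policy, mean gap, mean terminal) from the tier-1 column of the table.
TIER1 = [
    ("zero_op", 4.48, 0.19),
    ("greedy", 1.59, 0.39),
    ("optimal_replay", 0.07, 0.94),
]

if __name__ == "__main__":
    for policy, gap, reported in TIER1:
        implied = terminal_from_gap(gap)
        print(f"{policy:15s} implied={implied:.3f} reported={reported:.2f}")
    headroom = TIER1[2][2] - TIER1[1][2]
    print(f"greedy -> optimal_replay headroom: {headroom:.2f}")
```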
make test
43 tests across models, env, solver:
- strict-int qty rejection (5 cases)
- action envelope parsing (3 tool kinds + rejection)
- tier 1/2 shape invariants + determinism
- all 3 anti-hack gates under step()
- horizon termination with milp finalize
- milp correctness: `optimal_cost ≤ greedy_cost` on random tier-1 scenarios
- bipartite edge topology
meta pytorch openenv team. huggingface trl team. unsloth team. coin-or cbc. apache-2.0.