Your machine trains a neural net on HPC job runtime data and exposes it as a prediction server. Remote users submit job configs and get back a predicted runtime + scheduling verdict — without needing TensorFlow or the training data.
```
Remote user A ─┐
Remote user B ─┼──► POST /predict ──► YOUR SERVER ──► model.predict()
Remote user C ─┘                           │
                                           └──► AI scheduler ──► SLURM/PBS
```
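Under the hood, each prediction is one JSON POST to `/predict`. A sketch of the request body (key names are assumptions mirroring the `JobConfig` fields used by the Python client below; `GET /schema` is the authoritative reference):

```json
{
  "machine": "summit",
  "app": "miniVite",
  "ranks": 64,
  "nodes": 4,
  "threads_per_rank": 4,
  "graph_scale_M": 20,
  "avg_edges_per_vertex": 16,
  "base_runtime_s": 142.0,
  "job_id": "my-job-001"
}
```

The response carries at least `verdict` and `est_wall_time_s` (the fields the client prints), plus the predicted slowdown interval shown in the sample output below.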
```bash
pip install -r requirements.txt

# 1. Train the model and save all artifacts
python train_and_save.py

# 2. Start the prediction server
uvicorn prediction_server:app --host 0.0.0.0 --port 8000
```

The server is now live at http://YOUR_IP:8000.
Check it's running:

```bash
curl http://localhost:8000/health
# → {"status":"ok","model":"miniVite_pass2"}
```

See valid machines/apps:

```bash
curl http://localhost:8000/schema
```

Remote users install the client dependency:

```bash
# Only needs requests — no TensorFlow required
pip install requests
```
```bash
# Single job
python ai_scheduler.py \
  --server http://YOUR_SERVER_IP:8000 \
  --machine summit \
  --app miniVite \
  --ranks 64 \
  --nodes 4 \
  --threads 4 \
  --scale 20 \
  --avg-degree 16 \
  --base-time 142
```
Output:

```
────────────────────────────────────────────────────────
Job           : (no id)
Machine / App : summit / miniVite
Ranks / Nodes : 64 / 4
Graph scale   : 20M vertices
────────────────────────────────────────────────────────
Predicted Δ   : 1.32× (1.122–1.518)
Est. wall time: 187.4s
Inference     : 12.3ms
Verdict       : SCHEDULE_NOW
Reason        : Predicted cost is low — safe to run immediately.
────────────────────────────────────────────────────────
```
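The estimated wall time is consistent with scaling the supplied base runtime by the midpoint of the predicted slowdown; the sample numbers above check out as:

```python
# Assumed relationship, matching the sample output above:
# est. wall time = base runtime × predicted slowdown midpoint
base_runtime_s = 142.0
predicted_slowdown = 1.32
print(round(base_runtime_s * predicted_slowdown, 1))  # → 187.4
```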
Batch of jobs:

```bash
python ai_scheduler.py \
  --server http://YOUR_SERVER_IP:8000 \
  --batch example_jobs.json
```

Dry run (predict only, don't submit):

```bash
python ai_scheduler.py --server http://YOUR_SERVER_IP:8000 \
  --machine frontier --ranks 256 --scale 80 --dry-run
```

From Python:

```python
from ai_scheduler import AIScheduler, JobConfig

scheduler = AIScheduler("http://YOUR_SERVER_IP:8000")

job = JobConfig(
    machine="summit", app="miniVite",
    ranks=64, nodes=4, threads_per_rank=4,
    graph_scale_M=20, avg_edges_per_vertex=16,
    base_runtime_s=142.0,
    job_id="my-job-001",
)

result = scheduler.submit(job)
print(result["verdict"])            # "SCHEDULE_NOW"
print(result["est_wall_time_s"])    # 187.4
```

| File | Where it runs | Purpose |
|---|---|---|
| `train_and_save.py` | your machine | trains pass-1 + pass-2 model, saves all artifacts |
| `prediction_server.py` | your machine | FastAPI server; loads model, serves `/predict` |
| `ai_scheduler.py` | remote user | client; calls `/predict`, submits to SLURM |
| `example_jobs.json` | remote user | sample batch input |
| `model_output/` | your machine | `miniVite_pass2.keras` + `preprocessor.pkl` + `feature_names.json` |
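`example_jobs.json` holds a list of job configs for `--batch`. A plausible shape (keys assumed to mirror the `JobConfig` fields; the runtimes here are illustrative — defer to the shipped sample file):

```json
[
  {
    "machine": "summit",
    "app": "miniVite",
    "ranks": 64,
    "nodes": 4,
    "threads_per_rank": 4,
    "graph_scale_M": 20,
    "avg_edges_per_vertex": 16,
    "base_runtime_s": 142.0,
    "job_id": "job-001"
  },
  {
    "machine": "frontier",
    "app": "miniVite",
    "ranks": 256,
    "nodes": 16,
    "threads_per_rank": 4,
    "graph_scale_M": 80,
    "avg_edges_per_vertex": 16,
    "base_runtime_s": 410.0,
    "job_id": "job-002"
  }
]
```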
In `ai_scheduler.py`, replace the `_queue_job` method body with your actual SLURM/PBS call:

```python
def _queue_job(self, config: JobConfig, priority: str):
    import subprocess

    # Map the job config onto an sbatch command line
    cmd = [
        "sbatch",
        f"--job-name={config.app}",
        f"--nodes={config.nodes}",
        f"--ntasks={config.ranks}",
        f"--qos={priority}",
        "run_job.sh",
    ]
    subprocess.run(cmd, check=True)
```
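For PBS, the same hook can shell out to `qsub` instead. A sketch, assuming `ranks` is divisible by `nodes` and that `priority` maps to a queue name; `build_qsub_cmd` and `run_job.sh` here are illustrative names, not part of the shipped client:

```python
import subprocess

def build_qsub_cmd(app: str, nodes: int, ranks: int, priority: str) -> list:
    # PBS expresses the node/rank layout as a select statement:
    # <nodes> chunks, each running ranks // nodes MPI processes.
    return [
        "qsub",
        "-N", app,
        "-l", f"select={nodes}:mpiprocs={ranks // nodes}",
        "-q", priority,
        "run_job.sh",
    ]

def _queue_job(self, config, priority: str):
    subprocess.run(
        build_qsub_cmd(config.app, config.nodes, config.ranks, priority),
        check=True,
    )
```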