Single-device direct run causes ~60s re-dispatch loop because queueRun() does not update job status on dispatch #1

@Kikk79

Summary

When a script is started via the single-device path (Run on a node, no
Batch Deploy), the server keeps re-dispatching the same job to the
agent every 60 seconds for as long as the script is still running.
The agent obeys each dispatch and spawns a fresh PowerShell instance,
so multiple copies of the same script run in parallel and clobber each
other's state.

This bites long-running scripts hard. The script never gets the chance
to finish before the next dispatch arrives.

Symptom

Reproducing on master: a long-running PowerShell script (roughly 60–90 s
total runtime) on a single Windows 11 node produces a series of
overlapping log files, each 1–3 minutes apart. In Task Manager,
multiple powershell.exe processes from the plugin show up in
parallel. Only the last instance to finish writes its result line
back to the server, and that result reflects whatever state the node
was in by then. Earlier instances that did real work usually do not
get to report at all.

Root cause

innovoscripttask.js master branch:

  • L47 — obj.intervalTimer = setInterval(obj.queueRun, 1 * 60 * 1000)
    Queue runner ticks every 60 s.
  • L90–116 — queueRun() reads pending jobs, dispatches each to its
    agent, and updates only dispatchTime:
    obj.meshServer.webserver.wsagents[job.node].send(JSON.stringify(jObj));
    obj.db.update(job._id, { dispatchTime: dispatchTime });
    The job status field is not updated. The job remains "pending".
  • L308–314 — nodeTimeoutSec, batchTimeoutSec, staggerSec,
    batchIntervalSec apply only to the Batch Deploy path; single-device
    runs bypass them entirely (per code structure and confirmed in UI:
    these timeouts are only configurable in the Batch Deploy dialog).

modules_meshcore/scripttask.js master branch:

  • L168–171 — agent spawns PowerShell via child_process.execFile(...)
    for every incoming dispatch, with no de-duplication against an
    already-running job for the same _id.
  • No setTimeout / setInterval watchdog and no heartbeat back to
    the server while a script is running. Result reaches the server only
    via finalizeJob() in the child-process exit handler.
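
The missing agent-side de-duplication could look roughly like the sketch below. It is only an illustration: runningJobIds, handleDispatch and the injected startProcess callback are hypothetical names, not identifiers from modules_meshcore/scripttask.js, and the real process launch is replaced by a fake so the example runs anywhere.

```javascript
// Hypothetical guard: remember which job _ids currently have a running process.
const runningJobIds = new Set();

// Returns true if a process was started, false if the dispatch was a duplicate.
// startProcess(job, onExit) is an assumed wrapper around the real process launch.
function handleDispatch(job, startProcess) {
  if (runningJobIds.has(job._id)) return false; // same job is still running, ignore
  runningJobIds.add(job._id);
  startProcess(job, () => runningJobIds.delete(job._id)); // free the slot on exit
  return true;
}

// Fake launcher for demonstration: it only records the exit callback.
const exitCallbacks = [];
const fakeStart = (job, onExit) => exitCallbacks.push(onExit);

const first = handleDispatch({ _id: 'job-1' }, fakeStart);  // starts
const second = handleDispatch({ _id: 'job-1' }, fakeStart); // duplicate, skipped
exitCallbacks[0]();                                         // simulated process exit
const third = handleDispatch({ _id: 'job-1' }, fakeStart);  // allowed again

console.log(first, second, third); // true false true
```

A guard like this would only mask the server-side loop, so it complements rather than replaces a fix in queueRun().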

So the loop is:

  1. User starts a single-device run → job inserted, status pending.
  2. queueRun() ticks → dispatches to agent, sets dispatchTime only.
  3. Agent spawns PowerShell instance #1.
  4. 60 s later: queueRun() ticks again. Job is still pending, so
    it gets dispatched again.
  5. Agent spawns PowerShell instance #2 in parallel.
  6. The two (or more) instances race over whatever shared resources
    the script touches.
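
The loop above can be reproduced with a small in-memory model. This is a simplified sketch, not the plugin's code: jobs, spawned, agentReceive and queueRunTick are illustrative stand-ins for the database, the agent and queueRun().

```javascript
// One job, inserted with status "pending" as in step 1.
const jobs = [{ _id: 'job-1', node: 'node-A', status: 'pending', dispatchTime: null }];
const spawned = []; // stands in for PowerShell processes started by the agent

// Stand-in for the agent: every dispatch starts a new process, no de-duplication.
function agentReceive(jobId) {
  spawned.push({ jobId, instance: spawned.length + 1 });
}

// Stand-in for queueRun(): dispatches every pending job, records only dispatchTime.
function queueRunTick(nowSec) {
  for (const job of jobs.filter((j) => j.status === 'pending')) {
    agentReceive(job._id);
    job.dispatchTime = nowSec; // status is left untouched, so the job stays pending
  }
}

// Three ticks, 60 s apart, while the script is assumed to be still running.
for (let tick = 0; tick < 3; tick++) queueRunTick(tick * 60);

console.log(spawned.length); // 3
console.log(jobs[0].status); // pending
```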

Reproduction

  1. Pick any PowerShell script whose runtime exceeds 60 s. Have it
    write a timestamped logfile per launch so concurrent launches are
    visible.
  2. Configure the script for a single device.
  3. Start the run via "Run on device" / direct execute, not via
    Batch Deploy.
  4. Observe: a new logfile appears roughly every 60–120 s for as long
    as the script keeps running. Multiple powershell.exe processes
    appear in Task Manager on the target node.

Suggested fix

In queueRun(), mark the job as dispatched at dispatch time so the
next tick skips it, e.g.:

obj.db.update(job._id, {
  dispatchTime: dispatchTime,
  status: 'dispatched'
});

…and adjust getPendingJobs() to filter out status === 'dispatched',
plus add a stale-dispatch sweep that re-pends a job if dispatchTime
is older than some threshold (mirrors the batch path's
nodeTimeoutSec semantics).
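
A minimal sketch of that combination, using an in-memory array in place of the database. STALE_DISPATCH_SEC and sweepStaleDispatches are hypothetical names, the 300 s threshold is an arbitrary assumption, and getPendingJobs here is a stand-in rather than the plugin's actual function.

```javascript
// Assumed threshold; the right value depends on expected script runtimes.
const STALE_DISPATCH_SEC = 300;

const jobs = [
  { _id: 'a', status: 'pending', dispatchTime: null },
  { _id: 'b', status: 'dispatched', dispatchTime: 1000 }, // recently dispatched
  { _id: 'c', status: 'dispatched', dispatchTime: 100 },  // dispatched long ago
];

// Re-pend jobs whose dispatch is older than the threshold.
function sweepStaleDispatches(allJobs, nowSec) {
  for (const job of allJobs) {
    if (job.status === 'dispatched' && nowSec - job.dispatchTime > STALE_DISPATCH_SEC) {
      job.status = 'pending';
    }
  }
}

// Only jobs still marked pending are eligible for dispatch.
function getPendingJobs(allJobs) {
  return allJobs.filter((job) => job.status === 'pending');
}

function queueRunTick(allJobs, nowSec, send) {
  sweepStaleDispatches(allJobs, nowSec);
  for (const job of getPendingJobs(allJobs)) {
    send(job);
    job.dispatchTime = nowSec;
    job.status = 'dispatched'; // the next tick skips this job
  }
}

const sent = [];
queueRunTick(jobs, 1060, (job) => sent.push(job._id)); // sends a, and c after re-pend
queueRunTick(jobs, 1120, (job) => sent.push(job._id)); // sends nothing
console.log(sent); // [ 'a', 'c' ]
```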

A simpler patch that does not touch the DB schema is to keep an
in-memory dispatchedJobIds Set, skip any job already in it on later
queueRun() ticks, and clear the entry in finalizeJob(). This is less
robust across restarts, but it is a small change.
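
As a rough sketch of the in-memory variant: queueRunTick and onJobFinished below are hypothetical stand-ins for queueRun() and for whatever server-side code path learns that a job has finished.

```javascript
// Job _ids that have been sent to an agent and have not reported back yet.
const dispatchedJobIds = new Set();

function queueRunTick(pendingJobs, send) {
  for (const job of pendingJobs) {
    if (dispatchedJobIds.has(job._id)) continue; // already sent, awaiting result
    send(job);
    dispatchedJobIds.add(job._id);
  }
}

// Assumed hook: call this when the job's result arrives on the server.
function onJobFinished(jobId) {
  dispatchedJobIds.delete(jobId);
}

const pending = [{ _id: 'job-1' }]; // status never changes in this variant
const sent = [];
queueRunTick(pending, (job) => sent.push(job._id)); // dispatches job-1
queueRunTick(pending, (job) => sent.push(job._id)); // skipped
console.log(sent.length); // 1

onJobFinished('job-1');
console.log(dispatchedJobIds.size); // 0
```

Because the Set lives only in process memory, a server restart while a script is running would forget the entry and dispatch the job again.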

Workaround

Always start runs via Batch Deploy, even for a single device. The
batch path uses a run object with per-node status tracking and does
not re-dispatch. The Batch Deploy dialog exposes the relevant
timeouts (nodeTimeoutSec, batchTimeoutSec).

Environment

  • Plugin: InnovoDeveloper/MeshCentral-ScriptTask, branch master.
  • MeshCentral: stock build with this plugin loaded.
  • Agent OS: Windows 11.
  • Script type: long-running PowerShell, sequential, runtime in the
    60–90 s range.

Side note

The README says

Agent heartbeat — devices send heartbeats every 30s while scripts
run, preventing false timeouts

I could not find a corresponding implementation in
modules_meshcore/scripttask.js. If the heartbeat lives in a sibling
file (modules_meshcore/innovoscripttask.js), a README pointer would
help — and the heartbeat would arguably also be the right place to
prevent the re-dispatch loop described above.
