run each agent in its own subprocess #4381

rhysh · 2025-12-16T20:10:51Z

We'll land this as three separate commits. But here's the whole stack for CI and review.

datadog-official · 2025-12-16T20:14:43Z

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: b7a5e5a

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-12-16T20:16:03Z

packages/mettagrid/python/src/mettagrid/policy/run_one_agent.py

+    policy_class = load_symbol(resolve_policy_class_path(policy_spec.class_path))
+    policy = policy_class(policy_env_info, **(policy_spec.init_kwargs or {}))  # type: ignore[call-arg]
+
+    agent = policy.agent_policy(agent_id=agent_id)


Load policy data in subprocess execution path

When run_agents_in_subprocesses is enabled (the new default), policies are launched through run_one_agent.py, but the subprocess path instantiates the policy and immediately grabs agent_policy without invoking policy.load_policy_data even if PolicySpec.data_path is provided (lines 61‑64). The in-process path used to load the checkpoint right after construction, so any policy that depends on saved weights will now run with default/uninitialized parameters whenever subprocess mode is used, yielding incorrect actions for checkpointed evaluations.
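A minimal sketch of the ordering this comment asks for: construct the policy, then load checkpoint data when a data path is set, and only then hand out the agent policy. `SpecSketch` and `PolicySketch` are hypothetical stand-ins, not the real `PolicySpec` or policy classes; only the names `data_path`, `init_kwargs`, and `load_policy_data` come from the review text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SpecSketch:
    """Hypothetical stand-in for PolicySpec (only the fields the comment mentions)."""
    class_path: str
    data_path: Optional[str] = None
    init_kwargs: Dict[str, Any] = field(default_factory=dict)


class PolicySketch:
    """Hypothetical stand-in for a policy that can load saved weights."""

    def __init__(self, **kwargs: Any) -> None:
        self.loaded_from: Optional[str] = None

    def load_policy_data(self, path: str) -> None:
        # A real policy would read a checkpoint here; the sketch only records the path.
        self.loaded_from = path


def build_policy(spec: SpecSketch) -> PolicySketch:
    policy = PolicySketch(**spec.init_kwargs)
    # Load the checkpoint right after construction, before any agent policy is requested.
    if spec.data_path:
        policy.load_policy_data(spec.data_path)
    return policy
```

With this ordering, a spec that carries a `data_path` never runs with default parameters, and a spec without one is unaffected.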

graphite-app · 2025-12-16T22:44:10Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+        action = self._process.stdout.readline().strip()
+        self._step += 1
+        return Action(name=action)


No error handling if subprocess has crashed or closed stdout. readline() will return empty string if the subprocess is dead, resulting in Action(name="") being returned instead of detecting the failure.

Fix:

    action = self._process.stdout.readline().strip()
    if not action:
        raise RuntimeError(f"Subprocess for agent {self._agent_id} crashed or closed connection")
    self._step += 1
    return Action(name=action)

Spotted by Graphite Agent

What would we even do in response to a failure? We'd replace the actions with "noop", and maybe report it somewhere? There's no place to report. And invalid commands will be replaced with noop anyway.
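To make the behaviour under discussion concrete, here is a small sketch of how `readline()` reports a dead child on a text-mode pipe: it returns the empty string at end of file. The helper `read_action` and the action name `move_north` are hypothetical; checking for `""` before calling `strip()` is what separates end of file from a blank line.

```python
import subprocess
import sys
from typing import IO, List, Optional


def read_action(stream: IO[str]) -> Optional[str]:
    """Return the next action name, or None once the writer has gone away."""
    line = stream.readline()
    if line == "":
        # Empty string before stripping means EOF, not a blank line.
        return None
    return line.strip()


def demo() -> List[Optional[str]]:
    # The child prints one action and exits, which closes its stdout.
    child_code = "print('move_north', flush=True)"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    first = read_action(proc.stdout)
    second = read_action(proc.stdout)
    proc.stdout.close()
    proc.wait()
    return [first, second]
```

Whether the caller turns `None` into a noop or an exception is the open design question in this thread; the sketch only shows that the two cases can be told apart.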

graphite-app · 2025-12-16T22:44:12Z

packages/mettagrid/python/src/mettagrid/policy/run_one_agent.py

+        parts = observation_line.split(" ")
+        raw_tokens = parts[2:]  # Skip agent ID and step number


No validation of observation line format. If the line has fewer than 2 space-separated parts, parts[2:] will return an empty list silently. The agent_id and step number from the observation line are extracted but never validated against expected values.

Fix:

    parts = observation_line.split(" ")
    if len(parts) < 2:
        raise RuntimeError(f"Malformed observation line: {observation_line}")
    obs_agent_id = int(parts[0], 16)
    obs_step = int(parts[1], 16)
    if obs_agent_id != agent_id:
        raise RuntimeError(f"Agent ID mismatch: expected {agent_id}, got {obs_agent_id}")
    raw_tokens = parts[2:]

Spotted by Graphite Agent

The line-based protocol is a dirty hack. That change won't make it meaningfully less of a hack.
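For readers who have not seen the format, a standalone sketch of the line layout being discussed: a hex agent ID, a hex step number, then the raw tokens, all separated by single spaces. The function name and the token strings are hypothetical, and the hex encoding is taken from the suggestion above rather than confirmed against the worker.

```python
from typing import List, Tuple


def parse_observation_line(line: str, expected_agent_id: int) -> Tuple[int, List[str]]:
    """Split one observation line into (step, raw_tokens), validating the header."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) < 2:
        raise ValueError(f"Malformed observation line: {line!r}")
    agent_id = int(parts[0], 16)  # assumed hex, as in the suggestion
    step = int(parts[1], 16)      # assumed hex, as in the suggestion
    if agent_id != expected_agent_id:
        raise ValueError(f"Agent ID mismatch: expected {expected_agent_id}, got {agent_id}")
    return step, parts[2:]
```

For example, `parse_observation_line("a 1f tok1 tok2\n", 10)` yields step 31 and the two tokens.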

graphite-app · 2025-12-19T19:18:37Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+            worker = os.path.join(os.path.dirname(__file__), "run_one_agent.py")
+            self._socket, child = socket.socketpair()
+            child.set_inheritable(True)
+            self._file = self._socket.makefile(mode="rw")


Using mode="rw" with socket.makefile() is invalid. Python's socket.makefile() doesn't support "rw" mode - it only accepts "r", "w", "rb", "wb" and similar. This will raise a ValueError at runtime. For bidirectional communication, either use separate file objects for reading and writing, or use "r+b" for binary mode (though text mode doesn't support "+").

    # Fix: use separate file objects, or handle buffering carefully
    self._read_file = self._socket.makefile(mode="r")
    self._write_file = self._socket.makefile(mode="w")
    # Then use _read_file for readline() and _write_file for write()

Spotted by Graphite Agent

No ValueError is raised. Passing "rw" works in practice.
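This can be checked in isolation. The sketch below opens both ends of a `socket.socketpair()` with `makefile(mode="rw")` and sends a line each way; CPython accepts any combination of `r`, `w`, and `b` here. The function name `roundtrip` is made up for the example, and the explicit `flush()` calls are needed because the text wrapper buffers writes.

```python
import socket


def roundtrip(message: str) -> str:
    """Send a line one way, echo it back upper-cased, using "rw" file wrappers."""
    left, right = socket.socketpair()
    left_file = left.makefile(mode="rw")
    right_file = right.makefile(mode="rw")
    try:
        left_file.write(message + "\n")
        left_file.flush()  # buffered text I/O: nothing is sent until flushed
        received = right_file.readline().strip()
        right_file.write(received.upper() + "\n")
        right_file.flush()
        return left_file.readline().strip()
    finally:
        left_file.close()
        right_file.close()
        left.close()
        right.close()
```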

graphite-app · 2025-12-19T19:27:27Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+    def _destroy(self):
+        if self._process is not None:
+            self._process.kill()
+        if self._socket is not None:
+            self._socket.close()


Resource leak: self._file is not closed before closing the socket. The file wrapper should be closed first to ensure buffers are flushed and the file descriptor is properly released.

    def _destroy(self):
        if self._file is not None:
            self._file.close()
        if self._process is not None:
            self._process.kill()
        if self._socket is not None:
            self._socket.close()

Spotted by Graphite Agent

The docs say that closing the wrapper won't close the underlying fd. It seems that the fd is the only thing that really needs to be closed.
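A short sketch of the documented behaviour referred to here: closing the object returned by `makefile()` leaves the socket itself open until `socket.close()` is called. The function name is hypothetical.

```python
import socket


def wrapper_close_leaves_socket_usable() -> bool:
    a, b = socket.socketpair()
    try:
        wrapper = a.makefile(mode="rw")
        wrapper.close()
        # The socket still has a valid descriptor and can still send data.
        still_open = a.fileno() != -1
        a.sendall(b"ping\n")
        return still_open and b.recv(5) == b"ping\n"
    finally:
        a.close()
        b.close()
```

Closing the wrapper first still flushes any buffered writes, which matters only if unflushed data could be pending at teardown.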

graphite-app · 2025-12-19T19:27:29Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+            ready = self._file.readline().strip()
+            if ready != "READY":
+                self._destroy()
+                raise RuntimeError("Failed to start agent subprocess")


The readline() call could hang indefinitely if the subprocess crashes during initialization but doesn't close the socket. Should add a timeout or check if the subprocess is still alive. Additionally, if readline() returns empty string (subprocess died), the error message won't capture any stderr output from the subprocess that could explain what went wrong.

    ready = self._file.readline().strip()
    if ready != "READY":
        # Check if process died and try to get error info
        if self._process.poll() is not None:
            raise RuntimeError(f"Agent subprocess died during initialization (exit code: {self._process.returncode})")
        self._destroy()
        raise RuntimeError("Failed to start agent subprocess")

Spotted by Graphite Agent

> if the subprocess crashes during initialization but doesn't close the socket

The parent closes its copy of the fd after passing it to the child. The child has the last remaining copy of the fd, so the OS will close it if the child dies. And with no mechanism for bubbling errors upwards anyway, I'm not sure it's worthwhile to collect more detail at this level.
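The fd-ownership argument can be sketched on a POSIX system as follows. The parent hands one end of a socketpair to a child via `pass_fds`, closes its own copy, and then reads; the child here exits immediately without writing anything, standing in for a crash during initialization. The function name and the child command are assumptions made for the example.

```python
import socket
import subprocess
import sys


def read_ready_from_crashing_child() -> str:
    parent_sock, child_sock = socket.socketpair()
    child_sock.set_inheritable(True)
    proc = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        pass_fds=(child_sock.fileno(),),  # POSIX only
    )
    # Drop the parent's copy so the child holds the last reference to that end.
    child_sock.close()
    reader = parent_sock.makefile(mode="rw")
    try:
        # Returns "" once the OS closes the child's end on exit; it does not hang.
        return reader.readline()
    finally:
        reader.close()
        parent_sock.close()
        proc.wait()
```

An empty string is not `"READY"`, so the existing check fails the startup in this case. A child that stays alive but never writes would still block the read, which is a separate situation from the one described.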

nishu-builder · 2025-12-19T22:55:58Z

packages/mettagrid/python/src/mettagrid/policy/policy.py


    init_kwargs: dict[str, Any] = Field(default_factory=dict)

+    python_path: list[str] = Field(default=[], description="Optional PYTHONPATH entries to add when loading the policy")


I like this change

Not a bad time to change the docstring to say something like "Specification for a locally initializable policy"

nishu-builder · 2025-12-19T23:02:45Z

packages/mettagrid/python/src/mettagrid/policy/loader.py

+        policy_spec.init_kwargs = kwargs
+
+    if sandbox:
+        policy = PipedPolicyWrapper(policy_spec, policy_env_info, **kwargs)  # type: ignore[call-arg]


I wonder about having callers of initialize_or_load_policy(..., sandbox=True) instead pass a PolicySpec(policy_class=path.to.PipedPolicyWrapper). The upside is keeping initialize_or_load_policy simple, which may not be worth it; in the current format, we'd need them to specify the substantive policy spec as init_kwargs to the PipedPolicyWrapper, which seems awful

My guess is that what you have written is the better option, and unless my rambling comment makes you feel otherwise, ignore it

nishu-builder · 2025-12-19T23:03:12Z

packages/mettagrid/python/src/mettagrid/policy/loader.py

    """

-    policy_class = load_symbol(resolve_policy_class_path(policy_spec.class_path))
+    policy_spec = policy_spec.model_copy(deep=True)


thanks for this

nishu-builder · 2025-12-19T23:05:00Z

packages/mettagrid/python/src/mettagrid/policy/loader.py

+    if sandbox:
+        policy = PipedPolicyWrapper(policy_spec, policy_env_info, **kwargs)  # type: ignore[call-arg]
+    else:
+        for path in reversed(policy_spec.python_path):


I think moving this to be within this function, instead of its caller, makes sense; thanks for the change

nishu-builder · 2025-12-19T23:06:36Z

packages/mettagrid/python/src/mettagrid/policy/loader.py

        kwargs["device"] = device_override
+        policy_spec.init_kwargs = kwargs
+
+    if sandbox:


I wonder if we should call this subprocess_piped or piped or something instead of sandbox.

nishu-builder · 2025-12-19T23:07:41Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+            # TODO(rhys): we'll want to hear about the end of the simulation so we know when to kill
+            self._destroy()  # This is still a single-use wrapper, don't reset
+        else:
+            worker = os.path.join(os.path.dirname(__file__), "run_one_agent.py")


I don't feel strongly, but run_one_agent.py could define SELF_PATH = __file__ and this could import it, so that if it moves this impl doesn't break

nishu-builder · 2025-12-19T23:10:18Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+        # That's our cue to create the subprocess.
+        if self._process is not None:
+            # TODO(rhys): we'll want to hear about the end of the simulation so we know when to kill
+            self._destroy()  # This is still a single-use wrapper, don't reset


if we want to support reset getting called more than once (after init), then maybe _destroy should set self._process etc. to None. If we don't, then maybe this if suite should raise an error instead of calling ._destroy()
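A sketch of the first option, assuming repeated resets should be supported: the teardown clears its handles, so it is safe to call twice and a later reset starts from a clean state. `ResettableWorker`, its method names, and the sleeping child command are hypothetical and not the wrapper from the PR.

```python
import subprocess
import sys
from typing import Optional


class ResettableWorker:
    """Hypothetical wrapper whose teardown leaves it ready for another reset."""

    def __init__(self) -> None:
        self._process: Optional[subprocess.Popen] = None

    def destroy(self) -> None:
        if self._process is not None:
            self._process.kill()
            self._process.wait()   # reap the child so it does not linger as a zombie
            self._process = None   # clearing the handle makes destroy() idempotent

    def reset(self) -> None:
        self.destroy()
        # Stand-in child that just sleeps; a real worker would run the agent script.
        self._process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )

    @property
    def running(self) -> bool:
        return self._process is not None and self._process.poll() is None
```

The second option would replace the `destroy()` call in `reset()` with a `RuntimeError` when `_process` is already set, which makes the single-use contract explicit.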

nishu-builder · 2025-12-19T23:13:34Z

packages/mettagrid/python/src/mettagrid/policy/policy.py

+
+    def step(self, obs: AgentObservation) -> Action:
+        if self._process is None or self._file is None:
+            return Action(name="noop")


should this raise an error instead?

nishu-builder · 2025-12-19T23:20:28Z

packages/mettagrid/python/src/mettagrid/policy/run_one_agent.py

+    if len(setup_lines) < 3:
+        raise RuntimeError("Insufficient setup lines received")
+
+    agent_id = int(setup_lines[0], 16)


I have a slight preference for the setup step sending one line as json with agent_id, policy_spec, and policy_env_info

then this side doesn't have to wait until READY; it just responds READY after the first line. and we remove one instance of str-to-hex and hex-str-to-int
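A sketch of what a one-line JSON setup message might look like. The three keys come from the comment above; the payload contents, function names, and validation are assumptions made for illustration. `json.dumps` without `indent` never emits a raw newline, so the message stays on a single line.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"agent_id", "policy_spec", "policy_env_info"}


def encode_setup(agent_id: int, policy_spec: Dict[str, Any], policy_env_info: Dict[str, Any]) -> str:
    """Serialize the whole setup step as one newline-terminated JSON line."""
    message = {
        "agent_id": agent_id,
        "policy_spec": policy_spec,
        "policy_env_info": policy_env_info,
    }
    return json.dumps(message) + "\n"


def decode_setup(line: str) -> Dict[str, Any]:
    """Parse the setup line on the worker side and check that all parts arrived."""
    message = json.loads(line)
    missing = REQUIRED_KEYS - message.keys()
    if missing:
        raise ValueError(f"Setup message missing keys: {sorted(missing)}")
    return message
```

The worker would read exactly one line, call `decode_setup`, and answer `READY`, with the agent ID carried as a plain JSON integer instead of a hex string.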

nishu-builder · 2025-12-19T23:21:05Z

packages/mettagrid/python/src/mettagrid/policy/run_one_agent.py

+    agent_id = int(setup_lines[0], 16)
+    policy_spec = PolicySpec.model_validate_json(setup_lines[1])
+
+    for path in reversed(policy_spec.python_path):


should this just call initialize_or_load_policy?

nishu-builder · 2025-12-19T23:22:02Z

packages/mettagrid/python/src/mettagrid/policy/run_one_agent.py

+
+    agent = policy.agent_policy(agent_id=agent_id)
+
+    assembler_protocols: list[ProtocolConfig] = []


i dont totally follow these lines -- can we change PolicyEnvInfo's serialize/deserialize to handle this itself?

github-actions · 2025-12-30T00:05:41Z

This PR has been marked as stale due to 10 days of inactivity.

nishu-builder

requesting changes to get this out of my graphite inbox; i figure you've got a round of merge conflict addressing + testing before it's ready for review, but let me know if not and if it's ready now

github-actions bot assigned rhysh Dec 16, 2025

chatgpt-codex-connector bot reviewed Dec 16, 2025

rhysh force-pushed the rhys/20251209-agent-subprocess-5 branch 3 times, most recently from a721a8d to e32ad14 Compare December 16, 2025 22:40

graphite-app bot reviewed Dec 16, 2025

rhysh force-pushed the rhys/20251209-agent-subprocess-5 branch 2 times, most recently from 85242c4 to 3c1e175 Compare December 19, 2025 19:14

graphite-app bot reviewed Dec 19, 2025

rhysh force-pushed the rhys/20251209-agent-subprocess-5 branch from 3c1e175 to c1117d1 Compare December 19, 2025 19:21

graphite-app bot reviewed Dec 19, 2025

rhysh force-pushed the rhys/20251209-agent-subprocess-5 branch from c1117d1 to 31768c5 Compare December 19, 2025 19:44

rhysh assigned nishu-builder Dec 19, 2025

rhysh added 4 commits December 19, 2025 13:00

delay sys.path manipulation when loading polcies

0911bc9

prepare to run agents in single-use subprocesses

e6e9e56

isolate policies for recipes.experiment.v0_leaderboard.evaluate

682fd44

move to socketpair for communicating with agent processes

b7a5e5a

keep the funky line-based protocol for now

rhysh force-pushed the rhys/20251209-agent-subprocess-5 branch from 31768c5 to b7a5e5a Compare December 19, 2025 21:00

nishu-builder reviewed Dec 19, 2025

github-actions bot added the stale 🥖 A bot has noticed this PR growing old. label Dec 30, 2025

rhysh mentioned this pull request Jan 3, 2026

Set up basics of protobuf for the monorepo #4586

Merged

nishu-builder requested changes Jan 6, 2026

github-actions bot removed the stale 🥖 A bot has noticed this PR growing old. label Jan 9, 2026
