# GYLLM env API demo (text-in / text-out)

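This notebook shows how to:
- Use a text-based env directly (you maintain the message history)
- Use a two-agent env with a separate history per agent
- Use `TokenizedEnv` to convert text requests into token requests
- Batch several env copies with `vectorize`
- Run the same env out of process with `subprocess_env`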

Prerequisite: `gyllm` must be importable in your notebook kernel (e.g. after `pip install -e .`).

In [1]:
# Sanity check: fails immediately if gyllm is not importable in this kernel.
import gyllm  # noqa: F401

In [None]:
from gyllm.envs.simple.iterated_games import IpdEnv, TftIpdEnv
from gyllm.batch import vectorize
from gyllm.rpc import subprocess_env
from gyllm.envs.tokenization import TokenizedEnv
from gyllm.core import ActorId, actor_agent, make_actor_id


## 1) Single-agent env (manual history)

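`env.reset()` and `env.step(...)` return a list of `Request` objects, which the code below reads like dicts. Each `Request` has:
- `actor`: `(env_id, agent_id)`
- `reward`: reward from the last transition
- `system_message`: the system prompt, read from the reset request below to seed the history
- `message`: a single environment "user" message to append to your history
- `needs_action`: whether the agent should respond next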

Note: action formatting is env-specific. `TftIpdEnv` expects actions in `<action>...</action>` tags.
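
The tag convention above can be handled with a small helper. This is a minimal sketch and not part of `gyllm`: the names `extract_action` and `format_action` are hypothetical, and taking the last tag pair when a completion contains several is an assumption about what a caller would want, not documented env behavior.

```python
import re
from typing import Optional

# Hypothetical helpers (not gyllm API) for the <action>...</action> convention.
_ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def extract_action(completion: str) -> Optional[str]:
    """Return the text inside the last <action>...</action> pair, or None."""
    matches = _ACTION_RE.findall(completion)
    # Assumption: if the model emits several tags, the last one is its final answer.
    return matches[-1].strip() if matches else None


def format_action(action: str) -> str:
    """Wrap a bare action label in the tags the env expects."""
    return f"<action>{action}</action>"


print(extract_action("I will cooperate. <action>A</action>"))  # A
print(format_action("B"))  # <action>B</action>
```

A policy that produces free-form text could run its output through `extract_action` first and fall back to a default when it returns `None`.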

In [None]:
env = TftIpdEnv(num_turns=3)
player = make_actor_id("player")

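requests = env.reset()

# Single-agent env, so only the first request is used.
req = requests[0]
# Seed the history with the system prompt, then the first env message.
history: list[dict[str, str]] = [req["system_message"], req["message"]]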

print("actor:", req["actor"])
print("needs_action:", req["needs_action"])
print("last message:", history[-1])

In [None]:
def simple_agent_policy(messages: list[dict[str, str]]) -> str:
    # For demo purposes: always cooperate.
    return "<action>A</action>"


total_reward = 0.0

while True:
    action_text = simple_agent_policy(history)
    actions = {player: action_text}

    history.append({"role": "assistant", "content": action_text})

    requests = env.step(actions)
    if not requests:
        break

    req = requests[0]
    total_reward += req["reward"]

    history.append(req["message"])

    print("reward:", req["reward"], "| needs_action:", req["needs_action"])
    print("env says:", history[-1]["content"])

    if not req["needs_action"]:
        break

print("total_reward:", total_reward)

## 2) Two-agent env (separate histories per agent)

`IpdEnv` returns one request per agent each step.
Each agent gets its own private message stream; the only thing the env shares between agents is their actions.

In [None]:
env2 = IpdEnv(num_turns=2)

a = make_actor_id("player_a")
b = make_actor_id("player_b")

histories: dict[ActorId, list[dict[str, str]]] = {
    a: [],
    b: [],
}
requests = env2.reset()
for req in requests:
    histories[req["actor"]] = [req["system_message"], req["message"]]


def policy_a(messages: list[dict[str, str]]) -> str:
    return "<action>A</action>"


def policy_b(messages: list[dict[str, str]]) -> str:
    return "<action>B</action>"


totals = {"player_a": 0.0, "player_b": 0.0}

while True:
    actions = {a: policy_a(histories[a]), b: policy_b(histories[b])}

    histories[a].append({"role": "assistant", "content": actions[a]})
    histories[b].append({"role": "assistant", "content": actions[b]})

    requests = env2.step(actions)
    if not requests:
        break

    for req in requests:
        agent_name = actor_agent(req["actor"])
        totals[agent_name] += req["reward"]
        histories[req["actor"]].append(req["message"])
        print(req["actor"], "reward:", req["reward"], "| last:", histories[req["actor"]][-1]["content"])

    if all(not r["needs_action"] for r in requests):
        break

print("totals:", totals)

## 3) Tokenization on top (`TokenizedEnv`)

`TokenizedEnv` wraps a text env and:
- maintains per-agent histories internally
- appends agent completions as `"assistant"` messages by default
- returns `TokenRequest` objects with `prompt: list[int]`
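
The bookkeeping described above can be illustrated with a toy wrapper. This is a sketch of the idea only, not the `TokenizedEnv` implementation: `ToyEchoEnv`, `ToyTokenizingWrapper`, and `count_chars` are hypothetical names, and the request fields are modeled on the ones used elsewhere in this notebook.

```python
from typing import Callable

Message = dict[str, str]


class ToyEchoEnv:
    """Hypothetical stand-in text env: one agent, one step."""

    def reset(self) -> list[dict]:
        return [{
            "actor": ("env0", "player"),
            "reward": 0.0,
            "system_message": {"role": "system", "content": "Reply with an action."},
            "message": {"role": "user", "content": "Round 1"},
            "needs_action": True,
        }]

    def step(self, actions: dict) -> list[dict]:
        actor, text = next(iter(actions.items()))
        return [{
            "actor": actor,
            "reward": 1.0,
            "message": {"role": "user", "content": f"You played {text}"},
            "needs_action": False,
        }]


class ToyTokenizingWrapper:
    """Keeps one history per actor and attaches a tokenized prompt."""

    def __init__(self, env, tokenize: Callable[[list[Message]], list[int]]):
        self.env = env
        self.tokenize = tokenize
        self.histories: dict = {}

    def _to_token_request(self, req: dict) -> dict:
        history = self.histories.setdefault(req["actor"], [])
        if "system_message" in req:
            history.append(req["system_message"])
        history.append(req["message"])
        # Copy the history so later appends do not mutate earlier requests.
        return {**req, "messages": list(history), "prompt": self.tokenize(history)}

    def reset(self) -> list[dict]:
        self.histories.clear()
        return [self._to_token_request(r) for r in self.env.reset()]

    def step(self, actions: dict) -> list[dict]:
        # Record each completion as an "assistant" message before stepping.
        for actor, completion in actions.items():
            self.histories[actor].append({"role": "assistant", "content": completion})
        return [self._to_token_request(r) for r in self.env.step(actions)]


def count_chars(messages: list[Message]) -> list[int]:
    # Toy tokenizer: a single "token" holding the total content length.
    return [sum(len(m["content"]) for m in messages)]


toy = ToyTokenizingWrapper(ToyEchoEnv(), count_chars)
first = toy.reset()[0]
second = toy.step({first["actor"]: "<action>A</action>"})[0]
print(len(first["messages"]), len(second["messages"]))  # 2 4
```

The history grows from two messages (system, user) to four (plus assistant, user) after one step, which is the same growth the `messages_in_prompt` printout in the cell below is meant to expose.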

In [None]:
def dummy_tokenize(messages: list[dict[str, str]], **kwargs) -> list[int]:
    # Demo tokenizer: returns one "token" equal to total character count.
    text = "".join(f"{m['role']}:{m['content']}\n" for m in messages)
    return [len(text)]


wrapped = TokenizedEnv(TftIpdEnv(num_turns=2), tokenize=dummy_tokenize)

token_reqs = wrapped.reset()
print("TokenRequest keys:", sorted(token_reqs[0].keys()))
print("prompt:", token_reqs[0]["prompt"], "| messages_in_prompt:", len(token_reqs[0]["messages"]))

token_reqs = wrapped.step({token_reqs[0]["actor"]: "<action>A</action>"})
print("step reward:", token_reqs[0]["reward"])
print("prompt:", token_reqs[0]["prompt"], "| messages_in_prompt:", len(token_reqs[0]["messages"]))

## 4) Vectorization (multi-world batching)

Vectorization is just “more actors”: each env copy gets its own `env_id`.
This uses the same API as multi-agent envs.
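
Because every request carries its actor, per-world results can be recovered by grouping on `env_id`. The sketch below assumes an actor unpacks as an `(env_id, agent_id)` pair, as described in section 1. `group_by_env` is a hypothetical helper, and the sample requests are hand-written rather than produced by `gyllm`.

```python
from collections import defaultdict


def group_by_env(requests: list[dict]) -> dict:
    """Group requests by env_id, assuming actor == (env_id, agent_id)."""
    grouped = defaultdict(list)
    for req in requests:
        env_id, _agent_id = req["actor"]
        grouped[env_id].append(req)
    return dict(grouped)


# Hand-written sample requests; the ids and rewards are illustrative only.
sample = [
    {"actor": ("env_0", "player"), "reward": 3.0},
    {"actor": ("env_1", "player"), "reward": 1.0},
    {"actor": ("env_0", "opponent"), "reward": 0.0},
]

per_env_reward = {
    env_id: sum(r["reward"] for r in reqs)
    for env_id, reqs in group_by_env(sample).items()
}
print(per_env_reward)  # {'env_0': 3.0, 'env_1': 1.0}
```

With real requests, the `actor_agent` helper imported at the top of this notebook is the safer way to read the agent part of an actor id.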

In [None]:
venv = vectorize(lambda: TftIpdEnv(num_turns=1), num_envs=4)
reqs = venv.reset()
print("actors:", [r["actor"] for r in reqs])

actions = {r["actor"]: "<action>A</action>" for r in reqs}
reqs2 = venv.step(actions)
print("step rewards:", {r["actor"]: r["reward"] for r in reqs2})

## 5) Dual-mode hosting (in-memory vs out-of-process)

The same env can be used:
- in-process: `env = TftIpdEnv(...)`
- out-of-process: `env = subprocess_env(...)`
- in Docker: `env = docker_env(image=..., env=..., env_kwargs=...)`, which exposes the same client API. The image must have `gyllm` installed (e.g. `uv pip install gyllm`). `docker_env` is not imported or run in this notebook.

Out-of-process is useful for isolation and for matching the “env in a container” deployment style.

In [None]:
remote = subprocess_env(
    env="gyllm.envs.simple.iterated_games:TftIpdEnv",
    env_kwargs={"num_turns": 1},
)
try:
    print("remote actors:", remote.actors)
    reqs = remote.reset()
    reqs2 = remote.step({reqs[0]["actor"]: "<action>A</action>"})
    print("remote reward:", reqs2[0]["reward"])
finally:
    # Always shut down the subprocess, even if a call above raises.
    remote.close()
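
The `env` argument passed to `subprocess_env` above is a `module:Class` string, which lets the worker process import the env class itself instead of receiving a pickled object. How `gyllm` resolves that string is not shown in this notebook. The sketch below is one common way to do it with `importlib`, using the hypothetical helpers `resolve_env_spec` and `build_from_spec` and a standard-library class in place of an env.

```python
import importlib
from typing import Any, Optional


def resolve_env_spec(spec: str) -> Any:
    """Import and return the attribute named by a 'module:Class' string."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def build_from_spec(spec: str, kwargs: Optional[dict] = None) -> Any:
    """Instantiate the resolved class with keyword arguments."""
    return resolve_env_spec(spec)(**(kwargs or {}))


# Demonstrated with a standard-library class, since this is only a sketch.
half = build_from_spec("fractions:Fraction", {"numerator": 1, "denominator": 2})
print(half)  # 1/2
```

Passing a string plus a plain `env_kwargs` dict keeps everything that crosses the process boundary trivially serializable, which is presumably why the out-of-process constructors take this shape.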