# Environment Development ("Real State")

Here we'll show how we build an environment where at each state, we only see:  
- the system prompt
- the user's initial message (task or question prompt)
- the assistant's prior messages (thoughts + actions, or just actions)
- the last observation (i.e., what someone currently sees) 

Unlike popular "chat" or dialogue settings, where the state is the entire conversation history, this gives a more realistic setting: a human sees only one observation at a time (i.e., a new screen). To make progress on the task, however, they still retain prior context and can recall their earlier thoughts.
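As a rough illustration of the state described above, the sketch below filters a full message history down to that view. It is a minimal sketch, not the environment's actual implementation: `real_state_view` is a hypothetical helper, it assumes `history[0]` is the initial user message and that observations arrive as `user` or `tool` messages, and it drops earlier observations outright, whereas `maybe_hide_observations` later in this notebook replaces their content with a placeholder.

```python
from typing import Any

Message = dict[str, Any]


def real_state_view(system_prompt: str, history: list[Message]) -> list[Message]:
    """Keep the system prompt, the initial user message, every assistant
    message, and only the most recent observation (hypothetical helper)."""
    system_message = {"role": "system", "content": system_prompt}
    if not history:
        return [system_message]

    # Observations are assumed to be user or tool messages
    obs_indices = [i for i, m in enumerate(history) if m["role"] in ("user", "tool")]
    last_obs_idx = obs_indices[-1] if obs_indices else -1

    kept = [
        m
        for i, m in enumerate(history)
        if i == 0 or m["role"] == "assistant" or i == last_obs_idx
    ]
    return [system_message] + kept


history = [
    {"role": "user", "content": "task prompt"},
    {"role": "assistant", "content": "thought 1 + action 1"},
    {"role": "tool", "content": "obs 1"},
    {"role": "assistant", "content": "thought 2 + action 2"},
    {"role": "tool", "content": "obs 2"},
]
view = real_state_view("You are a helpful assistant.", history)
print([m["role"] for m in view])
```

Only the first observation (`obs 1`) is dropped here; the task prompt, both assistant turns, and the latest observation remain.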

We'll also have to consider a different way to load the SFT observations.

### General Setup

In [1]:
from omegaconf import OmegaConf
from rich import print as rich_print

from transformers import AutoTokenizer

# Get a tokenizer
model_config = OmegaConf.load("../configs/model/hf_qwen3_4b_inst_2507.yaml")
hf_tokenizer = AutoTokenizer.from_pretrained(**model_config.model_config)



In [50]:
hf_tokenizer.special_tokens_map

{'eos_token': '<|im_end|>',
 'pad_token': '<|endoftext|>',
 'additional_special_tokens': ['<|im_start|>',
  '<|im_end|>',
  '<|object_ref_start|>',
  '<|object_ref_end|>',
  '<|box_start|>',
  '<|box_end|>',
  '<|quad_start|>',
  '<|quad_end|>',
  '<|vision_start|>',
  '<|vision_end|>',
  '<|vision_pad|>',
  '<|image_pad|>',
  '<|video_pad|>']}

In [2]:
def rich_print_messages(
    msg_text: str,
    bos_token: str = "<|im_start|>",
    eos_token: str = "<|im_end|>\n",
    tool_call_bos_token: str = "<tool_call>",
    tool_call_eos_token: str = "</tool_call>",
    tool_response_bos_token: str = "<tool_response>",
    tool_response_eos_token: str = "</tool_response>",
):
    # Split into messages
    messages = msg_text.split(eos_token)

    system_bos = f"{bos_token}system"
    user_bos = f"{bos_token}user"
    assistant_bos = f"{bos_token}assistant"
    
    for ix, message in enumerate(messages):
        # system prompt
        if message.startswith(system_bos):
            messages[ix] = f"[bright_yellow]{message}[/bright_yellow]"
        # user messages
        elif message.startswith(user_bos):
            messages[ix] = f"[bright_red]{message}[/bright_red]"
        # assistant messages
        elif message.startswith(assistant_bos):
            messages[ix] = f"[bright_green]{message}[/bright_green]"
        
        # tool calls
        if tool_call_bos_token in messages[ix] and tool_call_eos_token in messages[ix]:
            messages[ix] = messages[ix].replace(tool_call_bos_token, f"[bright_cyan]{tool_call_bos_token}")
            messages[ix] = messages[ix].replace(tool_call_eos_token, f"{tool_call_eos_token}[/bright_cyan]")
        # tool responses
        if tool_response_bos_token in messages[ix] and tool_response_eos_token in messages[ix]:
            messages[ix] = messages[ix].replace(tool_response_bos_token, f"[bright_magenta]{tool_response_bos_token}")
            messages[ix] = messages[ix].replace(tool_response_eos_token, f"{tool_response_eos_token}[/bright_magenta]")
        
    msgs_text = eos_token.join(messages)
    try:
        rich_print(msgs_text)
    except Exception:
        # Fall back to plain printing if rich cannot render the markup
        print(msgs_text)


### Test HotpotQA

In [3]:
from omegaconf import OmegaConf

from act_prm.environments import get_env

env_config = OmegaConf.load("../configs/environments/hotpotqa_mc/fewshot2_hide_obs.yaml")
env = get_env(**env_config)

Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00,  4.62it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Map: 10628 examples [00:01, 4235.36 examples/s]        


In [4]:
state = env.reset()
system_message = {"role": "system", "content": state.system_prompt}
messages = hf_tokenizer.apply_chat_template(
    [system_message] + state.new_messages,
    tokenize=False,
    tools=state.tools,
)
rich_print_messages(messages)

### Warm-up: HotpotQA

In [40]:
from omegaconf import OmegaConf

from act_prm.environments import get_env

env_config = OmegaConf.load("../configs/environments/hotpotqa_mc/default.yaml")
env = get_env(**env_config)

Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 119.59it/s]
Map: 10628 examples [00:01, 4147.34 examples/s]        


In [44]:
env.tool_registry["visit"].get_tool_desc()

{'type': 'function',
 'name': 'visit',
 'description': 'Visit a given title and expand for more information.',
 'parameters': {'type': 'object',
  'properties': {'title': {'type': 'string',
    'description': 'The title to visit'}}},
 'required': ['title']}
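Given a description shaped like the output above, a tool call can be sanity-checked before it is executed. The following is a minimal sketch under stated assumptions: `validate_tool_call` is a hypothetical helper (not part of `act_prm`), and it reads `required` from the top level of the description, as printed above, falling back to the nested position under `parameters` that JSON Schema normally uses.

```python
def validate_tool_call(call: dict, tool_desc: dict) -> list[str]:
    """Return a list of problems with a tool call (empty list means it looks valid)."""
    errors = []
    if call.get("name") != tool_desc["name"]:
        errors.append(f"unknown tool: {call.get('name')}")

    args = call.get("arguments", {})
    params = tool_desc.get("parameters", {})
    # The printed description keeps "required" at the top level
    required = tool_desc.get("required") or params.get("required", [])
    for key in required:
        if key not in args:
            errors.append(f"missing required argument: {key}")

    known = params.get("properties", {})
    for key in args:
        if key not in known:
            errors.append(f"unknown argument: {key}")
    return errors


# Copied from the `get_tool_desc()` output above
tool_desc = {
    "type": "function",
    "name": "visit",
    "description": "Visit a given title and expand for more information.",
    "parameters": {
        "type": "object",
        "properties": {"title": {"type": "string", "description": "The title to visit"}},
    },
    "required": ["title"],
}

good_call = {"name": "visit", "arguments": {"title": "The Saimaa Gesture"}}
bad_call = {"name": "visit", "arguments": {}}
print(validate_tool_call(good_call, tool_desc))
print(validate_tool_call(bad_call, tool_desc))
```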

In [6]:
from act_prm.environments.hotpotqa_mc.prompts import FEWSHOT_PROMPTS

sample_messages = FEWSHOT_PROMPTS[0]
sample_messages

[{'role': 'user',
  'content': "## Instruction\nGiven a list of titles, think and call tools to answer this question:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\n\nYou may only visit the titles provided. Only call the `visit` tool once per turn.\n\nYour final answer should be a concise sentence, in the following format: 'Final Answer: <put your answer here>'.\n\n## Tool Calling\nYou can only search the following titles:\n\n- 'Adam (musical)'\n- 'Adam Clayton Powell (film)'\n- 'Adam Clayton Powell Jr.'\n- 'Adam Clayton Powell IV'\n- 'Seventh Avenue (Manhattan)'\n- 'Mother African Methodist Episcopal Zion Church'\n- 'Abyssinian Baptist Church'\n- 'Adam Clayton Powell Jr. State Office Building'\n- 'The Saimaa Gesture'\n- 'Adam Clayton Powell Sr.'\n\n## Instruction (again)\nNow answer the original question. Recall the question is:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\

In [7]:
sample_messages = FEWSHOT_PROMPTS[0]

system_message = {"role": "system", "content": env.system_prompt}
prompt_message, sample_messages = sample_messages[0], sample_messages[1:]

tools = [env.tool_registry["visit"].get_tool_desc()]
messages = hf_tokenizer.apply_chat_template(
    [system_message] + [prompt_message] + sample_messages,
    tokenize=False,
    tools=tools,
)
rich_print_messages(messages)

In [95]:
sample_messages[:-2]

[{'role': 'assistant',
  'content': 'I need to search Nicholas Ray and Elia Kazan, find their professions, then find the profession they have in common.\n\n<tool_call>\n{"name": "visit", "arguments": {"title": "Nicholas Ray"}}\n</tool_call>'},
 {'role': 'tool', 'content': '...'},
 {'role': 'assistant',
  'content': 'Professions of Nicholas Ray are director, screenwriter, and actor. I need to search Elia Kazan next and find his professions.\n\n<tool_call>\n{"name": "visit", "arguments": {"title": "Elia Kazan"}}\n</tool_call>'},
 {'role': 'tool', 'content': '...'}]

In [99]:
sample_messages

[{'role': 'assistant',
  'content': 'I need to search Nicholas Ray and Elia Kazan, find their professions, then find the profession they have in common.\n\n<tool_call>\n{"name": "visit", "arguments": {"title": "Nicholas Ray"}}\n</tool_call>'},
 {'role': 'tool',
  'content': 'Nicholas Ray (born Raymond Nicholas Kienzle Jr., August 7, 1911 – June 16, 1979) was an American film director, screenwriter, and actor best known for the 1955 film Rebel Without a Cause.'},
 {'role': 'assistant',
  'content': 'Professions of Nicholas Ray are director, screenwriter, and actor. I need to search Elia Kazan next and find his professions.\n\n<tool_call>\n{"name": "visit", "arguments": {"title": "Elia Kazan"}}\n</tool_call>'},
 {'role': 'tool',
  'content': 'Elia Kazan was an American film and theatre director, producer, screenwriter and actor.'}]
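Each assistant message above combines a free-text thought with a `<tool_call>` block containing JSON. The sketch below shows one way to separate the two, assuming the thought is whatever text precedes the first `<tool_call>` tag. `parse_assistant_message` is a hypothetical helper for illustration, not a function from `act_prm`.

```python
import json
import re

# Capture whatever sits between the tool-call tags, across newlines
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_assistant_message(content: str) -> tuple[str, list[dict]]:
    """Split an assistant message into its thought text and parsed tool calls."""
    thought = content.split("<tool_call>", 1)[0].strip()
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(content):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed tool calls rather than failing the whole parse
            continue
    return thought, calls


content = (
    "I need to search Nicholas Ray and Elia Kazan, find their professions, "
    "then find the profession they have in common.\n\n"
    '<tool_call>\n{"name": "visit", "arguments": {"title": "Nicholas Ray"}}\n</tool_call>'
)
thought, calls = parse_assistant_message(content)
print(thought)
print(calls)
```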

In [11]:
env.default_context + sample_messages[:1]

[{'role': 'user',
  'content': "## Instruction\nGiven a list of titles, think and call tools to answer this question:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\n\nYou may only visit the titles provided. Only call the `visit` tool once per turn.\n\nYour final answer should be a concise sentence, in the following format: 'Final Answer: <put your answer here>'.\n\n## Tool Calling\nYou can only search the following titles:\n\n- 'Adam (musical)'\n- 'Adam Clayton Powell (film)'\n- 'Adam Clayton Powell Jr.'\n- 'Adam Clayton Powell IV'\n- 'Seventh Avenue (Manhattan)'\n- 'Mother African Methodist Episcopal Zion Church'\n- 'Abyssinian Baptist Church'\n- 'Adam Clayton Powell Jr. State Office Building'\n- 'The Saimaa Gesture'\n- 'Adam Clayton Powell Sr.'\n\n## Instruction (again)\nNow answer the original question. Recall the question is:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\

In [None]:
sample_messages = FEWSHOT_PROMPTS[1]

system_message = {"role": "system", "content": env.system_prompt}
# prompt_message, sample_messages = sample_messages[0], sample_messages[1:]
# sample_messages = [m for m in sample_messages if m["role"] == "assistant"]

def maybe_hide_observations(
    messages: list[dict[str, str]],
    first_obs_to_show: int = 1,  # e.g., to keep prompt
    last_obs_to_show: int = 1,   # e.g., to keep last observation
    hidden_obs_content: str = "...",
) -> list[dict[str, str]]:
    """
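    Replace the content of intermediate observations (user and tool messages)
    with `hidden_obs_content`.

    Messages at indices below `first_obs_to_show`, and the final
    `last_obs_to_show` messages, are left unchanged.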
    """
    if len(messages) == 0:
        return messages

    num_messages = len(messages)
    return [
        {"role": message["role"], "content": hidden_obs_content}
        if (
            message["role"] in ["user", "tool"]
            and (idx >= first_obs_to_show and idx < num_messages - last_obs_to_show)
        )
        else message
        for idx, message in enumerate(messages)
    ]

tools = [env.tool_registry["visit"].get_tool_desc()]

messages = env.default_context + sample_messages[:1]
first_obs_to_show = len(messages) + 1

new_messages = maybe_hide_observations(
    [system_message] + env.default_context + sample_messages,
    first_obs_to_show=first_obs_to_show,
)

messages = hf_tokenizer.apply_chat_template(
    # maybe_hide_observations([system_message] + sample_messages, first_obs_to_show=2),
    new_messages,
    tokenize=False,
    tools=tools,
)
rich_print_messages(messages)
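To make the index arithmetic concrete, here is a small standalone example on toy messages. The function body repeats the logic of `maybe_hide_observations` from the cell above so that the snippet runs on its own; the toy messages are made up for illustration. Note that `first_obs_to_show` counts positions in the list passed in, so prepending a system message shifts the task prompt to index 1.

```python
def maybe_hide_observations(
    messages: list[dict[str, str]],
    first_obs_to_show: int = 1,
    last_obs_to_show: int = 1,
    hidden_obs_content: str = "...",
) -> list[dict[str, str]]:
    # Same logic as the notebook cell, repeated so this example is self-contained
    num_messages = len(messages)
    return [
        {"role": m["role"], "content": hidden_obs_content}
        if m["role"] in ["user", "tool"]
        and first_obs_to_show <= idx < num_messages - last_obs_to_show
        else m
        for idx, m in enumerate(messages)
    ]


toy = [
    {"role": "system", "content": "sys"},       # idx 0
    {"role": "user", "content": "task"},        # idx 1
    {"role": "assistant", "content": "act 1"},  # idx 2
    {"role": "tool", "content": "obs 1"},       # idx 3
    {"role": "assistant", "content": "act 2"},  # idx 4
    {"role": "tool", "content": "obs 2"},       # idx 5
]

# With the system message at idx 0, the default first_obs_to_show=1 also hides the task prompt
default_hidden = maybe_hide_observations(toy)
print([m["content"] for m in default_hidden])

# Passing first_obs_to_show=2 keeps both the system message and the task prompt
hidden = maybe_hide_observations(toy, first_obs_to_show=2)
print([m["content"] for m in hidden])
```

This is why the cell above computes `first_obs_to_show` from the length of the context plus one for the system message.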

In [113]:
messages = hf_tokenizer.apply_chat_template(
    maybe_hide_observations([system_message] + sample_messages[:-2]),
    add_generation_prompt=True,
    tokenize=False,
    tools=tools,
)
rich_print_messages(messages)

In [101]:
state = env.reset()

In [7]:
for k in vars(state).keys():
    print(k)


system_prompt
new_messages
model_response
prior_messages
tools
sample_id
generation_id
batch_id
timestep
try_step
metadata
question
answer
grading_rubric
all_docs_dict
tool_registry


In [14]:
state.system_prompt

'You are a helpful assistant that can answer questions and call tools.'

In [12]:
state.new_messages

[{'role': 'user',
  'content': "## Instruction\nGiven a list of titles, think and call tools to answer this question:\n'''\nHatyapuri was a novel by the filmmaker of what nationality?\n'''\n\nYou may only visit the titles provided. Only call the `visit` tool once per turn.\n\nYour final answer should be a concise sentence, in the following format: 'Final Answer: <put your answer here>'.\n\n## Tool Calling\nYou can only search the following titles:\n\n- 'Carazamba'\n- 'Daniel Kehlmann'\n- 'Edouard de Laurot'\n- 'Fernando Vallejo'\n- 'Hatyapuri'\n- 'John Michael McDonagh'\n- 'Peter Weiss'\n- 'Seduction of the Minotaur'\n- 'The Italian (novel)'\n- 'The Maid of Arran'\n\n## Instruction (again)\nNow answer the original question. Recall the question is:\n'''\nHatyapuri was a novel by the filmmaker of what nationality?\n'''\n\nVERY IMPORTANT: You may only use the provided `visit` tool once per turn, and only use the given titles to answer this question. If you provide a title not in the given

In [19]:
state.tools

[{'type': 'function',
  'name': 'visit',
  'description': 'Visit a given title and expand for more information.',
  'parameters': {'type': 'object',
   'properties': {'title': {'type': 'string',
     'description': 'The title to visit'}}},
  'required': ['title']}]

In [60]:
system_message = {"role": "system", "content": state.system_prompt}
messages = hf_tokenizer.apply_chat_template(
    [system_message] + state.new_messages,
    tokenize=False,
    tools=state.tools,
)
print(messages)

<|im_start|>system
You are a helpful assistant that infers reasoning thoughts behind observed actions.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"description": "Visit a given title and expand for more information.", "name": "visit", "parameters": {"properties": {"title": {"description": "The title to visit", "type": "string"}}, "type": "object"}, "required": ["title"], "type": "function"}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
## Instruction
Given a list of titles, think and call tools to answer this question:
'''
Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
'''

You may only visit the titles provided. Only call the `visit` to

### Act-PRM version

In [29]:
env_config = OmegaConf.load("../configs/environments/act_prm/hotpotqa_mc.yaml")
env_config.actions_only = False  # include thoughts and actions in trajectories
env = get_env(**env_config)
state = env.reset()
print(state.new_messages)
print(state.tools)
print(state.system_prompt)


100%|██████████| 1745/1745 [00:00<00:00, 3088.68it/s]

[{'role': 'user', 'content': "## Instruction\nGiven a list of titles, think and call tools to answer this question:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\n\nYou may only visit the titles provided. Only call the `visit` tool once per turn.\n\nYour final answer should be a concise sentence, in the following format: 'Final Answer: <put your answer here>'.\n\n## Tool Calling\nYou can only search the following titles:\n\n- 'Adam (musical)'\n- 'Adam Clayton Powell (film)'\n- 'Adam Clayton Powell Jr.'\n- 'Adam Clayton Powell IV'\n- 'Seventh Avenue (Manhattan)'\n- 'Mother African Methodist Episcopal Zion Church'\n- 'Abyssinian Baptist Church'\n- 'Adam Clayton Powell Jr. State Office Building'\n- 'The Saimaa Gesture'\n- 'Adam Clayton Powell Sr.'\n\n## Instruction (again)\nNow answer the original question. Recall the question is:\n'''\nWhich documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?\n'''\n\
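With `actions_only = False`, trajectories carry both thoughts and actions. For contrast, here is a hedged sketch of what an actions-only assistant message could look like, assuming a thought is any free text preceding the first `<tool_call>` tag, as in the few-shot messages earlier in this notebook. `to_action_only` is a hypothetical helper, not part of `act_prm`, and the environment may strip thoughts differently.

```python
def to_action_only(message: dict, tool_call_bos: str = "<tool_call>") -> dict:
    """Drop any free-text thought that precedes the first tool call (hypothetical helper)."""
    content = message["content"]
    if message["role"] != "assistant" or tool_call_bos not in content:
        # Non-assistant messages and messages without a tool call pass through unchanged
        return message
    action = content[content.index(tool_call_bos):]
    return {**message, "content": action}


full_message = {
    "role": "assistant",
    "content": (
        "Professions of Nicholas Ray are director, screenwriter, and actor. "
        "I need to search Elia Kazan next and find his professions.\n\n"
        '<tool_call>\n{"name": "visit", "arguments": {"title": "Elia Kazan"}}\n</tool_call>'
    ),
}
action_message = to_action_only(full_message)
print(action_message["content"])
```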




In [30]:
env.datasets["train"][0]

([{'content': "## Instruction\nGiven a list of titles, think and call tools to answer this question:\n'''\nWhich listed Alain Resnais film both credits Giovanni Fusco for music and has a screenplay or source by Marguerite Duras, Alain Robbe‑Grillet, Jean Cayrol, Jean Gruault, Jules Feiffer, or Alan Ayckbourn?\n'''\n\nYou may only visit the titles provided. Only call the `visit` tool once per turn.\n\nYour final answer should be a concise sentence, in the following format: 'Final Answer: <put your answer here>'.\n\n## Tool Calling\nYou can only search the following titles:\n\n- 'ASR Nederland'\n- 'All Cried Out (Alison Moyet song)'\n- 'Dutch East India Company'\n- 'Dutch East Indies'\n- 'F. D. J. Pangemanann'\n- 'Getting into Something'\n- 'Giovanni Fusco'\n- 'Hiroshima mon amour'\n- 'Hsieh Yung-kuan'\n- 'I Want to Go Home (film)'\n- 'It Won't Be Long (Alison Moyet song)'\n- 'Je t'aime, je t'aime'\n- 'Last Year at Marienbad'\n- 'Life Is a Bed of Roses'\n- 'Life of Riley (2014 film)'\n- 

In [34]:
system_message = {"role": "system", "content": env.system_prompt}
sample_messages, sample_tools = env.datasets["train"][3]

messages = hf_tokenizer.apply_chat_template(
    [system_message] + sample_messages,
    tokenize=False,
    tools=sample_tools,
)
print(messages)

<|im_start|>system
You are a helpful assistant that infers reasoning thoughts behind observed actions.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"description": "Visit a given title and expand for more information.", "name": "visit", "parameters": {"properties": {"title": {"description": "The title to visit", "type": "string"}}, "type": "object"}, "required": ["title"], "type": "function"}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
## Instruction
Given a list of titles, think and call tools to answer this question:
'''
Which listed pizza brand has the earliest documented founding year?
'''

You may only visit the titles provided. Only call the `visit` tool once per turn.

Your