# Embodied Agent Interface Challenge @ NeurIPS 2025

Welcome to the **Embodied Agent Interface (EAI) Challenge**, a NeurIPS 2025 competition that introduces a unified benchmarking framework for evaluating **Large Language Models (LLMs)** in **embodied decision-making tasks**. This competition aims to foster reproducible research and rigorous analysis in embodied AI, bridging the gap between language modeling and robotic planning.

In this tutorial, we guide you step by step through preparing your submission for the EAI Challenge. To keep participation accessible to the broader embodied AI community, setting up simulators or environments is optional, though you are welcome to do so if you would like to fine-tune your model for better performance. All you need to do is follow the steps outlined here and submit your output files in the required format. Enjoy the journey and have fun!


## Resources
To help you get up to speed and make the most of the **EAI Challenge**, we have prepared a set of essential resources. Feel free to explore them in the following order for the smoothest experience:

- **📄 Paper**: [Understanding the EAI Challenge and its Objectives](https://arxiv.org/abs/2410.07166)

- **📝 Tutorial**: [Step-by-step guide to setting up your environment and understanding the challenge](https://github.com/embodied-agent-interface/embodied-agent-interface)

- **📖 Documentation**: [Complete reference for evaluating and troubleshooting four ability modules](https://embodied-agent-eval.readthedocs.io/)

- **🐳 Docker Image**: [Prebuilt environment for running your experiments hassle-free](https://hub.docker.com/r/jameskrw/eai-eval)

In [1]:
import os
import json
from tqdm.notebook import tqdm
import time
import torch
import openai
from transformers import AutoModelForCausalLM, AutoTokenizer

## Prompt Structure
All the necessary prompt files are located in the `llm_prompts` directory. There are 8 prompt files in total, one for each combination of simulation environment (BEHAVIOR or VirtualHome) and ability module (action sequencing, goal interpretation, subgoal decomposition, transition modeling).
Before we dive into the implementation details, it's crucial to understand the structure of the prompts you'll be working with. Each file is a JSON list, and each prompt in it is a JSON object containing the following fields:

- `identifier`: A unique task identifier for the prompt.
- `llm_prompt`: The actual text prompt to be fed into the language model.

Here's an example of what a prompt might look like:

```json
{
    "identifier": "bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25",
    "llm_prompt": "Problem: You are designing instructions for a household robot. The goal is to guide the robot to modify its environment from ..."
}
```

Familiarizing yourself with this structure will help you navigate the prompt files more effectively and ensure that your submissions are correctly formatted.
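As a quick sanity check on this structure, the sketch below scans a list of prompt objects and reports entries that lack one of the two fields or reuse an identifier. The helper name `check_prompts` is hypothetical, and treating identifiers as unique within one file is an assumption based on the field description above.

```python
import json

REQUIRED_FIELDS = ("identifier", "llm_prompt")

def check_prompts(prompts):
    """Return a list of human-readable problems found in a list of prompt objects."""
    problems = []
    seen = set()
    for i, prompt in enumerate(prompts):
        # Both fields are expected to be present and to hold strings.
        for field in REQUIRED_FIELDS:
            if not isinstance(prompt.get(field), str):
                problems.append(f"entry {i}: missing or non-string '{field}'")
        # Assumption: identifiers do not repeat within one prompt file.
        identifier = prompt.get("identifier")
        if isinstance(identifier, str):
            if identifier in seen:
                problems.append(f"entry {i}: duplicate identifier '{identifier}'")
            seen.add(identifier)
    return problems

# Toy data; with the real files, pass the result of json.load(f) instead.
example = json.loads('[{"identifier": "task_0", "llm_prompt": "Problem: ..."}]')
print(check_prompts(example))
```

An empty list means every entry has both fields and no identifier is repeated.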

In [2]:
prompt_dir = "llm_prompts"
prompt_files =[os.path.join(prompt_dir, file) for file in os.listdir(prompt_dir) if file.endswith(".json")]
prompt_files = sorted(prompt_files)
prompt_files

['llm_prompts/behavior_action_sequencing_prompts.json',
 'llm_prompts/behavior_goal_interpretation_prompts.json',
 'llm_prompts/behavior_subgoal_decomposition_prompts.json',
 'llm_prompts/behavior_transition_modeling_prompts.json',
 'llm_prompts/virtualhome_action_sequencing_prompts.json',
 'llm_prompts/virtualhome_goal_interpretation_prompts.json',
 'llm_prompts/virtualhome_subgoal_decomposition_prompts.json',
 'llm_prompts/virtualhome_transition_modeling_prompts.json']

In [3]:
for prompt_file in prompt_files:
    with open(prompt_file, "r") as f:
        prompts = json.load(f)
    print(f"{os.path.basename(prompt_file)} has {len(prompts)} prompts.")

behavior_action_sequencing_prompts.json has 100 prompts.
behavior_goal_interpretation_prompts.json has 100 prompts.
behavior_subgoal_decomposition_prompts.json has 100 prompts.
behavior_transition_modeling_prompts.json has 100 prompts.
virtualhome_action_sequencing_prompts.json has 342 prompts.
virtualhome_goal_interpretation_prompts.json has 342 prompts.
virtualhome_subgoal_decomposition_prompts.json has 338 prompts.
virtualhome_transition_modeling_prompts.json has 296 prompts.


## Submission Preparation


### Approach 1: Using Hugging Face Transformers
If you prefer to run a model locally with the Hugging Face `transformers` library, here are the key steps to prepare your submission:

1. **Install Dependencies**: Make sure you have the necessary libraries installed. You can do this by running:
   ```bash
   pip install transformers torch accelerate
   ```

2. **Load Model and Tokenizer**: Use the following code to load the model and tokenizer:
   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model = AutoModelForCausalLM.from_pretrained("your_model_name")
   tokenizer = AutoTokenizer.from_pretrained("your_model_name")
   ```

3. **Generate Outputs**: Use the model to generate outputs for each prompt. Make sure to save the outputs in the specified format.

   - `identifier`: A unique task identifier for the prompt.
   - `llm_output`: The actual text output generated by the language model.

   Here is an example of the expected output format for a corresponding input prompt:

   ```json
   {
      "identifier": "bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25",
      "llm_output": "[{\"action\":\"RIGHT_GRASP\",\"object\":\"plywood_0\"},{\"action\":\"RIGHT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"},       {\"action\":\"LEFT_GRASP\",\"object\":\"plywood_1\"},{\"action\":\"LEFT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"},{\"action\":\"LEFT_GRASP\",\"object\":\"plywood_2\"},{\"action\":\"LEFT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"}]"
   }
   ```


4. **Organize Output Files**: Place all generated output files in a dedicated directory, such as `sample_submission`.
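Note that in the example above `llm_output` is a string holding the model's raw text, not a nested JSON list; the escaped quotes appear only because the string is serialized inside a JSON file. The sketch below illustrates this with a shortened, made-up model response. The helper `make_record` is hypothetical and not part of the challenge code.

```python
import json

def make_record(identifier, model_text):
    """Pair a prompt identifier with the raw text returned by the model."""
    return {"identifier": identifier, "llm_output": model_text}

# Shortened stand-in for a model response (illustrative only).
model_text = '[{"action":"RIGHT_GRASP","object":"plywood_0"}]'
record = make_record(
    "bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25", model_text
)

# json.dumps escapes the inner quotes; json.loads restores the original string.
serialized = json.dumps([record], indent=4)
restored = json.loads(serialized)
print(restored[0]["llm_output"] == model_text)
```

Because the text is stored verbatim, there is no need to parse or re-encode the model's answer before saving it.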

Below we use Qwen3-14B as an example to demonstrate this process; other Qwen3 checkpoints, such as the smaller Qwen3-0.6B, are used the same way. More details can be found in the [official documentation](https://huggingface.co/Qwen/Qwen3-0.6B).

In [None]:
model_name = "Qwen/Qwen3-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; remove this line to use the default attention
)

In [5]:
def qwen_gen(model, tokenizer, prompt):
    # prepare the model input
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False # switches between thinking and non-thinking modes. Default is True.
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # conduct text completion
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=40960
    )
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

    # parsing thinking content
    try:
        # rindex finding 151668 (</think>)
        index = len(output_ids) - output_ids[::-1].index(151668)
    except ValueError:
        index = 0

    # thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
    content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

    return content
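
The index arithmetic in `qwen_gen` finds the position just after the last `</think>` token by searching the reversed list. The toy sketch below isolates that step on plain integers; the helper name `split_at_last` and the sample token ids other than 151668 are made up for illustration.

```python
THINK_END = 151668  # token id of </think>, as noted in the comment in qwen_gen

def split_at_last(ids, marker):
    """Split ids just after the last occurrence of marker (no split if absent)."""
    try:
        # Searching the reversed list finds the last occurrence in the original.
        index = len(ids) - ids[::-1].index(marker)
    except ValueError:
        index = 0  # marker not present: everything counts as the answer
    return ids[:index], ids[index:]

thinking, answer = split_at_last([11, 12, THINK_END, 21, 22], THINK_END)
print(thinking, answer)  # [11, 12, 151668] [21, 22]
```

With `enable_thinking=False` the marker is normally absent, so the whole output is returned as the answer.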

In [None]:
for prompt_file in tqdm(prompt_files, desc="Prompt files", leave=True):
    with open(prompt_file, "r") as f:
        prompts = json.load(f)

    responses = []
    for prompt in tqdm(prompts, desc=f"Processing {os.path.basename(prompt_file)}", leave=True):
        identifier = prompt['identifier']
        prompt_text = prompt['llm_prompt']
        llm_output = qwen_gen(model, tokenizer, prompt_text)
        responses.append({
            "identifier": identifier,
            "llm_output": llm_output
        })

    outputs_save_path = os.path.join("sample_submission", f"{os.path.basename(prompt_file).split('_prompts')[0]}_outputs.json")
    os.makedirs(os.path.dirname(outputs_save_path), exist_ok=True)
    with open(outputs_save_path, "w") as f:
        json.dump(responses, f, indent=4)
# clear the GPU memory
del tokenizer
del model
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

### Approach 2: Access Models through API
If you prefer to use proprietary models through an API, here are the key steps to prepare your submission:

1. **Choose a Model**: Select a proprietary model that suits your needs. Make sure to review the documentation for any specific requirements or limitations.

2. **Set Up API Access**: Follow the provider's instructions to set up API access. This may involve creating an account, obtaining API keys, and installing any necessary libraries.

3. **Generate Outputs**: Use the model to generate outputs for each prompt. Make sure to save the outputs in the specified format.

   - `identifier`: A unique task identifier for the prompt.
   - `llm_output`: The actual text output generated by the language model.

   Here is an example of the expected output format for a corresponding input prompt:

   ```json
   {
      "identifier": "bringing_in_wood_0_Benevolence_1_int_0_2021-09-15_18-42-25",
      "llm_output": "[{\"action\":\"RIGHT_GRASP\",\"object\":\"plywood_0\"},{\"action\":\"RIGHT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"},       {\"action\":\"LEFT_GRASP\",\"object\":\"plywood_1\"},{\"action\":\"LEFT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"},{\"action\":\"LEFT_GRASP\",\"object\":\"plywood_2\"},{\"action\":\"LEFT_PLACE_ONTOP\",\"object\":\"room_floor_kitchen_0\"}]"
   }
   ```

4. **Organize Output Files**: Place all generated output files in a dedicated directory, such as `sample_submission`.

In [None]:
# Here we use OpenAI-compatible API as an example
# Depending on your LLM provider, you might need to adjust the client initialization and request format to support concurrent requests
def model_gen(model_name, prompt, max_retries=88):
    client = openai.OpenAI(
        base_url='<Put Your API URL Here>', # e.g. https://openrouter.ai/api/v1
        api_key='<Put Your API KEY Here>'
    )
    current_attempt = 0
    delay = 1
    while current_attempt < max_retries:
        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=[
                    {
                        'role': 'user',
                        'content': prompt
                    }
                ],
                stream=False
            )
            # return response.choices[0].message.reasoning_content, response.choices[0].message.content
            return response.choices[0].message.content
        except Exception as e:
            print(f"An error occurred on attempt {current_attempt + 1}: {e}")
            current_attempt += 1
            if current_attempt < max_retries:
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
                delay = min(delay * 2, 60)  # exponential backoff, capped at 60 s so later retries stay practical
            else:
                print(f"All {max_retries} retry attempts failed for model {model_name}.")
                return None
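
The comment at the top of this cell mentions concurrent requests. One possible way to issue them is a thread pool, sketched below under the assumption that your provider tolerates parallel calls within its rate limits. The helper `generate_concurrently` and its `max_workers` default are hypothetical; `generate` stands for any function that maps a prompt string to an output string.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_concurrently(prompts, generate, max_workers=8):
    """Call generate(prompt_text) for every prompt in parallel, keeping input order."""
    texts = [p["llm_prompt"] for p in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs.
        outputs = list(pool.map(generate, texts))
    return [
        {"identifier": p["identifier"], "llm_output": out}
        for p, out in zip(prompts, outputs)
    ]

# Possible usage with the function above (model_name is set in the next cell):
# responses = generate_concurrently(prompts, lambda text: model_gen(model_name, text))
```

Threads suit this workload because each call spends most of its time waiting on the network; lower `max_workers` if the API starts returning rate-limit errors.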

In [None]:
model_name = "meta-llama/llama-4-maverick"  # Replace with your desired model name
for prompt_file in tqdm(prompt_files, desc="Prompt files", leave=True):
    with open(prompt_file, "r") as f:
        prompts = json.load(f)

    responses = []
    for prompt in tqdm(prompts, desc=f"Processing {os.path.basename(prompt_file)}", leave=True):
        identifier = prompt['identifier']
        prompt_text = prompt['llm_prompt']
        # reasoning, answer = model_gen(model_name, prompt_text)
        answer = model_gen(model_name, prompt_text)
        responses.append({
            "identifier": identifier,
            "llm_output": answer
        })

    outputs_save_path = os.path.join("sample_submission", f"{os.path.basename(prompt_file).split('_prompts')[0]}_outputs.json")
    os.makedirs(os.path.dirname(outputs_save_path), exist_ok=True)
    with open(outputs_save_path, "w") as f:
        json.dump(responses, f, indent=4)

## Submission on EvalAI
Before you submit your outputs, ensure that they are properly organized and meet the submission requirements. Here are some key steps to follow:

1. **Organize Output Files**: Place all generated output files in a dedicated directory, such as `sample_submission`.
2. **Naming Convention**: Name each output file after its prompt file, replacing the `_prompts` suffix with `_outputs` (for example, `behavior_action_sequencing_prompts.json` becomes `behavior_action_sequencing_outputs.json`).
3. **Review File Structure**: Double-check the directory structure to make sure it matches the expected format for submission.
4. **Final Review**: Conduct a final review of all output files to ensure they are correctly formatted and contain the expected data.
5. **Zip the Directory**: Compress the `sample_submission` directory into a zip file for submission.
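
The review steps above can be partly automated. The sketch below compares each prompt file with its expected output file and reports missing files, identifiers without an output, and entries whose `llm_output` is not a string (for example `null` after failed API retries). The helper `check_submission` is hypothetical, and the requirement that every prompt identifier has an output is an assumption rather than a documented rule of the evaluator.

```python
import json
import os

def check_submission(prompt_dir, submission_dir):
    """Return a list of problems found when comparing outputs against prompts."""
    problems = []
    for name in sorted(os.listdir(prompt_dir)):
        if not name.endswith("_prompts.json"):
            continue
        out_name = name.replace("_prompts.json", "_outputs.json")
        out_path = os.path.join(submission_dir, out_name)
        if not os.path.isfile(out_path):
            problems.append(f"missing file: {out_name}")
            continue
        with open(os.path.join(prompt_dir, name), "r") as f:
            expected = {p["identifier"] for p in json.load(f)}
        with open(out_path, "r") as f:
            outputs = json.load(f)
        found = {o.get("identifier") for o in outputs}
        missing = expected - found
        if missing:
            problems.append(f"{out_name}: {len(missing)} identifiers without an output")
        bad = [o for o in outputs if not isinstance(o.get("llm_output"), str)]
        if bad:
            problems.append(f"{out_name}: {len(bad)} entries with a non-string llm_output")
    return problems

# Possible usage: print(check_submission("llm_prompts", "sample_submission"))
```

An empty result means all expected files exist and every prompt has a string output.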

An example submission structure should look like this:

```
sample_submission.zip/
├── behavior_action_sequencing_outputs.json
├── behavior_goal_interpretation_outputs.json
├── behavior_subgoal_decomposition_outputs.json
├── behavior_transition_modeling_outputs.json
├── virtualhome_action_sequencing_outputs.json
├── virtualhome_goal_interpretation_outputs.json
├── virtualhome_subgoal_decomposition_outputs.json
└── virtualhome_transition_modeling_outputs.json
```
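
One way to produce such an archive from Python is `shutil.make_archive`, sketched below. It assumes the output files should sit at the top level of the zip, as the tree above suggests; the helper name `zip_submission` is hypothetical.

```python
import os
import shutil
import zipfile

def zip_submission(submission_dir="sample_submission"):
    """Zip the contents of submission_dir and return (archive_path, member_names)."""
    base = os.path.normpath(submission_dir)  # drop any trailing slash
    # root_dir makes archive paths relative to the directory itself,
    # so the JSON files end up at the top level of <base>.zip.
    archive_path = shutil.make_archive(base, "zip", root_dir=base)
    with zipfile.ZipFile(archive_path) as zf:
        names = sorted(zf.namelist())
    return archive_path, names

# Possible usage:
# path, names = zip_submission("sample_submission")
# print(path, names)
```

Listing the member names after zipping is a cheap way to confirm the layout before uploading.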

After making sure the submission structure is correct, follow the steps on the [Participate page](https://neurips25-eai.github.io/participate) to complete your submission on EvalAI, then track your performance on the [Leaderboard](https://eval.ai/web/challenges/challenge-page/2621/leaderboard).

Note:
The 8 output files in the `sample_submission` folder were generated by the Qwen3-14B model. They will not give you a very high score, but they can serve as a useful reference for your own submissions.