
The Habitat simulator stops after evaluating a few episodes #36

@haliphinx

Hello authors, thanks for open-sourcing this excellent work. I encountered a problem while evaluating the model on R2R val_unseen.

I set up the environment following the instructions and used the same evaluation scripts and configs as provided. The evaluation runs fine at the beginning, but stops at a random point after evaluating one or two episodes.

After checking the output, I found that the problem might come from the Habitat simulator, which is not part of this repo. However, I'm still wondering whether you have encountered similar problems before. The printed log, including the error messages, is as follows:

......
↑↑↑↑<|im_end|>
[18:25:50.242867] actions [1, 1, 1, 1]
[18:26:10.990485] <|im_start|>assistant
↑↑↑←<|im_end|>
[18:26:10.990602] actions [1, 1, 1, 2]
[18:26:31.169344] 64 You are an autonomous navigation assistant. Your task is to Move forward to the doorway on the opposite side of the hall.  Stop in the archway.  Devise an action sequence to follow the instruction using the four actions: TURN LEFT (←) or TURN RIGHT (→) by 15 degrees, MOVE FORWARD (↑) by 25 centimeters, or STOP. These are your historical observations <memory>.
[18:26:31.730885] <|im_start|>assistant
↑STOP<|im_end|>
[18:26:31.730971] actions [1, 0]
Fatal Python error: Aborted

Thread 0x00007f4d8481b640 (most recent call first):
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 316 in wait
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 581 in wait
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 980 in _bootstrap_inner
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 937 in _bootstrap

Thread 0x00007f4d8781c640 (most recent call first):
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 316 in wait
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 581 in wait
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 980 in _bootstrap_inner
  File ".../miniconda3/envs/streamvln/lib/python3.9/threading.py", line 937 in _bootstrap

Current thread 0x00007f58ed2c2740 (most recent call first):
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/habitat_sim-0.2.4-py3.9-linux-x86_64.egg/habitat_sim/simulator.py", line 780 in get_observation
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/habitat_sim-0.2.4-py3.9-linux-x86_64.egg/habitat_sim/simulator.py", line 458 in get_sensor_observations
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/habitat_sim-0.2.4-py3.9-linux-x86_64.egg/habitat_sim/simulator.py", line 522 in step
  File ".../VLN_workspace/StreamVLN/habitat-lab/habitat-lab/habitat/sims/habitat_simulator/habitat_simulator.py", line 418 in step
  File ".../VLN_workspace/StreamVLN/habitat-lab/habitat-lab/habitat/tasks/nav/nav.py", line 1051 in step
  File ".../VLN_workspace/StreamVLN/habitat-lab/habitat-lab/habitat/core/embodied_task.py", line 311 in _step_single_action
  File ".../VLN_workspace/StreamVLN/habitat-lab/habitat-lab/habitat/core/embodied_task.py", line 333 in step
  File ".../VLN_workspace/StreamVLN/habitat-lab/habitat-lab/habitat/core/env.py", line 309 in step
  File ".../VLN_workspace/StreamVLN/streamvln/streamvln_eval.py", line 344 in eval_action
  File ".../VLN_workspace/StreamVLN/streamvln/streamvln_eval.py", line 553 in evaluate
  File ".../VLN_workspace/StreamVLN/streamvln/streamvln_eval.py", line 534 in eval
  File ".../VLN_workspace/StreamVLN/streamvln/streamvln_eval.py", line 584 in <module>
[2025-08-11 18:26:52,339] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 649640) of binary: ...//miniconda3/envs/streamvln/bin/python3.9
Traceback (most recent call last):
  File "...//miniconda3/envs/streamvln/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "...//miniconda3/envs/streamvln/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "...//miniconda3/envs/streamvln/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "...//miniconda3/envs/streamvln/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File ".../miniconda3/envs/streamvln/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File .../miniconda3/envs/streamvln/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
streamvln/streamvln_eval.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-11_18:26:52
  host      : cudo-gpu-ai-2-cluster-4763e7cd
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 649640)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 649640
=======================================================
