[CI] Fix test_inference_engines_generation after vllm 0.16.0 upgrade; Use the correct GSM8k path for test_generator_multi_turn_gsm8k_router_replay#1339

Merged
SumanthRH merged 2 commits into main from fix-ci-failures-0317
Mar 18, 2026

@SumanthRH SumanthRH commented Mar 18, 2026

What does this PR do?

Fixes the CI failures currently on main: https://github.com/NovaSky-AI/SkyRL/actions/runs/23218100038/job/67484003027

  1. tests/backends/skyrl_train/gpu/gpu_ci/test_skyrl_gym_generator.py::test_generator_multi_turn_gsm8k_router_replay - FileNotFoundError: Unable to find '/mnt/cluster_storage/data/gsm8k/validation.parquet' -> The R3 PR (Rollout Routing Replay, #1273) added a router replay test but used an incorrect path for the GSM8K dataset on CI.
  2. tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_weight_sync.py::TestColocatedIpcWeightUpdateFlow::test_update_weights_ipc -> Fixed in a similar way to #1301 ([CI] Skip FlashRL integration test in CI and fix failing generation test for new inference codepath).
=========================== short test summary info ============================
FAILED tests/backends/skyrl_train/gpu/gpu_ci/test_engine_generation.py::test_inference_engines_generation[tp2_pp1_dp2] - ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() (pid=25165, ip=10.0.25.170, actor_id=7617504c4a769d500d1bc9ef13000000, repr=<skyrl.backends.skyrl_train.inference_engines.vllm.vllm_engine.AsyncVLLMInferenceEngine object at 0x7ea2b748c800>)
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
           ^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-03-17_21-54-20_786615_3475/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_95e2ed2914e100bfad4cccac453e4b5b/skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 343, in __init__
    super().__init__(*args, **kwargs)
  File "/tmp/ray/session_2026-03-17_21-54-20_786615_3475/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_95e2ed2914e100bfad4cccac453e4b5b/skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 112, in __init__
    self.llm = self._create_engine(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-03-17_21-54-20_786615_3475/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_95e2ed2914e100bfad4cccac453e4b5b/skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 364, in _create_engine
    engine = vllm.AsyncLLMEngine.from_engine_args(engine_args, stat_loggers=stat_loggers)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpuPz58u/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 251, in from_engine_args
    return cls(
           ^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpuPz58u/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 148, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpuPz58u/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
    return DPAsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpuPz58u/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1082, in __init__
    self._ensure_stats_update_task()
  File "/home/ray/.cache/uv/builds-v0/.tmpuPz58u/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1091, in _ensure_stats_update_task
    assert self.stats_update_address is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
FAILED tests/backends/skyrl_train/gpu/gpu_ci/test_skyrl_gym_generator.py::test_generator_multi_turn_gsm8k_router_replay - FileNotFoundError: Unable to find '/mnt/cluster_storage/data/gsm8k/validation.parquet'
ERROR tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_weight_sync.py::TestColocatedIpcWeightUpdateFlow::test_update_weights_ipc - AssertionError
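
The FileNotFoundError in item 1 comes down to where the test looks for the GSM8K parquet file. As a minimal sketch (the helper name `resolve_dataset_path` is hypothetical and not part of SkyRL), `os.path.expanduser` only rewrites a leading `~`, so an absolute path such as the one in the error message passes through unchanged, and checking for the file up front gives a clearer failure:

```python
import os


def resolve_dataset_path(path: str) -> str:
    """Hypothetical helper: expand a leading '~' and fail early if the file is missing."""
    expanded = os.path.expanduser(path)
    if not os.path.isfile(expanded):
        raise FileNotFoundError(f"Unable to find '{expanded}'")
    return expanded


# An absolute path has no '~' to expand, so expanduser returns it unchanged.
ci_path = "/mnt/cluster_storage/data/gsm8k/validation.parquet"
assert os.path.expanduser(ci_path) == ci_path

# A '~'-prefixed path is resolved relative to the current user's home directory.
home_path = os.path.expanduser("~/data/gsm8k/validation.parquet")
```

Whether a given CI machine stores the dataset under the home directory or a shared mount is an environment detail; the sketch only shows why the two spellings resolve to different locations.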


Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review March 18, 2026 01:23

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses two CI failures. The first is a FileNotFoundError which is resolved by correcting a dataset path. The second failure, an AssertionError following a vllm upgrade, is fixed by parameterizing a test to include a MoE model, which likely covers the failing code path. The changes are well-targeted and correct. I have one minor suggestion to improve code clarity by removing a redundant parameter.

  max_input_length=max_input_length,
  max_generate_length=1000,
- data_path=os.path.expanduser("/mnt/cluster_storage/data/gsm8k/validation.parquet"),
+ data_path=os.path.expanduser("~/data/gsm8k/validation.parquet"),

medium

This line explicitly sets data_path to the same value as its default in the run_generator_end_to_end function signature. To reduce redundancy and improve maintainability, you can remove this line and rely on the default value.
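
The suggestion can be illustrated with a small stand-in. This is only a sketch: `run_generator_stub` and `DEFAULT_GSM8K_PATH` are hypothetical names, and the real `run_generator_end_to_end` signature is not shown in this thread. It demonstrates that omitting an argument equal to the default leaves the resulting configuration unchanged:

```python
import os

# Assumed default, mirroring the path used in the fix.
DEFAULT_GSM8K_PATH = os.path.expanduser("~/data/gsm8k/validation.parquet")


def run_generator_stub(
    max_generate_length: int = 1000,
    data_path: str = DEFAULT_GSM8K_PATH,
) -> dict:
    """Stand-in for a test helper; returns the settings it would run with."""
    return {"max_generate_length": max_generate_length, "data_path": data_path}


# Passing the default explicitly and omitting it produce the same configuration.
explicit = run_generator_stub(max_generate_length=1000, data_path=DEFAULT_GSM8K_PATH)
implicit = run_generator_stub(max_generate_length=1000)
assert explicit == implicit
```

Keeping the explicit argument is harmless, but it has to be updated by hand if the default ever changes, which is the maintainability point the review raises.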


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


@SumanthRH SumanthRH merged commit be602d8 into main Mar 18, 2026
5 of 6 checks passed
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
…e; Use the correct GSM8k path for `test_generator_multi_turn_gsm8k_router_replay` (#1339)