[skyrl] Pass through profiler_config to vLLM engine #1622
pcmoritz wants to merge 1 commit into NovaSky-AI:main from
Conversation
Code Review
This pull request implements profiling support for vLLM inference engines and remote clients, adding start_profile and stop_profile functionality across the stack and improving error reporting in the remote client by capturing response bodies on failure. Feedback indicates that AsyncVLLMInferenceEngine requires a lock to serialize profiling during concurrent generate calls, as vLLM's profiler is global and does not support overlapping spans.
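As a hedged illustration of the remote-client change described above (SkyRL's actual HTTP client and helper names are not shown in this excerpt; aiohttp and the function below are assumptions), capturing the response body before raising turns an opaque status-code failure into an actionable error message:

import aiohttp

async def post_json(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    # Hypothetical helper: on failure, read the body first so the raised
    # error carries the server's explanation, not just the status code.
    async with session.post(url, json=payload) as resp:
        if resp.status >= 400:
            body = await resp.text()
            raise RuntimeError(f"POST {url} failed with HTTP {resp.status}: {body}")
        return await resp.json()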
self._weight_loader = VLLMWeightLoader(self.llm, is_async=True)
# vLLM raises if profile() is called without profiler_config; gate on it.
self._profile_enabled = self.llm.vllm_config.profiler_config.profiler is not None
self._profile_counter = 0
Since AsyncVLLMInferenceEngine is an asynchronous engine, multiple generate calls can run concurrently. Because vLLM's profiler is global, concurrent profiling attempts will conflict. We should initialize a lock here to serialize profiling in the generate method.
Suggested change:

self._profile_counter = 0
self._profile_lock = asyncio.Lock()
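A minimal, self-contained sketch of the serialization this buys (the prints stand in for vLLM's start_profile/stop_profile calls; nothing here is vLLM API): with the lock held across the whole span, the two start/stop pairs can never interleave.

import asyncio

profile_lock = asyncio.Lock()

async def profiled_generate(name: str) -> None:
    async with profile_lock:  # only one profiling span active at a time
        print(f"start profile {name}")
        await asyncio.sleep(0.1)  # stands in for generation work
        print(f"stop profile {name}")

async def main() -> None:
    # Without the lock, both spans would open before either closes.
    await asyncio.gather(profiled_generate("sample_0"), profiled_generate("sample_1"))

asyncio.run(main())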
if self._profile_enabled:
    await self.llm.start_profile(profile_prefix=f"sample_{self._profile_counter}")
    self._profile_counter += 1
try:
    tasks = []
    for prompt in prompt_token_ids:
        # Schedule the collection of outputs for each prompt.
        # Avoid duplicate request_ids.
        request_id = str(uuid4().hex)
        task = asyncio.create_task(self._collect_outputs(prompt, request_id, sampling_params))
        tasks.append(task)
    outputs = await asyncio.gather(*tasks)
finally:
    if self._profile_enabled:
        await self.llm.stop_profile()
As mentioned in the __init__ comment, vLLM's profiler is global and does not support concurrent profiling spans. If multiple generate calls occur simultaneously while profiling is enabled, they will interfere with each other. This block should be protected by a lock to ensure that only one profiling session is active at a time.
Suggested change:

async def _do_generate():
    tasks = [
        asyncio.create_task(self._collect_outputs(p, str(uuid4().hex), sampling_params))
        for p in prompt_token_ids
    ]
    return await asyncio.gather(*tasks)

if self._profile_enabled:
    async with self._profile_lock:
        await self.llm.start_profile(profile_prefix=f"sample_{self._profile_counter}")
        self._profile_counter += 1
        try:
            outputs = await _do_generate()
        finally:
            await self.llm.stop_profile()
else:
    outputs = await _do_generate()
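One property of this suggestion worth noting: the lock is held across the entire profiled generation, so with profiling enabled, concurrent generate calls execute one at a time, while the else branch preserves full concurrency when profiling is off. Locking only the start_profile/stop_profile calls themselves would be cheaper, but traces from overlapping calls would then bleed into a single global profile, which is exactly what the comment above warns against.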
Can be used via
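A hypothetical end-to-end sketch, since the exact configuration surface is not shown in this excerpt; the import path, constructor signature, and profiler_config fields below are all assumptions inferred from the gating check in the diff:

import asyncio

# Hypothetical import path; the real module layout is not shown in this PR excerpt.
from skyrl.inference_engines.vllm_engine import AsyncVLLMInferenceEngine

async def main() -> None:
    # Hypothetical constructor arguments: profiler_config is assumed to be
    # forwarded into vLLM's engine config, where a non-None "profiler" field
    # flips self._profile_enabled on (see the gating check in the diff above).
    engine = AsyncVLLMInferenceEngine(
        model="Qwen/Qwen2.5-7B-Instruct",
        profiler_config={"profiler": "torch"},
    )
    # Each profiled generate call then writes one trace, prefixed
    # sample_0, sample_1, ... by the counter shown in the diff.
    outputs = await engine.generate(prompt_token_ids=[[1, 2, 3]], sampling_params=None)
    print(outputs)

asyncio.run(main())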