Fix some minor issues and provide tests for Pipeline #4365
lvhan028 merged 4 commits into InternLM:main from
Conversation
91d7db2 to fb67ef8
Pull request overview
This PR improves robustness and API consistency in LMDeploy’s inference pipeline/engine, and adds an end-to-end test suite intended to cover key Pipeline behaviors and edge cases (notably max_new_tokens=0).
Changes:
- Update `AsyncEngine.generate()` to early-exit when `max_new_tokens` resolves to `0`, and add a default `EngineOutput` fallback when the engine yields no outputs.
- Forward `**kwargs` from `Pipeline.chat()` into `stream_infer()` for better parameter propagation.
- Add a new `tests/test_lmdeploy/test_pipeline.py` integration test suite for Pipeline infer/stream/chat/session/ppl behaviors, including `max_new_tokens=0`.
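The kwargs-forwarding change follows a common delegation pattern. A minimal standalone sketch (with hypothetical simplified signatures, not LMDeploy's actual classes) of what forwarding `**kwargs` from `chat()` into `stream_infer()` looks like:

```python
class Pipeline:
    def __init__(self, engine):
        self.engine = engine

    def stream_infer(self, prompts, gen_config=None, **kwargs):
        # The engine sees every keyword argument the caller supplied.
        return self.engine.infer(prompts, gen_config=gen_config, **kwargs)

    def chat(self, prompt, gen_config=None, **kwargs):
        # Before a fix like this, kwargs could be silently dropped here;
        # forwarding them keeps chat() consistent with stream_infer().
        return self.stream_infer([prompt], gen_config=gen_config, **kwargs)


class DummyEngine:
    def infer(self, prompts, gen_config=None, **kwargs):
        return kwargs  # echo kwargs so propagation can be verified


pipe = Pipeline(Dummy_engine := DummyEngine())
print(pipe.chat("hi", do_preprocess=False))  # kwargs reach the engine
```

The dummy engine simply echoes the keyword arguments, which makes it easy to assert that nothing was dropped along the call chain.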
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_pipeline.py | Adds Pipeline integration tests across backends, streaming/chat flows, and max_new_tokens=0 edge case. |
| lmdeploy/serve/core/async_engine.py | Refactors gen-config determination/early-exit and initializes default EngineOutput to handle empty engine generators. |
| lmdeploy/pipeline.py | Fixes Pipeline.chat() kwarg propagation to stream_infer(). |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
553f0aa to 3d325fa
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
lmdeploy/serve/core/async_engine.py (Outdated)
```python
gen_config = self._determine_gen_config(session, input_ids, gen_config=gen_config)

if gen_config.max_new_tokens == 0:
    logger.error(f'run out of tokens. session={session_id}.')
```
Logging "run out of tokens" at error level is misleading when the user explicitly sets `max_new_tokens=0` (a valid request). Consider lowering the level (info/debug) and/or changing the message to reflect an intentional zero-token generation.
```diff
- logger.error(f'run out of tokens. session={session_id}.')
+ logger.info(f'no tokens requested (max_new_tokens=0). session={session_id}.')
```
3d325fa to 86e3be7
```python
req_stats = RequestStats(prompt_tokens=input_len)  # per-request stats

# We use this as default outputs in case the async_stream_infer of the Engine yields empty generator.
outputs = EngineOutput(ResponseType.INTERNAL_ENGINE_ERROR, [])
```
Could you let us know which case yields an empty generator?
lmdeploy/lmdeploy/turbomind/turbomind.py, lines 784 to 786 in 59e8944:
If we cancel a request before the generation of the first token, I think we will trigger this case.
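The hazard being guarded against can be reproduced with a plain async generator. This standalone sketch (not LMDeploy code) shows why `outputs` needs a default value when the stream yields nothing, e.g. when a request is cancelled before the first token:

```python
import asyncio

async def empty_stream():
    # Simulates a streaming-inference generator that is cancelled
    # before producing its first item: it yields nothing at all.
    return
    yield  # unreachable, but makes this function an async generator

async def consume(default):
    # Without this default, `outputs` would be unbound after the loop
    # when the generator produces zero items.
    outputs = default
    async for outputs in empty_stream():
        pass
    return outputs

result = asyncio.run(consume(default="fallback-output"))
print(result)  # → fallback-output
```

With zero iterations, the loop variable is never assigned, so the pre-initialized default is what the caller receives instead of an `UnboundLocalError`.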
86e3be7 to b86d027
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fc5f4a7 to 6b3181f
PR Description
Motivation
This PR addresses several minor issues in the LMDeploy inference pipeline and engine to improve robustness, fix edge cases in token generation, and establish comprehensive test coverage for the pipeline functionality.
Modification
1. Fix kwargs forwarding in pipeline chat method

The `chat` method in the `Pipeline` class was not properly forwarding keyword arguments to `stream_infer`, which could cause unexpected behavior when users passed custom generation parameters. This fix ensures all kwargs are correctly propagated through the call chain.

2. Fix off-by-one error when `max_new_tokens=0`
Previously, when `max_new_tokens` was set to 0, the engine would still generate 1 extra token due to incorrect boundary-checking logic. This PR refactors the token-limit validation in `AsyncEngine` to handle the zero-token case correctly, ensuring immediate termination without producing any output tokens when explicitly requested.

3. Provide default engine outputs for empty generators
Added defensive initialization of `EngineOutput` with `INTERNAL_ENGINE_ERROR` status as a fallback when `async_stream_infer` yields an empty generator. This prevents potential unbound-variable errors and provides clearer error signaling in edge cases where the engine fails to produce outputs.

4. Add comprehensive pipeline test suite
Introduced a new test file `tests/test_lmdeploy/test_pipeline.py` with extensive coverage of:

- inference, streaming, chat, and session behaviors
- perplexity computation (`get_ppl`)
- edge cases such as `max_new_tokens=0` and varying generation configurations

The test suite uses pytest parametrization to ensure consistent behavior across both backend implementations.
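As an illustration of the `max_new_tokens=0` edge case, a test in such a suite might look roughly like the sketch below. All names here (`FakePipeline`, `FakeResponse`, `GenConfig`) are hypothetical stand-ins, not LMDeploy's real classes; the actual suite builds a Pipeline against real backends:

```python
class FakeResponse:
    def __init__(self, text, generate_token_len):
        self.text = text
        self.generate_token_len = generate_token_len

class GenConfig:
    def __init__(self, max_new_tokens):
        self.max_new_tokens = max_new_tokens

class FakePipeline:
    def __call__(self, prompts, gen_config=None):
        max_new = gen_config.max_new_tokens if gen_config else 16
        if max_new == 0:
            # Early exit: zero tokens requested, zero tokens produced.
            return [FakeResponse("", 0) for _ in prompts]
        return [FakeResponse("hello", max_new) for _ in prompts]

def test_max_new_tokens_zero():
    pipe = FakePipeline()
    outputs = pipe(["hi"], gen_config=GenConfig(max_new_tokens=0))
    # The fixed engine must terminate immediately without generating.
    assert outputs[0].generate_token_len == 0
    assert outputs[0].text == ""

test_max_new_tokens_zero()
```

The key assertion is that a zero-token request yields a well-formed response with an empty text and a token count of exactly 0, rather than one stray token.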