[None][refactor] simplify get_stats and get_kvcache_events with rpc #9980
Conversation
/bot run --disable-fail-fast
📝 Walkthrough

This change migrates statistics and KV cache events retrieval from IPC-based continuous streaming to RPC-based on-demand fetching. The proxy layer generates RPC addresses and creates clients; the worker acts as an RPC server and serializes stats/events; dispatch threads for IPC are removed; tests shift to polling patterns.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Proxy as GenerationExecutorProxy
    participant RpcClient as RPCClient
    participant Worker as GenerationExecutorWorker
    participant RpcServer as RPC Server
    rect rgb(200, 220, 240)
        note over Proxy,RpcServer: Old Flow (IPC-based Streaming)
        Proxy->>Proxy: Create IPC queues<br/>(mp_stats_queue, kv_events_queue)
        Proxy->>Proxy: Start dispatch threads<br/>(stats, kv_events)
        Worker->>Proxy: Stream stats/events continuously<br/>via IPC queues
        Proxy->>Proxy: Dispatch threads receive & buffer
        User->>Proxy: get_stats() / get_kv_events()
        Proxy->>Proxy: Return buffered results
    end
    rect rgb(220, 240, 200)
        note over Proxy,RpcServer: New Flow (RPC-based On-Demand)
        Proxy->>Proxy: Generate unique RPC address & HMAC key
        Proxy->>Worker: Pass rpc_addr, hmac_key during init
        Worker->>RpcServer: start_rpc_server(rpc_addr, hmac_key)
        RpcServer-->>Worker: RPC server running
        Proxy->>RpcClient: Create RPCClient(rpc_addr, hmac_key)
        RpcClient-->>Proxy: Connected
        User->>Proxy: get_stats(timeout) / aget_stats(timeout)
        Proxy->>RpcClient: Call fetch_stats_async()
        RpcClient->>RpcServer: RPC request to Worker
        Worker->>Worker: Serialize stats via _stats_serializer
        Worker-->>RpcServer: Return serialized stats
        RpcServer-->>RpcClient: RPC response
        RpcClient-->>Proxy: Return List[dict]
        Proxy-->>User: Return stats/events
    end
```
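The new on-demand flow can be sketched in-process. The toy classes below stand in for the real `GenerationExecutorWorker`/`RPCClient` pair; `fetch_stats_async` and the JSON serialization mirror the diagram, while everything else (class names, stats fields) is illustrative:

```python
import asyncio
import json

class ToyWorker:
    """Stand-in for the worker side of the RPC server."""

    def fetch_stats(self):
        # In the real worker this is a blocking engine call.
        return [{"iter": 1, "num_active_requests": 0}]

    async def fetch_stats_async(self):
        # Run the blocking fetch off the event loop, then serialize to JSON,
        # since IterationStats-like objects are not picklable over RPC.
        stats = await asyncio.to_thread(self.fetch_stats)
        return [json.dumps(s) for s in stats]

class ToyProxy:
    """Stand-in for the proxy holding an RPC client connection."""

    def __init__(self, worker):
        self.worker = worker

    def get_stats(self, timeout=2.0):
        # On-demand fetch; return [] on failure instead of raising.
        try:
            return asyncio.run(
                asyncio.wait_for(self.worker.fetch_stats_async(), timeout))
        except Exception:
            return []

proxy = ToyProxy(ToyWorker())
stats = [json.loads(s) for s in proxy.get_stats()]
print(stats[0]["iter"])
```

The key point of the refactor is visible here: nothing streams until a caller asks, and the wire format is plain JSON strings.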
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (11)
tests/unittest/llmapi/test_llm_pytorch.py (3)
1010-1040: Duplicate import inside async test function.

Line 1014 imports `json` inside the test function, but `json` is already imported at the module level (line 1). Remove the redundant import.

```diff
 @skip_ray
 @pytest.mark.asyncio
 async def test_llm_rpc_get_stats_async():
     """Test that get_stats_async works with RPC orchestrator."""
-    import json
-
     with LLM(model=llama_model_path,
```
1082-1095: Fix implicit Optional and consider extracting helper to reduce duplication.

- Per PEP 484, `expected_len: int = None` should be `expected_len: Optional[int] = None`.
- This helper is duplicated at line 1173. Consider extracting it to module level or a shared utility.

Based on static analysis hint (RUF013).

```diff
-    def get_stats_with_wait(llm, timeout: float = 2, expected_len: int = None):
+    def get_stats_with_wait(llm, timeout: float = 2, expected_len: Optional[int] = None):
```
1173-1186: Duplicate helper function - same issues as line 1082.

This is an exact duplicate of the helper defined at lines 1082-1095. Apply the same `Optional[int]` fix. Consider refactoring both to a shared module-level helper to reduce duplication.

Based on static analysis hint (RUF013).

```diff
-    def get_stats_with_wait(llm, timeout: float = 2, expected_len: int = None):
+    def get_stats_with_wait(llm, timeout: float = 2, expected_len: Optional[int] = None):
```

tests/unittest/llmapi/test_llm.py (2)
2178-2191: Polling helper implementation looks correct.

The helper properly implements timeout-based polling for stats retrieval with JSON parsing. However, this same helper is duplicated in `test_llm_pytorch.py` and again at line 2267. Consider extracting to a shared test utility module.

2267-2280: Duplicate helper function.

This is identical to the helper at lines 2178-2191. Consider extracting to a shared location to reduce maintenance burden.
tests/unittest/llmapi/test_executor.py (1)
230-242: Move `time` import to module level and verify assertion stability.

- The `import time` at line 231 should be moved to the module level for consistency and performance.
- The assertion `stats["iter"] == 1` at line 242 might be fragile depending on timing. Consider using `>= 1` for robustness, similar to the async test above which uses `>= 0`.

```diff
+import time  # Add at module level (near line 4)
 ...
     # Poll for stats since RPC calls return immediately
-    import time
     stats_list = []
     for _ in range(10):
         stats_list = executor.get_stats(timeout=0.5)
         if stats_list:
             break
         time.sleep(0.1)
     assert len(stats_list) > 0
     stats = json.loads(stats_list[0]) if isinstance(stats_list[0], str) else stats_list[0]
-    assert stats["iter"] == 1
+    assert stats["iter"] >= 1
```

tensorrt_llm/executor/proxy.py (2)
364-383: Consider narrowing exception type or documenting rationale.

The broad `Exception` catch is flagged by static analysis (BLE001). While catching broad exceptions may be intentional for RPC robustness (network errors, timeouts, etc.), consider either:

- Catching more specific exceptions (e.g., `(TimeoutError, ConnectionError, RuntimeError)`)
- Adding a brief comment explaining why broad exception handling is needed

The debug-level logging is appropriate for transient RPC failures.

```diff
         try:
             # Stats are already serialized by the worker's fetch_stats_async()
             stats = self.rpc_client.fetch_stats_async().remote(timeout=timeout)
             return stats
-        except Exception as e:
+        except Exception as e:  # Broad catch for RPC robustness (network, timeout, etc.)
             logger.debug(f"Error fetching stats via RPC: {e}")
             return []
```
411-431: Same exception handling pattern as `get_stats`.

Consider applying the same comment or narrowing suggestion as for `get_stats` above for consistency.

tensorrt_llm/executor/rpc_proxy.py (2)
101-125: Consider documenting the synchronous fetch behavior in async method.

The `aget_stats` method performs a synchronous `get_stats()` call (line 120) and then populates the queue for async iteration. This differs from true asynchronous streaming where data arrives incrementally. While this may be intentional for the RPC-based on-demand fetch model, consider documenting this behavior in the docstring to set correct expectations for callers.

145-167: Consider documenting the synchronous fetch behavior in async method.

Similar to `aget_stats`, the `aget_kv_events` method performs a synchronous `get_kv_events()` call (line 162) and then populates the queue. Consider documenting this batch-fetch-then-stream behavior in the docstring.

tensorrt_llm/executor/rpc_worker_mixin.py (1)
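The batch-fetch-then-stream pattern described above can be illustrated in isolation. `fetch_fn` is a placeholder for the proxy's synchronous `get_stats()`; the queue-and-sentinel shape is an assumption about the general pattern, not a copy of the PR's implementation:

```python
import asyncio

async def aget_stats(fetch_fn, timeout=2.0):
    """Batch-fetch then stream: one synchronous fetch up front, results
    replayed through a queue so callers can `async for` over them."""
    queue = asyncio.Queue()
    # Single on-demand fetch; nothing arrives incrementally after this.
    for item in fetch_fn(timeout=timeout):
        queue.put_nowait(item)
    queue.put_nowait(None)  # sentinel: no more items in this batch
    while True:
        item = await queue.get()
        if item is None:
            break
        yield item

async def main():
    fake_fetch = lambda timeout: ['{"iter": 1}', '{"iter": 2}']
    return [s async for s in aget_stats(fake_fetch)]

print(asyncio.run(main()))
```

A docstring on the real methods saying exactly this ("one fetch, replayed asynchronously") would prevent callers from expecting a live stream.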
112-120: Remove or document unused timeout parameter.

The `timeout` parameter is declared but not used in either `fetch_stats_async` or `fetch_kv_cache_events_async`. If it's intended for future use or API consistency with `fetch_responses_async`, add a docstring note. Otherwise, remove it to avoid confusion.

Apply this diff if removing:

```diff
-    async def fetch_stats_async(self, timeout: Optional[float] = None) -> list:
+    async def fetch_stats_async(self) -> list:
         """Async version of fetch_stats using asyncio.to_thread.

         This method is exposed via RPC and can be called directly by the proxy.
         Returns serialized stats (JSON strings) that can be sent over RPC.
         """
         stats = await asyncio.to_thread(self.fetch_stats)
         # Serialize stats before sending over RPC (IterationStats objects are not picklable)
         return [self._stats_serializer(s) for s in stats]

-    async def fetch_kv_cache_events_async(self, timeout: Optional[float] = None) -> list:
+    async def fetch_kv_cache_events_async(self) -> list:
         """Async version of fetch_kv_cache_events using asyncio.to_thread.

         This method is exposed via RPC and can be called directly by the proxy.
         Returns serialized events (JSON strings) that can be sent over RPC.
         """
         events = await asyncio.to_thread(self.fetch_kv_cache_events)
         # Serialize events before sending over RPC
         return [self._kv_cache_events_serializer(e) for e in events]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/executor/proxy.py (7 hunks)
- tensorrt_llm/executor/rpc_proxy.py (2 hunks)
- tensorrt_llm/executor/rpc_proxy_mixin.py (3 hunks)
- tensorrt_llm/executor/rpc_worker_mixin.py (1 hunk)
- tensorrt_llm/executor/worker.py (9 hunks)
- tests/unittest/llmapi/test_executor.py (1 hunk)
- tests/unittest/llmapi/test_llm.py (5 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
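Several of the guidelines above can be shown in one small snippet. This is an illustrative sketch (the function and data are invented, not from the PR): namespace-preserving imports, snake_case/upper-snake-case naming, a Google-style docstring, and a narrow `except` with the success logic in the `else` block:

```python
import json
from collections import abc

MY_TIMEOUT_S = 2.0  # constant: upper snake_case

def parse_stats(raw_items):
    """Parse raw stats entries into dictionaries.

    Args:
        raw_items: Iterable of JSON strings or already-parsed mappings.

    Returns:
        A list of dicts, silently skipping entries that fail to parse.
    """
    parsed = []
    for item in raw_items:
        if isinstance(item, abc.Mapping):
            parsed.append(dict(item))
            continue
        try:
            # Keep the try body as small as possible (duck-typing rule).
            obj = json.loads(item)
        except json.JSONDecodeError:
            continue
        else:
            parsed.append(obj)
    return parsed

print(parse_stats(['{"iter": 1}', 'not json', {"iter": 2}]))
```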
Files:

- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/executor/rpc_proxy_mixin.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/rpc_worker_mixin.py
- tensorrt_llm/executor/rpc_proxy.py
- tensorrt_llm/executor/worker.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_executor.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:

- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/executor/rpc_proxy_mixin.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/rpc_worker_mixin.py
- tensorrt_llm/executor/rpc_proxy.py
- tensorrt_llm/executor/worker.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_executor.py
🧠 Learnings (2)
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/rpc_worker_mixin.pytensorrt_llm/executor/worker.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code graph analysis (8)
tests/unittest/llmapi/test_llm.py (3)

- tensorrt_llm/llmapi/llm.py (3): `LLM` (1139-1155), `get_stats` (522-533), `generate` (262-344)
- tensorrt_llm/executor/proxy.py (1): `get_stats` (364-383)
- tensorrt_llm/executor/rpc_proxy.py (1): `get_stats` (84-99)

tensorrt_llm/executor/rpc_proxy_mixin.py (1)

- tensorrt_llm/llmapi/utils.py (1): `_SyncQueue` (450-517)

tensorrt_llm/executor/proxy.py (4)

- tensorrt_llm/executor/rpc/rpc_client.py (1): `RPCClient` (80-720)
- tensorrt_llm/executor/rpc/rpc_common.py (1): `get_unique_ipc_addr` (9-16)
- tensorrt_llm/llmapi/llm.py (1): `get_stats` (522-533)
- tensorrt_llm/executor/result.py (2): `IterationResult` (824-874), `set_timeout` (840-841)

tensorrt_llm/executor/rpc_worker_mixin.py (1)

- tensorrt_llm/executor/base_worker.py (4): `fetch_stats` (299-306), `_stats_serializer` (652-664), `fetch_kv_cache_events` (308-312), `_kv_cache_events_serializer` (668-670)

tensorrt_llm/executor/rpc_proxy.py (4)

- tensorrt_llm/executor/result.py (3): `result` (750-761), `IterationResult` (824-874), `set_timeout` (840-841)
- tensorrt_llm/llmapi/llm.py (1): `get_stats` (522-533)
- examples/apps/fastapi_server.py (1): `stats` (47-49)
- tensorrt_llm/executor/rpc_worker_mixin.py (2): `fetch_stats_async` (102-110), `fetch_kv_cache_events_async` (112-120)

tensorrt_llm/executor/worker.py (2)

- tensorrt_llm/executor/base_worker.py (2): `start` (364-366), `_set_iteration_result_queue` (324-330)
- tensorrt_llm/executor/rpc_worker_mixin.py (2): `RpcWorkerMixin` (12-124), `start_rpc_server` (41-47)

tests/unittest/llmapi/test_llm_pytorch.py (2)

- tensorrt_llm/executor/rpc_proxy.py (2): `GenerationExecutorRpcProxy` (15-236), `get_stats` (84-99)
- tests/unittest/llmapi/test_llm.py (2): `get_stats_with_wait` (2178-2191), `get_stats_with_wait` (2267-2280)

tests/unittest/llmapi/test_executor.py (4)

- tensorrt_llm/executor/proxy.py (2): `aget_stats` (385-409), `get_stats` (364-383)
- tensorrt_llm/executor/rpc_proxy.py (2): `aget_stats` (101-125), `get_stats` (84-99)
- examples/apps/fastapi_server.py (1): `stats` (47-49)
- tensorrt_llm/llmapi/utils.py (2): `AsyncQueue` (363-447), `EventLoopShutdownError` (369-370)
🪛 Ruff (0.14.8)
tensorrt_llm/executor/proxy.py
380-380: Consider moving this statement to an else block
(TRY300)
381-381: Do not catch blind exception: Exception
(BLE001)
428-428: Consider moving this statement to an else block
(TRY300)
429-429: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/executor/rpc_worker_mixin.py
112-112: Unused method argument: timeout
(ARG002)
tensorrt_llm/executor/rpc_proxy.py
96-96: Consider moving this statement to an else block
(TRY300)
97-97: Do not catch blind exception: Exception
(BLE001)
140-140: Consider moving this statement to an else block
(TRY300)
141-141: Do not catch blind exception: Exception
(BLE001)
tests/unittest/llmapi/test_llm_pytorch.py
1082-1082: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1173-1173: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (25)
tensorrt_llm/executor/rpc_proxy_mixin.py (3)
9-9: LGTM - Import aligned with usage.

The `_SyncQueue` import is correctly used in the `handle_responses` method for queue type checking and batch notification.

49-53: Good documentation of architectural change.

The docstring clearly explains that stats and kv_events are now fetched on-demand via direct RPC calls rather than streaming loops, which helps future maintainers understand the design decision.

168-170: Helpful contextual comment for removed functionality.

The NOTE comment provides good context explaining why the streaming loops were removed and where the functionality now resides.
tests/unittest/llmapi/test_llm_pytorch.py (2)
1-1: LGTM - Import added for JSON parsing.

The `json` import is correctly added for parsing stats JSON strings in the new RPC stats tests.

980-1007: Well-structured test for synchronous RPC stats retrieval.

The test properly validates RPC-based stats retrieval with appropriate polling, assertions for required fields, and JSON parsing. Good coverage for the new `get_stats` API.

tests/unittest/llmapi/test_llm.py (4)
2211-2211: LGTM - Updated to use polling helper.

Correctly replaced the direct `get_stats` call with the polling helper to accommodate RPC-based retrieval semantics.

2221-2221: LGTM - Polling with short timeout for negative test.

Appropriately uses a short timeout (0.5s) to verify no additional stats are available.

2225-2225: LGTM - Verifies stats availability after generation.

Correctly tests that stats become available after a new generation completes.

2289-2289: LGTM - Updated to use polling helper.

Correctly uses the polling helper for stats retrieval in the queued stats test.
tests/unittest/llmapi/test_executor.py (1)
216-224: LGTM - Async stats retrieval updated correctly.

The test properly uses the new `aget_stats(timeout)` API, iterates over the async iterable, validates stats fields, and breaks after the first result. Good handling of the IterationResult pattern.

tensorrt_llm/executor/proxy.py (8)
3-3: LGTM - New imports for RPC functionality.

The imports are correctly added for the RPC client, unique IPC address generation, and the `os` module (for `os.urandom`).

Also applies to: 25-29

94-103: LGTM - RPC address and key initialization.

Properly generates a unique IPC address and a cryptographically secure HMAC key for RPC communication. Passing these to worker_kwargs ensures the worker can start the RPC server at the correct address.

109-113: LGTM - RPC client creation after worker startup.

The RPC client is correctly created after workers are started, ensuring the RPC server is ready before the client attempts to connect.

141-147: LGTM - Stats and KV events queues disabled.

Setting `stats_queue_addr` and `kv_cache_events_queue_addr` to `None` correctly disables the IPC-based stats and KV events channels in favor of RPC-based retrieval.

211-212: Helpful comment documenting removed functionality.

Good documentation explaining why the dispatch tasks were removed.

318-321: LGTM - Proper RPC client cleanup on shutdown.

The RPC client is properly closed and set to None during shutdown, ensuring resources are released.

385-409: LGTM - Async stats retrieval implementation.

The `aget_stats` method correctly:

- Initializes iteration results if needed
- Returns an empty async iterable if the stats result is not available
- Fetches stats via RPC and populates the queue
- Sets the timeout on the result

Good defensive programming with the None check.

433-455: LGTM - Async KV events retrieval implementation.

Mirrors the `aget_stats` implementation correctly for KV events. Consistent pattern and proper handling.

tensorrt_llm/executor/worker.py (4)
6-6: LGTM! Clean integration of RPC mixin.

The import additions and class hierarchy changes appropriately integrate RPC functionality. The method resolution order (MRO) with `RpcWorkerMixin` before `BaseWorker` ensures RPC-specific methods are prioritized correctly.

Also applies to: 19-19, 26-27, 34-34

82-84: LGTM! Comment accurately reflects the architectural change.

The updated comment correctly documents that stats and KV events are now fetched on-demand via RPC rather than through continuous dispatch threads.

152-153: LGTM! Solid conditional logic for hybrid IPC/RPC support.

The changes properly support both IPC-based and RPC-based operation paths:

- Parameters are correctly propagated through `worker_main` to the worker constructor
- IPC queues for stats and KV events are created only when addresses are provided
- Conditional signaling prevents errors when queues don't exist
- Iteration result queues are initialized only for IPC paths

This allows the system to operate in RPC-only mode when IPC addresses are not provided.

Also applies to: 201-214, 241-245, 291-293, 319-325

46-47: Rank initialization is properly ordered.

The `super().__init__()` call at line 49 completes before `start_rpc_server()` is invoked at line 67. Since `BaseWorker.__init__()` initializes `self.rank` via `mpi_rank()` before returning, `self.rank` is safely available when `start_rpc_server()` checks `if self.rank == 0`. The partial RPC initialization pattern is safe.

tensorrt_llm/executor/rpc_proxy.py (1)
2-2: LGTM! Clean import additions and accurate docstring update.

The imports support the new RPC-based stats/events retrieval functionality, and the docstring correctly documents the architectural shift from streaming loops to on-demand RPC calls.
Also applies to: 5-5, 9-9, 73-77
tensorrt_llm/executor/rpc_worker_mixin.py (2)
102-110: LGTM! Proper serialization for RPC transmission.

The method now correctly serializes `IterationStats` objects into JSON strings before RPC transmission, as these objects are not picklable. The docstring clearly documents this RPC-exposed behavior.

122-124: LGTM! Clear documentation of architectural change.

The note effectively documents the removal of continuous streaming loops in favor of on-demand RPC fetching, helping future maintainers understand the design decision.
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
…VIDIA#9980) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Summary by CodeRabbit
Release Notes
New Features
- Statistics and KV cache events can now be retrieved on demand through both synchronous (`get_stats`, `get_kv_events`) and asynchronous (`aget_stats`, `aget_kv_events`) interfaces.

Refactor
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
run

`/bot run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
kill

`kill`

Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
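As a concrete illustration, options from the `run` subcommand above can be combined in a single comment (the stage and GPU names here are the examples from the help text, not real pipeline requirements):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30"
```

Each new `/bot run` kills any previously running jobs for the pull request, so only the latest invocation's options apply.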