
@WuSiYu (Collaborator) commented Nov 20, 2025

Use --enable_profiling=MODE to enable profiling; currently supported modes are torch_profiler and nvtx (for use with NVIDIA Nsight Systems).
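A minimal usage sketch follows. Only the --enable_profiling flag and the /profiler_start and /profiler_stop endpoints come from this PR; the launch command, model path, host, and port are assumptions for illustration and should be adjusted to your deployment.

# Launch the server with profiling support enabled (launch command and model path are
# placeholders):
#   python -m lightllm.server.api_server --model_dir /path/to/model --enable_profiling=torch_profiler

import requests

BASE = "http://127.0.0.1:8000"  # assumed address/port of the LightLLM HTTP server

requests.get(f"{BASE}/profiler_start")  # begin a profiling session across all processes
# ... send some inference requests here to capture a representative trace ...
requests.get(f"{BASE}/profiler_stop")   # stop profiling and flush the collected data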

@gemini-code-assist (bot) commented

Summary of Changes

Hello @WuSiYu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances LightLLM's observability by adding comprehensive profiling capabilities. It allows developers to easily enable and control performance profiling across the entire distributed system, supporting both PyTorch's native profiler and NVIDIA NVTX. This feature is crucial for identifying performance bottlenecks and optimizing the execution flow of the LLM inference server.

Highlights

  • New Profiling Module: Introduced a new lightllm/utils/profiler.py module containing ProfilerCmd and ProcessProfiler classes to centralize profiling logic for torch_profiler and nvtx (see the sketch after this list).
  • Command-Line Argument for Profiling: Added a --enable_profiling command-line argument to api_cli.py allowing users to specify torch_profiler or nvtx modes for system-wide profiling.
  • HTTP API for Dynamic Profiling Control: Implemented new HTTP GET endpoints, /profiler_start and /profiler_stop, in api_http.py to enable dynamic control over profiling sessions during runtime.
  • Distributed Profiling Command Propagation: Integrated profiling command propagation across various LightLLM components, including the HTTP server manager, router manager, visual server manager, and visual model RPCs, ensuring profiling commands are distributed throughout the system.
  • Profiler Integration in Core Components: Initialized ProcessProfiler instances within the router, visual server, and model backend components, enabling them to respond to profiling commands and manage their respective profiling activities.
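
As a rough orientation for the torch_profiler mode, here is a minimal sketch of what a start/stop wrapper around torch.profiler typically looks like. It is illustrative only and is not the PR's actual ProcessProfiler implementation; the class name and trace-file naming are assumptions.

import torch
from torch.profiler import profile, ProfilerActivity

class TorchProfilerWrapper:
    """Illustrative start/stop wrapper around torch.profiler (not the PR's code)."""

    def __init__(self, name: str):
        self.name = name
        self._prof = None

    def start(self) -> None:
        # Record both CPU and CUDA activity so kernel launches line up with host code.
        self._prof = profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA])
        self._prof.start()

    def stop(self) -> None:
        torch.cuda.synchronize()  # make sure queued kernels are included in the trace
        self._prof.stop()
        # Export a Chrome-trace file that can be opened in chrome://tracing or Perfetto.
        self._prof.export_chrome_trace(f"{self.name}.json")
        self._prof = None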

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces profiler support, which is a great feature for debugging and performance analysis. The changes span across the API layer, server managers, and model backends to enable torch_profiler and nvtx profiling modes. The implementation is comprehensive, adding new CLI arguments, API endpoints, and internal command propagation.

My review focuses on ensuring correctness and maintainability. I've identified a potential race condition in the model backend, a few typos in the CLI arguments and help text that could affect functionality and clarity, and some minor inconsistencies in API responses and type hints. I've also pointed out a busy-wait loop that could be optimized. Addressing these points will make the new profiler feature more robust and easier to maintain.

    mode=self.args.enable_profiling,
    name=f"lightllm-model_backend-node{self.node_rank}_dev{get_current_device_id()}",
)
self.profiling_active = False
Severity: high

The self.profiling_active attribute is accessed by multiple threads (infer_loop_thread and infer_loop_thread1) without a lock, which can lead to a race condition. One thread might read an outdated value while another is writing to it. You should introduce a threading.Lock to protect access to this attribute.

Initialize it in init_model:

self.profiling_active = False
self.profiling_lock = threading.Lock()

Then use it when accessing self.profiling_active in _try_read_new_reqs and _read_reqs_buffer_and_init_reqs:

In _try_read_new_reqs:

with self.profiling_lock:
    if self.profiler.is_active != self.profiling_active:
        if self.profiling_active:
            self.profiler.start()
        else:
            self.profiler.stop()

In _read_reqs_buffer_and_init_reqs:

with self.profiling_lock:
    if obj.cmd == "start":
        self.profiling_active = True
    elif obj.cmd == "stop":
        self.profiling_active = False

if self.mode == "torch_profiler":
    self._torch_profiler_start()
elif self.mode == "nvtx":
    self._nxtx_start()
Severity: high

There's a typo in the method name _nxtx_start. It should be _nvtx_start to match the NVTX profiling mode.

Suggested change:
- self._nxtx_start()
+ self._nvtx_start()

        self._torch_profiler.start()
        torch.cuda.synchronize()

    def _nxtx_start(self) -> None:
Severity: high

There's a typo in the method name _nxtx_start. It should be _nvtx_start to match the NVTX profiling mode. This typo will cause a crash when using the nvtx profiler mode. This also needs to be corrected where it's called.

Suggested change:
- def _nxtx_start(self) -> None:
+ def _nvtx_start(self) -> None:

async def profiler_stop() -> Response:
    if g_objs.args.enable_profiling:
        await g_objs.httpserver_manager.profiler_cmd("stop")
        return {"status": "ok"}
Severity: medium

For consistency, it's better to explicitly use JSONResponse for success responses, just as you do for error responses.

Suggested change:
- return {"status": "ok"}
+ return JSONResponse({"status": "ok"})

        await g_objs.httpserver_manager.profiler_cmd("stop")
        return {"status": "ok"}
    else:
        return JSONResponse({"message": "Profiling support not enabled"}, status_code=500)
Severity: medium

Returning a 500 Internal Server Error for a disabled feature can be misleading. A 400 Bad Request or 404 Not Found would be more appropriate, since an unavailable feature is not a server failure.

Suggested change:
- return JSONResponse({"message": "Profiling support not enabled"}, status_code=500)
+ return JSONResponse({"message": "Profiling support not enabled"}, status_code=400)


@dataclass
class ProfilerCmd:
    cmd: str  # "start" or "stop" for now
Severity: medium

The type hint for cmd is str, but the comment indicates it only accepts 'start' or 'stop'. Using typing.Literal would make this more explicit and type-safe.

Suggested change:
- cmd: str  # "start" or "stop" for now
+ cmd: Literal["start", "stop"]  # "start" or "stop" for now
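
Applying that suggestion, the dataclass would read roughly as follows (a sketch; the only additional change needed is the typing.Literal import):

from dataclasses import dataclass
from typing import Literal

@dataclass
class ProfilerCmd:
    cmd: Literal["start", "stop"]  # restricts the command to the two values currently supported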

    )
elif self.mode == "nvtx":
    self._nvtx_toplevel_mark = "LIGHTLLM_PROFILE"
    self._nvtx_toplevel_id = None
Severity: medium

The instance variable self._nvtx_toplevel_id is initialized but never used. It can be safely removed to improve code clarity.

        self._nvtx_toplevel_mark,
    )
elif self.mode is not None:
    assert False, "invalid profiler mode"
Severity: medium

assert False statements are removed when Python is run with the -O (optimize) flag, which can happen in production environments. It's better to raise an exception like ValueError or NotImplementedError for invalid profiler modes to ensure the error is always caught.

Suggested change:
- assert False, "invalid profiler mode"
+ raise ValueError("invalid profiler mode")

        await g_objs.httpserver_manager.profiler_cmd("start")
        return {"status": "ok"}
    else:
        return JSONResponse({"message": "Profiling support not enabled"}, status_code=500)
Severity: medium

Returning a 500 Internal Server Error for a disabled feature can be misleading. A 400 Bad Request or 404 Not Found would be more appropriate, since an unavailable feature is not a server failure.

Suggested change:
- return JSONResponse({"message": "Profiling support not enabled"}, status_code=500)
+ return JSONResponse({"message": "Profiling support not enabled"}, status_code=400)

async def profiler_start() -> Response:
    if g_objs.args.enable_profiling:
        await g_objs.httpserver_manager.profiler_cmd("start")
        return {"status": "ok"}
Severity: medium

For consistency, it's better to explicitly use JSONResponse for success responses, just as you do for error responses.

Suggested change:
- return {"status": "ok"}
+ return JSONResponse({"status": "ok"})
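
Taken together with the status-code suggestion above, the profiler_start handler would look roughly like this (a sketch only; profiler_stop would mirror it, and the FastAPI imports may already exist in api_http.py):

from fastapi.responses import JSONResponse, Response

async def profiler_start() -> Response:
    # g_objs comes from the surrounding api_http.py module, as in the snippet above.
    if g_objs.args.enable_profiling:
        await g_objs.httpserver_manager.profiler_cmd("start")
        return JSONResponse({"status": "ok"})
    else:
        return JSONResponse({"message": "Profiling support not enabled"}, status_code=400)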
