[Core] Avoid one broadcast op when propagating metadata #4844

njhill · 2024-05-16T00:46:32Z

Alternative approach to #4757.

This PR does not include the changes to broadcast only values or to compress the tensor metadata. For the former, it would be simplest to just change the API to broadcast lists rather than dicts imo (update: proposed changes for that are in njhill#2).

rkooo567

QQ: can you compare the overhead with the other approach at TP=8?

njhill · 2024-05-20T15:46:10Z

@rkooo567 @youkaichao I haven't tried with TP=8 yet, but for TP=2 with llama-2-7b this change gives a ~4-5% improvement in end-to-end latency. I expect this will be additive to the benefits from #4894, and I think we can further reduce the number of broadcast ops too.

I didn't compare with #4757, if that's what you mean, since I expect the two to be equivalent from a perf pov: in both cases the improvement comes from avoiding the additional sizing broadcast op done by torch.distributed.broadcast_object_list.
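To illustrate the point about the sizing broadcast: torch.distributed.broadcast_object_list pickles the objects and sends the sizes in one broadcast, then the data in a second. One way to avoid the first op is to agree on a buffer size up front and carry the payload length inside the buffer itself. The following is a minimal standard-library sketch of that packing idea only, not the code in this PR; `pack_fixed`, `unpack_fixed`, and the 4-byte header layout are assumptions made for illustration.

```python
import pickle
import struct

# Hypothetical layout: a 4-byte little-endian length header, then the pickle.
HEADER = struct.Struct("<I")


def pack_fixed(obj, buffer_size):
    """Serialize obj into a buffer whose size every rank already knows.

    Because the size is fixed in advance, a receiver can allocate the
    buffer without a separate broadcast announcing the payload size.
    """
    payload = pickle.dumps(obj)
    needed = HEADER.size + len(payload)
    if needed > buffer_size:
        raise ValueError(f"payload needs {needed} bytes, buffer has {buffer_size}")
    buf = bytearray(buffer_size)  # zero-filled; the tail is padding
    HEADER.pack_into(buf, 0, len(payload))
    buf[HEADER.size:needed] = payload
    return buf


def unpack_fixed(buf):
    """Recover the object by reading the length header, ignoring padding."""
    (length,) = HEADER.unpack_from(buf, 0)
    return pickle.loads(bytes(buf[HEADER.size:HEADER.size + length]))
```

In a real setup the returned buffer would be copied into a uint8 tensor and sent with a single broadcast; the trade-off is that the buffer must be sized generously enough for the largest expected metadata.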

@youkaichao I did also make an update on top of this to send lists rather than dicts. The measured benefit of that is negligible but I do still think it's worthwhile. To avoid complicating this PR I've pushed it to a separate one for now: njhill#2
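As a rough illustration of the lists-rather-than-dicts idea: if sender and receivers agree on the key order ahead of time, only the values need to travel. This is a sketch under that assumption, not the change proposed in njhill#2; the key names and the helpers `to_wire` and `from_wire` are hypothetical.

```python
import pickle

# Hypothetical key order, assumed to be known identically on every rank.
KEYS = ("input_tokens", "input_positions", "num_prefills")


def to_wire(metadata):
    """Drop the keys and keep only the values, in the agreed order."""
    return [metadata[key] for key in KEYS]


def from_wire(values):
    """Rebuild the dict on the receiving side from the agreed key order."""
    return dict(zip(KEYS, values))


def wire_sizes(metadata):
    """Return (dict_bytes, list_bytes) so the saving can be inspected."""
    return len(pickle.dumps(metadata)), len(pickle.dumps(to_wire(metadata)))
```

The list form pickles to fewer bytes because the key strings are not serialized, which is consistent with the saving being small for a handful of short keys.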

youkaichao · 2024-05-20T17:38:16Z

vllm/worker/model_runner.py

+        # Register call site for sizing broadcast buffer.
+        self.prepare_input_tensors_callsite_id = register_broadcast_callsite()


This code is too intrusive. I thought you could extract the call site information automatically from the function's call stack.

njhill · May 20, 2024

@youkaichao yes this is possible but a little hacky imo, and there's a non-negligible cost to interrogating the stack (including file access I think).

IMHO this is minimally intrusive, especially since it's not a public API, it's not used from many places, and no other class needs to be updated/kept in sync per call site, as is the case with the original approach.

In fact we could just remove this so that we aren't sizing per call site. Empirically, the latency difference due to the tensor size differences in question here isn't measurable, so per-call-site sizing could be premature optimization in any case, especially if we combine the two call sites into one.
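For readers weighing the two options discussed in this thread, here is a hedged sketch contrasting an explicitly registered call site ID with one derived from the call stack. It is not the PR's `register_broadcast_callsite` implementation, which is not shown here; `register_callsite_explicit` and `callsite_key_from_stack` are hypothetical names.

```python
import inspect
import itertools

# Hypothetical counter handing out one integer ID per registration.
_next_id = itertools.count()


def register_callsite_explicit():
    """Explicit option: the caller registers once and stores the ID."""
    return next(_next_id)


def callsite_key_from_stack():
    """Automatic option: derive a key from the caller's file and line.

    inspect.stack() builds a record for every frame and by default tries
    to look up source context, so doing this on every call has a cost.
    """
    caller = inspect.stack()[1]
    return (caller.filename, caller.lineno)
```

The explicit form costs one extra line at each call site but does no work per call; the automatic form needs no changes at the call site but walks the stack every time it is invoked.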

njhill force-pushed the single-broadcast branch 4 times, most recently from 1e6ccaa to 2cb1a6f Compare May 16, 2024 03:12

[Core] Avoid one broadcast op when propagating metadata

e969cc9

njhill force-pushed the single-broadcast branch from 2cb1a6f to e969cc9 Compare May 16, 2024 19:23

njhill mentioned this pull request May 16, 2024

[Core][Distributed] add fast broadcast for tensor dict #4757

Open

1 task

njhill requested a review from youkaichao May 16, 2024 21:19

rkooo567 reviewed May 17, 2024

njhill mentioned this pull request May 20, 2024

Use lists for broadcast njhill/vllm#2

Open

njhill mentioned this pull request May 20, 2024

[Core] Eliminate parallel worker per-step task scheduling overhead #4894

Merged

njhill marked this pull request as ready for review May 20, 2024 16:40

youkaichao reviewed May 20, 2024
