Refactor code according to upstream changes #62
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2e1758d to
3aa8196
Compare
|
Still needs some work; moved back to draft. |
### sequence level processing -> batch level processing In this PR the code for preparing the input tensors for the AIU is completely rewritten based on the assumption that we have to finish the current decoding on AIU before doing another prefill. Changes: * [rewriting](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/cea122c220b18e3de3dce95faa5e03fe3efe0835) `sendnn_model_runner.py`, `sendnn_worker.py` and `sendnn.py` based on the above constraint. * [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/6869231d83734d3c03ffd15bc6754c1857d063cc) class variable `self._padded_batch_size` since other solution implemented * [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/ff9ebf6923fd9ac6c99e64dfffc7763f6c194399) the unused `input_block_ids` since AIU does not support paged attention yet. * [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/a6d63899bf3d9fae59edde414b8bd2a3c56bc8c7) some unused function arguments in model loading * [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/4527300ee9be4dd1fb76007fb6e0862b97d51676) unused function _get_model_architecture() and global variable `_SENDNN_SUPPORTED_MODELS` The code has been tested in client/server mode for the `llama 194m` and `granite 3b` on `AIU` and `CPU`.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
All the changes make sense to me. I'll run one of the embedding benchmarks to validate the embedding model branch of the changes. |
|
I'm getting the same results:
|
yannicks1
left a comment
There was a problem hiding this comment.
great work! I added some minor comments, but LGTM overall.
Tested on CPU with inductor compilation using the tests in tests/spyre for llama194m and gpt 3b. On Spyre the online mode for grantite 3b, 8b and 13b have been tested successfully.
| self.is_driver_worker) | ||
| self._env_initialized = False | ||
|
|
||
| def init_distributed_environment(self) -> None: |
There was a problem hiding this comment.
Would it make sense to rename that function to better distinguish between the imported function init_distributed_environment from vllm.distributed and our own function self.init_distributed_environment defined in line 62?
There was a problem hiding this comment.
Yes, seems reasonable to me.
| @@ -86,11 +74,6 @@ def init_distributed_environment(self) -> None: | |||
| # A small all_reduce for warmup. | |||
| torch.distributed.all_reduce(torch.zeros(1).cpu()) | |||
There was a problem hiding this comment.
out of curiosity: why is that needed here?
There was a problem hiding this comment.
It probably isn't strictly needed, it's just like a verification check that the distributed setup it working.
@yannicks1 Yes please do that, as a separate PR. Thanks! |
|
okay, I will address the function renaming and argument pruning in a separate PR after we merged this PR. From my side this PR is ready to be merged! |
|
@tdoublep @sducouedic I addressed my to minor comments in this PR which should be merged BEFORE merging this PR. |
this PR improves code readability by - [commit 1](0c9bec0): removing the batch size argument in the load_model function [here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_worker.py#L135) since unused in `SpyreModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_model_runner.py#L104)) as well as in `SpyreEmbeddingModelRunner` ([here](https://github.com/IBM/vllm/blob/124f3a961d1a9ce2628c01fe56dfcc589a49c8dd/vllm/worker/spyre_embedding_model_runner.py#L52)). - -- _[commit 2](beadbd5): function renaming to better distinguish between the imported function `init_distributed_environment` from `vllm.distributed` and our own function `self.init_distributed_environment`._ -- (**reverted since this seems to be a convention introduced in other worker classes**) --------- Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
857b127 to
85a8d55
Compare
This PR reworks our code according to some important upstream changes. In particular, there is no longer any need to have a separate
SpyreExecutorandMultiprocessingSpyreExecutor. Upstream has added generic classes for this that work across different platforms. Acutally, it simplifies our code quite a lot.The model runner classes now inherit from
ModelRunnerBaseand we need to define aModelInputForSpyreclass accordingly.This is current passing all CPU tests, but needs to be tested on Spyre and needs careful review since it quite a big change.
Note: the target for this PR is a branch
upstream-2025-01-17containing upstream changes merged into our current branch. I've done it like this so it is easier to review the changes. If this PR is approved, we can then merge the changes intoupstream-2025-01-17and then merge that one into main.