support longcat-image block offload with 2 mgr #977
Conversation
Code Review
This pull request introduces CPU offloading capabilities for the LongCat Image model, specifically implementing block-level offloading. A new LongCatImageOffloadTransformerInfer class is added to manage asynchronous weight prefetching and double-buffering for transformer blocks. The LongCatImageTransformerInfer class is refactored to use a dispatching infer_func, and the main LongCatImageTransformerModel is updated to conditionally use the offload-enabled infer class and manage model/block movement between CPU and GPU. Weight classes (LongCatImageDoubleBlockWeights, LongCatImageSingleBlockWeights) are modified to support creating dedicated CUDA buffers for offloading. A new configuration file and a shell script are included to enable and demonstrate this feature. Feedback points out an unused self.block_idx attribute and a potential device mismatch error if an unsupported offload_granularity is configured.
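The asynchronous prefetching and double-buffering described above can be sketched in plain Python. Strings stand in for CUDA weight buffers, and the names (`BlockOffloadManager`, `prefetch`, `swap`, `run_blocks`) are illustrative, not the PR's actual API; in the real implementation the copies happen asynchronously on a side CUDA stream.

```python
# Minimal sketch of block-level double-buffered offloading: while block i
# computes on the active buffer, block i+1's weights are prefetched into
# the other buffer, so host-to-device copies overlap with compute.

class BlockOffloadManager:
    def __init__(self, num_buffers=2):
        self.buffers = [None] * num_buffers  # stand-ins for CUDA weight buffers
        self.active = 0                      # index of the buffer being computed on
        self.need_init_first_buffer = True

    def prefetch(self, block_weights):
        # Copy the next block's CPU weights into the inactive buffer
        # (on a separate stream in the real implementation).
        nxt = (self.active + 1) % len(self.buffers)
        self.buffers[nxt] = block_weights

    def swap(self):
        # Make the prefetched buffer the active one for the next block.
        self.active = (self.active + 1) % len(self.buffers)


def run_blocks(mgr, cpu_blocks):
    computed = []
    # The very first buffer must be filled synchronously, once per pass.
    if mgr.need_init_first_buffer:
        mgr.buffers[mgr.active] = cpu_blocks[0]
        mgr.need_init_first_buffer = False
    for idx in range(len(cpu_blocks)):
        if idx + 1 < len(cpu_blocks):
            mgr.prefetch(cpu_blocks[idx + 1])  # overlap copy with compute
        computed.append(mgr.buffers[mgr.active])  # "compute" on active buffer
        if idx + 1 < len(cpu_blocks):
            mgr.swap()
    return computed
```

If the buffering is correct, each block computes on exactly the weights that were staged for it, i.e. `run_blocks` visits the blocks in order.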
```python
current_stream = torch_device_module.current_stream()
self.offload_manager_double.compute_stream.wait_stream(current_stream)
for block_idx in range(len(blocks.double_blocks)):
    self.block_idx = block_idx
```
```python
if self.cpu_offload and self.offload_granularity == "block":
    self.transformer_infer_class = LongCatImageOffloadTransformerInfer
else:
    self.transformer_infer_class = LongCatImageTransformerInfer
```
The logic here only handles offload_granularity == "block". If cpu_offload is enabled but offload_granularity is set to something else (e.g., "phase"), it falls back to the base LongCatImageTransformerInfer. However, the infer method (lines 92-98) only handles "model" and "block" granularities. If a different granularity is provided, the weights will remain on CPU while computation is attempted on GPU, leading to a device mismatch error. Consider adding a check or defaulting to a supported mode.
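One way to implement the suggested guard is to validate `offload_granularity` up front and fail fast rather than silently falling back to the non-offload path. The helper below is a hedged sketch (the function and the set of supported values mirror the snippet and comment above, but are illustrative, not the PR's code):

```python
# Supported granularities per the review comment: "model" moves the whole
# model between CPU and GPU; "block" uses the double-buffered offload path.
SUPPORTED_OFFLOAD_GRANULARITIES = {"model", "block"}


def select_transformer_infer_class(cpu_offload: bool, offload_granularity: str) -> str:
    """Return the infer-class name to use, raising on unsupported configs."""
    if not cpu_offload:
        return "LongCatImageTransformerInfer"
    if offload_granularity not in SUPPORTED_OFFLOAD_GRANULARITIES:
        # Fail fast instead of leaving weights on CPU and hitting a
        # device-mismatch error deep inside the forward pass.
        raise ValueError(
            f"Unsupported offload_granularity {offload_granularity!r}; "
            f"expected one of {sorted(SUPPORTED_OFFLOAD_GRANULARITIES)}"
        )
    if offload_granularity == "block":
        return "LongCatImageOffloadTransformerInfer"
    return "LongCatImageTransformerInfer"  # "model": weights moved wholesale
```

Raising a `ValueError` at construction time surfaces the misconfiguration immediately, with a message naming the supported modes, instead of a confusing CUDA device-mismatch error at inference time.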
```python
for block_idx in range(len(blocks.double_blocks)):
    self.block_idx = block_idx

    if self.offload_manager_double.need_init_first_buffer:
```
I took a look, and it seems that if the buffer is re-initialized whenever block id == 0 at the start of each step, the `wait_stream` above becomes unnecessary and the result is still correct. My guess is that when the previous step ends, some `swap_blocks` copy has not yet finished before the next step begins, which is why the wait is needed. I don't think it noticeably affects speed though, so we can merge first.
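The ordering hazard the comment describes can be illustrated with a pure-Python analogy: a background thread stands in for the offload stream's in-flight `swap_blocks` copy, and an `Event.wait()` plays the role of `compute_stream.wait_stream(current_stream)`. This is a sketch of the synchronization idea only, not the PR's code:

```python
import threading


def prefetch_then_compute(use_wait: bool) -> str:
    """Simulate a compute step that may race with an async buffer swap."""
    buffer = {"weights": "stale"}   # buffer left over from the previous step
    copy_done = threading.Event()

    def swap_blocks():
        # Stand-in for the previous step's asynchronous weight copy.
        buffer["weights"] = "fresh"
        copy_done.set()

    copier = threading.Thread(target=swap_blocks)
    copier.start()
    if use_wait:
        # Analogous to compute_stream.wait_stream(current_stream):
        # order the pending copy before this step's compute.
        copy_done.wait()
    result = buffer["weights"]      # "compute" reads the buffer here
    copier.join()
    return result
```

With the wait in place the compute side always observes the freshly swapped weights; without it, the read races the copy and may see stale data, which matches the reviewer's hypothesis about why `wait_stream` is needed at step boundaries.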
pip install ruff pre-commit
Force-pushed from e3bcba7 to 5c52346
done
No description provided.