@mag1c-h commented Nov 29, 2025

Purpose

[bugfix] Batch transfer (trans) on CUDA using the SM path returned error 700 (cudaErrorIllegalAddress, an illegal memory access).

Modifications

No user-facing changes.

Test

Added unit test: ucm/shared/test/case/trans/trans_test.cc
Added example: ucm/shared/test/example/trans/trans_on_cuda_example.py

Running the example on an H20:

python3 ucm/shared/test/example/trans/trans_on_cuda_example.py
INFO 11-28 18:49:09 [__init__.py:244] Automatically detected platform cuda.
[2025-11-28 18:49:14] - ucm.integration.vllm.patch.apply_patch - INFO [apply_patch.py:106] All vLLM patches applied successfully for version 0.9.2
ucmtrans: fdc31dfc13d6e76987d4c55e7c1f3acfc1e83dd4-Debug
========>> Running in trans_with_ce:
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
make: (18432,) 2 [0. 0. 0. ... 0. 0. 0.]
cost: 0.025137220975011587s
bandwidth: 5.725257224856523GB/s
compare[1]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
compare[2]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]

========>> Running in trans_with_sm:
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
cost: 0.006038442254066467s
bandwidth: 23.83347392335862GB/s
compare[1]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
compare[2]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]

========>> Running in trans_with_ce_async:
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
cost: 0.024610714055597782s
bandwidth: 5.847739958900771GB/s
compare[1]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
compare[2]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]

========>> Running in trans_with_sm_async:
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
make: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
cost: 0.005752581637352705s
bandwidth: 25.017820706014273GB/s
compare[1]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]
compare[2]: (18432,) 2 [0. 1. 2. ... 0. 0. 0.]

========>> Running in trans_batch_with_ce:
make: (18432,) 2 [1023.    0.    0. ...    0.    0. 1023.]
make: (18432,) 2 [0. 0. 0. ... 0. 0. 0.]
cost: 0.025805798824876547s
bandwidth: 5.576926991357668GB/s

========>> Running in trans_batch_with_sm:
make: (18432,) 2 [1023.    0.    0. ...    0.    0. 1023.]
make: (18432,) 2 [4924.    0.    0. ...    0.    0. 4924.]
cost: 0.005966973956674337s
bandwidth: 24.11893483111688GB/s

========>> Running in trans_batch_with_ce_async:
make: (18432,) 2 [1023.    0.    0. ...    0.    0. 1023.]
make: (18432,) 2 [4924.    0.    0. ...    0.    0. 4924.]
cost: 0.025461182929575443s
bandwidth: 5.652410431913886GB/s

========>> Running in trans_batch_with_sm_async:
make: (18432,) 2 [1023.    0.    0. ...    0.    0. 1023.]
make: (18432,) 2 [4924.    0.    0. ...    0.    0. 4924.]
cost: 0.005798668134957552s
bandwidth: 24.81898474796118GB/s
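The logged numbers can be cross-checked offline. Below is a small sketch that is not part of the PR: the `runs` dict and the `payload_gb` / `sm_speedup` helpers are hypothetical names. It copies the cost/bandwidth pairs from the H20 log, multiplies them to recover the data volume each run moved, and divides CE cost by SM cost for each matching pair. It assumes the example script reports bandwidth as payload divided by cost; the log does not state this, but the product comes out the same (about 0.1439 GB) for every run, which is consistent with that reading.

```python
# Cost (seconds) and bandwidth (GB/s) copied verbatim from the H20 log above.
runs = {
    "trans_with_ce": (0.025137220975011587, 5.725257224856523),
    "trans_with_sm": (0.006038442254066467, 23.83347392335862),
    "trans_with_ce_async": (0.024610714055597782, 5.847739958900771),
    "trans_with_sm_async": (0.005752581637352705, 25.017820706014273),
    "trans_batch_with_ce": (0.025805798824876547, 5.576926991357668),
    "trans_batch_with_sm": (0.005966973956674337, 24.11893483111688),
    "trans_batch_with_ce_async": (0.025461182929575443, 5.652410431913886),
    "trans_batch_with_sm_async": (0.005798668134957552, 24.81898474796118),
}


def payload_gb(name):
    """Data volume implied by the log, assuming bandwidth = payload / cost."""
    cost, bandwidth = runs[name]
    return cost * bandwidth


def sm_speedup(ce_name):
    """How many times faster the SM variant ran than its CE counterpart."""
    sm_name = ce_name.replace("_ce", "_sm")
    return runs[ce_name][0] / runs[sm_name][0]


if __name__ == "__main__":
    for name in runs:
        print(f"{name:28s} payload={payload_gb(name):.4f} GB")
    for ce_name in [n for n in runs if "_ce" in n]:
        print(f"{ce_name:28s} SM speedup={sm_speedup(ce_name):.2f}x")
```

On these numbers every SM variant, batched or not and sync or async, comes out roughly 4x faster than its CE counterpart.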

@ygwpz merged commit 77f5090 into ModelEngine-Group:develop Nov 29, 2025
4 checks passed
@mag1c-h deleted the dev-trans branch November 29, 2025 04:00
mag1c-h added a commit that referenced this pull request Nov 29, 2025
cuda trans batch api bug fix

(cherry picked from commit 77f5090)