[Mlas] optimize MlasConv using thread partition opt #25255
Description

This PR enhances `MlasConv` in ONNX Runtime by introducing a thread partitioning strategy based on batch size (`bs`) and group count (`group`). This change improves multi-threading efficiency in convolution scenarios where scaling with core/thread count was previously limited.

The PR also updates the `bench_sconv` utility to support and evaluate multi-threaded performance under the new partitioning strategy, e.g. `numactl -C core_num0-core_num_1 ./onnxruntime_mlas_benchmark --benchmark_filter=Teams`.

Compared to the current master implementation, the optimized version shows a nearly 3× performance improvement and scales effectively with thread count. In contrast, the master branch shows no meaningful gain when increasing the number of threads, due to insufficient parallelization in the original implementation.
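The partitioning idea described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the real implementation dispatches work through MLAS's internal thread pool, and the names `RunConvSlice` and `ParallelConv` here are hypothetical. The point is simply that the `bs * group` independent slices form a flat iteration space that can be split evenly across workers, which is what gives the extra parallelism when `bs` or `group` is larger than one:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-slice worker: stands in for computing the convolution
// of one (batch, group) pair. Here it just counts completed slices.
static void RunConvSlice(std::size_t batch, std::size_t group,
                         std::atomic<std::size_t>& done) {
    (void)batch;
    (void)group;
    done.fetch_add(1, std::memory_order_relaxed);
}

// Partition the bs * group_count independent slices across num_threads
// workers. Each worker takes a contiguous chunk of the flattened
// (batch, group) index space, so no two threads share a slice.
static std::size_t ParallelConv(std::size_t bs, std::size_t group_count,
                                std::size_t num_threads) {
    const std::size_t total = bs * group_count;
    std::atomic<std::size_t> done{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles flattened indices [begin, end).
            const std::size_t begin = total * t / num_threads;
            const std::size_t end = total * (t + 1) / num_threads;
            for (std::size_t i = begin; i < end; ++i) {
                RunConvSlice(i / group_count, i % group_count, done);
            }
        });
    }
    for (auto& w : workers) w.join();
    return done.load();
}
```

With this scheme, a small spatial problem with `bs = 4` and `group = 8` still exposes 32 independent units of work, whereas partitioning only within a single image would leave most threads idle.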
Motivation and Context

`MlasConv` exhibited minimal performance gains when increasing the number of threads or CPU cores in scenarios with small batch sizes or grouped convolutions. With this change, `bench_sconv` results show a noticeable performance improvement in multi-threaded runs, especially on multi-core CPUs.

Related Issues

#25152