Slower Matrix multiplication than numpy #2660
Comments
Mojo is not faster than NumPy for now, but it will be. The point of Mojo is to be a systems-level language that provides the same, if not better, performance than C/C++ without having to write convoluted C code. It is still a very young language with much to improve.
@taalhaataahir0102 NumPy (in most cases) delegates linear algebra operations to optimized libraries like BLAS and LAPACK, the same approach taken by other languages like Julia, which also delegates its linalg operations. So the matmul in NumPy is not "just" NumPy. These optimized libraries also have different implementations depending on factors like the shape, size, and data type of the matrix, etc.
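To see which optimized backend your NumPy build actually delegates to, you can print its build configuration (this is a standard NumPy call; the exact output depends on how your NumPy was built):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was linked against,
# e.g. OpenBLAS, MKL, or a vendor library like AOCL-BLAS.
np.show_config()
```

Comparing against the listed backend (rather than "NumPy" in the abstract) makes benchmark discussions like this one much more concrete.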
Thanks for clarifying, also note that the MatMul used for the MAX engine is much faster than NumPy, as it's optimized for different CPUs: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. We should check if MAX is installed and run the fully optimized version to compare performance in this example.
Thanks @jackos for the reply. I was trying the MAX Graph API: I created a simple graph that takes two tensors of shape 100x100 as input and multiplies them 1000 times. Similarly, I created two random NumPy arrays and multiplied them 1000 times. NumPy is still performing better (around 20-25x faster than the MAX Graph API). Here's my code:
And these are the results (in seconds): Can you please confirm if there's something I'm doing wrong? Here are my system details:
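The original benchmark code isn't shown in the thread, so here is a minimal sketch of what the NumPy side presumably looks like (the 100x100 shape and 1000 iterations come from the comment above; the dtype and warm-up step are my assumptions). Including a warm-up matters, because one-time setup cost can otherwise dominate short runs:

```python
import time
import numpy as np

# Two random 100x100 matrices, as described in the comment above.
# float32 is an assumption; the poster's dtype is not stated.
a = np.random.rand(100, 100).astype(np.float32)
b = np.random.rand(100, 100).astype(np.float32)

a @ b  # warm-up, so one-time setup cost is excluded from the timing

start = time.perf_counter()
for _ in range(1000):
    c = a @ b
elapsed = time.perf_counter() - start
print(f"1000 matmuls of 100x100: {elapsed:.4f} s")
```

Note that at this size each individual matmul is very fast, so fixed per-call overhead in either framework can dominate the measurement.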
Try bigger arrays; the NumPy implementation (AOCL-BLAS / AOCL-LAPACK for your AMD CPU) probably has some optimizations for small arrays that the Modular team might not have added yet. Or it's cheating and using the rocBLAS AMD GPU implementation XD.
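The "try bigger arrays" suggestion is easy to check on the NumPy side. A quick sketch like the following (sizes chosen arbitrarily for illustration) shows how per-call cost scales with matrix size; if the gap between frameworks shrinks at larger sizes, the small-array case was dominated by fixed overhead:

```python
import time
import numpy as np

# Time a single matmul at several sizes. For small matrices,
# fixed per-call overhead dominates, which can make cross-framework
# comparisons at one small size misleading.
for n in (100, 500, 1000, 2000):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up
    t0 = time.perf_counter()
    a @ b
    dt = time.perf_counter() - t0
    print(f"{n:5d} x {n:<5d}: {dt * 1e3:8.2f} ms")
```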
In my case Mojo doesn't use every thread available on my CPU, even though it has the ISA extensions Mojo expects. I raised issue #2398 but never got an answer. There are some issues on consumer-grade hardware that simply are not a priority for them, as far as I can see. If you have some kind of workstation, you should try it there. After a quick Google search, it seems your Ryzen 9 5900HX also has the required ISA extensions.
Bug description
I've tried running the Mojo matmul file available in the examples directory of the repository (https://github.com/modularml/mojo/blob/main/examples/matmul.mojo).
The output of the file shows that the most optimized matrix multiplication in Mojo is still 3 to 4 times slower than that in Numpy. Following are the results:
Could someone please explain the performance difference I'm seeing?
The most common operation in machine learning is matrix multiplication, and I've noticed that it's slower in Mojo compared to NumPy. NumPy, a Python library wrapping optimized C code, is commonly used for AI tasks. Considering that most people use NumPy for these purposes (rather than matmul written purely in Python), what's the motivation for using Mojo if it doesn't perform as well as conventional Python code using NumPy?
Steps to reproduce
System information