Slower Matrix multiplication than numpy #2660

Open
taalhaataahir0102 opened this issue May 15, 2024 · 5 comments
Labels: bug, mojo-repo

taalhaataahir0102 commented May 15, 2024

Bug description

I've tried running the Mojo matmul example from the repository's examples directory (https://github.com/modularml/mojo/blob/main/examples/matmul.mojo).
The output shows that even the most optimized matrix multiplication in Mojo is still 3 to 4 times slower than NumPy. Here are the results:

CPU Results

Python:         0.003 GFLOPS
Numpy:        363.124 GFLOPS
Naive:          6.225 GFLOPS   2138.28x Python
Vectorized:    22.350 GFLOPS   7677.58x Python
Parallelized: 102.933 GFLOPS  35358.39x Python
Tiled:        104.982 GFLOPS  36062.43x Python
Unrolled:     107.915 GFLOPS  37069.74x Python

Could someone please explain the performance difference I'm seeing?
Matrix multiplication is the most common operation in machine learning, and I've noticed it's slower in Mojo than in NumPy. NumPy, a Python library wrapping optimized C code, is what most people actually use for AI tasks (rather than matmul written in pure Python). Given that, what's the motivation for using Mojo if it doesn't perform as well as conventional Python code using NumPy?

Steps to reproduce

git clone https://github.com/modularml/mojo.git
cd mojo/examples
mojo build matmul.mojo
./matmul


System information

OS: Ubuntu 22.04.3 LTS
Mojo version: mojo 24.3.0 (9882e19d)
Modular version: modular 0.7.4 (df7a9e8b)
taalhaataahir0102 added the bug and mojo-repo labels on May 15, 2024
LJ-9801 (Contributor) commented May 16, 2024

Mojo is not faster than NumPy for now, but it will be. The point of Mojo is to be a system-level language that provides the same performance as C/C++, if not more, without having to write convoluted C code. It is still a very young language with much to improve.

MoSafi2 (Contributor) commented May 16, 2024

@taalhaataahir0102 NumPy (in most cases) delegates linear algebra operations to optimized libraries like BLAS and LAPACK; the same approach is taken by other languages like Julia, which also delegates its linalg operations. So the matmul in NumPy is not "just" NumPy. These optimized libraries also have different implementations depending on factors like the shape, size, and data type of the matrix, etc.
https://numpy.org/doc/stable/reference/routines.linalg.html
The example provided in the Mojo repo is a toy example showing what kinds of optimizations are possible in the language itself, with zero dependencies on third-party libraries. The performance attained by BLAS is possible to achieve and surpass in Mojo (currently implemented in the closed-source MAX engine), and maybe it can be implemented in open source as more people become interested in the language.
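For reference, NumPy can report which BLAS/LAPACK backend it delegates to. A minimal sketch using Mojo's Python interop (np.show_config() is a standard NumPy function; the rest follows the interop pattern used later in this thread):

from python import Python

def main():
    # Print the BLAS/LAPACK build configuration NumPy was linked against
    # (e.g. OpenBLAS, MKL, or AOCL on AMD systems).
    var np = Python.import_module("numpy")
    np.show_config()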

jackos (Collaborator) commented May 28, 2024

Thanks for clarifying. Also note that the matmul used for the MAX engine is much faster than NumPy, as it's optimized for different CPUs: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. We should check if MAX is installed and run the fully optimized version to compare performance in this example.

taalhaataahir0102 (Author) commented
Thanks @jackos for the reply. I was trying the MAX Graph API: I created a simple graph that takes two tensors of shape (100, 100) as input and multiplies them, and ran it 1000 times. Similarly, I created two random NumPy arrays and multiplied them 1000 times. NumPy still performs better (around 20-25x faster than the MAX Graph API). Here's my code:

from max.graph import Graph, TensorType, Type
from max import engine
from tensor import Tensor, TensorShape, randn
from time import now
from python import Python

def main():
    # Build a graph with two (100, 100) float32 inputs and a single matmul node.
    graph = Graph(in_types=List[Type](TensorType(DType.float32, 100, 100), TensorType(DType.float32, 100, 100)))
    out = graph[0] @ graph[1]
    graph.output(out)
    graph.verify()

    # Compile the graph into an executable model.
    session = engine.InferenceSession()
    model = session.load(graph)

    var input0 = randn[DType.float32]((100, 100))
    var input1 = randn[DType.float32]((100, 100))
    var start = now()
    for i in range(1000):
        var ret = model.execute("input0", input0, "input1", input1)
    var end = now()

    # now() reports nanoseconds; convert to seconds.
    var execution_time_seconds: Float32 = (end - start) / 1000000000
    print("MAX GRAPH API:", execution_time_seconds)

    # Same measurement with NumPy through Python interop.
    # Note: np.random.rand returns float64, so this times a float64 matmul,
    # while the graph above runs in float32.
    var np = Python.import_module("numpy")
    array1 = np.random.rand(100, 100)
    array2 = np.random.rand(100, 100)
    var start1 = now()
    for i in range(1000):
        var result = np.dot(array1, array2)
    var end1 = now()

    var execution_time_seconds1: Float32 = (end1 - start1) / 1000000000
    print("NUMPY:", execution_time_seconds1)

And these are the results (in seconds):
[screenshot of timing output]

Can you please confirm if there's something I'm doing wrong? Here are my system details:
MAX version: max 24.3.0 (9882e19d)
Mojo version: mojo 24.3.0 (9882e19d)
OS: Ubuntu 22.04.3 LTS
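
One thing worth ruling out (my assumption, not something confirmed in this thread): the first model.execute call may include one-time setup cost, so an untimed warm-up call before starting the timer would make the measured loop cleaner:

    # Hypothetical tweak: run once before timing so any lazy
    # initialization is excluded from the measured loop.
    var warmup = model.execute("input0", input0, "input1", input1)
    var start = now()
    for i in range(1000):
        var ret = model.execute("input0", input0, "input1", input1)
    var end = now()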

martinvuyk commented
> 2 tensors of shape 100,100 as input and multiply them 1000 times

Try bigger arrays; the NumPy implementation (AOCL-BLAS / AOCL-LAPACK for your AMD CPU) probably has optimizations for small arrays that the Modular team may not have added yet (see the sketch below). Or it's cheating and using the rocBLAS AMD GPU implementation XD.
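
For example, a quick way to see whether the gap narrows with size. This is a sketch following the timing pattern from the code above; the 1024x1024 shape and 100 iterations are my choices, not numbers from the thread:

from python import Python
from time import now

def main():
    var np = Python.import_module("numpy")
    # 1024x1024 instead of 100x100: large enough that any small-matrix
    # shortcuts in the BLAS backend should no longer dominate.
    array1 = np.random.rand(1024, 1024)
    array2 = np.random.rand(1024, 1024)
    var start = now()
    for i in range(100):
        var result = np.dot(array1, array2)
    var end = now()
    var seconds: Float32 = (end - start) / 1000000000
    print("NUMPY 1024x1024, 100 iterations:", seconds, "s")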

> Here are my system details

In my case, Mojo doesn't use every thread available on my CPU, even though it has the ISA extensions Mojo expects. I raised issue #2398 but never got an answer. As far as I can see, there are some issues on consumer-grade hardware that simply are not a priority for them. If you have some kind of workstation, you should try it there.

After a quick Google search, it seems your Ryzen 9 5900HX also has the required ISA extensions.
BTW, don't post your Linux username on a public forum.
