Slower Matrix multiplication than numpy #2660

Open
taalhaataahir0102 opened this issue May 15, 2024 · 5 comments
Labels: bug, mojo-repo

taalhaataahir0102 commented May 15, 2024

Bug description

I've tried running the Mojo matmul example from the repository's examples directory (https://github.com/modularml/mojo/blob/main/examples/matmul.mojo).
The output shows that even the most optimized matrix multiplication in Mojo is still 3 to 4 times slower than NumPy. Here are the results:

CPU Results

Python:         0.003 GFLOPS
Numpy:        363.124 GFLOPS
Naive:          6.225 GFLOPS   2138.28x Python
Vectorized:    22.350 GFLOPS   7677.58x Python
Parallelized: 102.933 GFLOPS  35358.39x Python
Tiled:        104.982 GFLOPS  36062.43x Python
Unrolled:     107.915 GFLOPS  37069.74x Python

Could someone please explain the performance difference I'm seeing?
Matrix multiplication is the most common operation in machine learning, and I've noticed it's slower in Mojo than in NumPy. NumPy, a Python library wrapping optimized C code, is what most people actually use for AI tasks (rather than matmul written in pure Python). Given that, what's the motivation for using Mojo if it doesn't perform as well as conventional Python code using NumPy?

Steps to reproduce

git clone https://github.com/modularml/mojo.git
cd mojo/examples
mojo build matmul.mojo
./matmul


System information

OS: Ubuntu 22.04.3 LTS
Mojo version: mojo 24.3.0 (9882e19d)
Modular version: modular 0.7.4 (df7a9e8b)
taalhaataahir0102 added the bug and mojo-repo labels on May 15, 2024
LJ-9801 (Contributor) commented May 16, 2024

Mojo is not faster than NumPy for now, but it will be. The point of Mojo is to be a system-level language that provides the same performance as C/C++, if not more, without having to write convoluted C code. It is still a very young language with much to improve.

MoSafi2 (Contributor) commented May 16, 2024

@taalhaataahir0102 NumPy (in most cases) delegates linear algebra operations to optimized libraries like BLAS and LAPACK; the same approach is taken by other languages like Julia, which also delegates its linalg operations. So the matmul in NumPy is not "just" NumPy. These optimized libraries also have different implementations depending on factors like the shape, size, and data type of the matrix, etc.
https://numpy.org/doc/stable/reference/routines.linalg.html
The example provided in the Mojo repo is a toy example showing what kinds of optimizations are possible in the language itself, with zero dependencies on third-party libraries. The performance attained by BLAS is possible to achieve and surpass in Mojo (currently implemented in the closed-source MAX engine), and maybe it can be implemented in open source as more people become interested in the language.
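For reference, NumPy can report which BLAS/LAPACK backend it delegates to. A minimal sketch using Mojo's Python interop (np.show_config() is a standard NumPy function; the rest follows the interop pattern used later in this thread):

from python import Python

def main():
    # Print the BLAS/LAPACK build configuration NumPy was linked against
    # (e.g. OpenBLAS, MKL, or AOCL on AMD systems).
    var np = Python.import_module("numpy")
    np.show_config()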

jackos (Collaborator) commented May 28, 2024

Thanks for clarifying. Also note that the matmul used for the MAX engine is much faster than NumPy, as it's optimized for different CPUs: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. We should check if MAX is installed and run the fully optimized version to compare performance in this example.

taalhaataahir0102 (Author) commented
Thanks @jackos for the reply. I was trying the MAX Graph API: I created a simple graph that takes two tensors of shape (100, 100) as input and multiplies them, and ran it 1000 times. Similarly, I created two random NumPy arrays and multiplied them 1000 times. NumPy still performs better (around 20-25x faster than the MAX Graph API). Here's my code:

from max.graph import Graph, TensorType, Type
from max import engine
from tensor import Tensor, TensorShape, randn
from time import now
from python import Python

def main():
    # Build a graph with two (100, 100) float32 inputs and a single matmul node.
    graph = Graph(in_types=List[Type](TensorType(DType.float32, 100, 100), TensorType(DType.float32, 100, 100)))
    out = graph[0] @ graph[1]
    graph.output(out)
    graph.verify()

    # Compile the graph into an executable model.
    session = engine.InferenceSession()
    model = session.load(graph)

    var input0 = randn[DType.float32]((100, 100))
    var input1 = randn[DType.float32]((100, 100))
    var start = now()
    for i in range(1000):
        var ret = model.execute("input0", input0, "input1", input1)
    var end = now()

    # now() reports nanoseconds; convert to seconds.
    var execution_time_seconds: Float32 = (end - start) / 1000000000
    print("MAX GRAPH API:", execution_time_seconds)

    # Same measurement with NumPy through Python interop.
    # Note: np.random.rand returns float64, so this times a float64 matmul,
    # while the graph above runs in float32.
    var np = Python.import_module("numpy")
    array1 = np.random.rand(100, 100)
    array2 = np.random.rand(100, 100)
    var start1 = now()
    for i in range(1000):
        var result = np.dot(array1, array2)
    var end1 = now()

    var execution_time_seconds1: Float32 = (end1 - start1) / 1000000000
    print("NUMPY:", execution_time_seconds1)

And these are the results (in seconds):
[screenshot of timing output]

Can you please confirm if there's something I'm doing wrong? Here are my system details:
MAX version: max 24.3.0 (9882e19d)
Mojo version: mojo 24.3.0 (9882e19d)
OS: Ubuntu 22.04.3 LTS
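
One thing worth ruling out (my assumption, not something confirmed in this thread): the first model.execute call may include one-time setup cost, so an untimed warm-up call before starting the timer would make the measured loop cleaner:

    # Hypothetical tweak: run once before timing so any lazy
    # initialization is excluded from the measured loop.
    var warmup = model.execute("input0", input0, "input1", input1)
    var start = now()
    for i in range(1000):
        var ret = model.execute("input0", input0, "input1", input1)
    var end = now()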

martinvuyk commented
> 2 tensors of shape 100,100 as input and multiply them 1000 times

Try bigger arrays; the NumPy implementation (AOCL-BLAS / AOCL-LAPACK for your AMD CPU) probably has optimizations for small arrays that the Modular team may not have added yet (see the sketch below). Or it's cheating and using the rocBLAS AMD GPU implementation XD.
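
For example, a quick way to see whether the gap narrows with size. This is a sketch following the timing pattern from the code above; the 1024x1024 shape and 100 iterations are my choices, not numbers from the thread:

from python import Python
from time import now

def main():
    var np = Python.import_module("numpy")
    # 1024x1024 instead of 100x100: large enough that any small-matrix
    # shortcuts in the BLAS backend should no longer dominate.
    array1 = np.random.rand(1024, 1024)
    array2 = np.random.rand(1024, 1024)
    var start = now()
    for i in range(100):
        var result = np.dot(array1, array2)
    var end = now()
    var seconds: Float32 = (end - start) / 1000000000
    print("NUMPY 1024x1024, 100 iterations:", seconds, "s")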

> Here are my system details

In my case, Mojo doesn't use every thread available on my CPU, even though it has the ISA extensions Mojo expects. I raised issue #2398 but never got an answer. As far as I can see, there are some issues on consumer-grade hardware that simply are not a priority for them. If you have some kind of workstation, you should try it there.

After a quick Google search, it seems your Ryzen 9 5900HX also has the required ISA extensions.
BTW, don't post your Linux username on a public forum.
