
Benchmark results look incorrect? #31

Open
danieldk opened this issue Aug 17, 2020 · 2 comments

danieldk (Contributor) commented Aug 17, 2020

% export OMP_NUM_THREADS=1
% python -m blis.benchmark
Setting up data for gemm. 1000 iters,  nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.54 seconds
Numpy (openblas) gemm...
Total: 11032015.625
9.50 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.78 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510596.19140625
90.67 seconds
% unset OMP_NUM_THREADS

numpy with OpenBLAS and blis are on par for gemm. However, the benchmark does not enable intermediate optimization for numpy's einsum; that is done by passing optimize=True to np.einsum, roughly as in the sketch below.
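This is approximately what the change amounts to (a minimal standalone sketch with made-up shapes, not the actual benchmark code):

import numpy as np

x = np.random.rand(2000, 384).astype("f")
w = np.random.rand(384, 384).astype("f")

# Default: numpy runs the contraction through its generic einsum loop.
out_default = np.einsum("ab,cb->ca", x, w)
# With optimize=True, numpy can route the contraction through BLAS instead,
# which is where the large speedup for this case comes from.
out_optimized = np.einsum("ab,cb->ca", x, w, optimize=True)

Re-running the benchmark with optimize=True: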

% python -m blis.benchmark
Setting up data for gemm. 1000 iters,  nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.62 seconds
Numpy (openblas) gemm...
Total: 11032015.625
9.51 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.70 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510592.28515625
11.43 seconds

With that change, numpy's einsum is only slightly slower than blis. However, I am skeptical of the claim that parallelization does not help for inference. The matrix sizes used in the benchmark are fairly typical for inference (e.g. the standard transformer attention matrices are 768x768). Testing with 4 threads (fairly modest on current multi-core SMT CPUs):

% export OMP_NUM_THREADS=4
% python -m blis.benchmark
Setting up data for gemm. 1000 iters,  nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.77 seconds
Numpy (openblas) gemm...
Total: 11032015.625
3.40 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.83 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510592.28515625
4.53 seconds

Maybe it's worthwhile compiling blis with multi-threading support?
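If I read the upstream BLIS docs correctly, a build with threading enabled (e.g. configured with --enable-threading=openmp) still runs single-threaded by default and only uses more threads when the user explicitly asks for them, e.g.:

% export BLIS_NUM_THREADS=4

so enabling it at build time shouldn't change the current single-threaded default behavior.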

For reference:

% lscpu | grep name:
Model name:          Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
honnibal (Member) commented:

Ah, when I wrote that, the optimize flag for einsum wasn't available in numpy yet. Thanks for pointing it out!

Regarding threading in inference, the big issue is that you'll usually be able to parallelise inference at a higher level, e.g. by starting more processes to work on your data. This is generally a better strategy for inference: outside of benchmarks we're usually running inference as part of a much larger workload, and it makes sense to parallelise that whole pipeline. In a cloud context, we can also choose to use smaller instances rather than larger ones.
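For instance, something along these lines (a rough sketch with a made-up predict function, not code from this repo) usually scales better for inference than intra-op threading:

import numpy as np
from multiprocessing import Pool

def predict(batch):
    # Stand-in for a single-threaded forward pass over one chunk of data.
    weights = np.ones((384, 384), dtype="f")
    return batch @ weights

if __name__ == "__main__":
    # Split the data into chunks and let separate single-threaded processes handle them.
    batches = [np.ones((2000, 384), dtype="f") for _ in range(8)]
    with Pool(processes=4) as pool:
        results = pool.map(predict, batches)
    print(len(results), results[0].shape)

Each worker then only ever needs one BLAS thread, and the parallelism happens at the level of the whole pipeline rather than inside individual matrix multiplications.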

The position I take is that no library should launch more threads than the user explicitly requests. The default behaviour of OpenBLAS to launch as many threads as possible causes a lot of problems.

That said, I do think it'd be good to compile Blis with threading and leave it disabled until the user calls a function to increase it. But for our current purposes, the single-threading mode is good.

danieldk (Contributor, Author) commented:

> The position I take is that no library should launch more threads than the user explicitly requests. The default behaviour of OpenBLAS to launch as many threads as possible causes a lot of problems.
>
> That said, I do think it'd be good to compile Blis with threading and leave it disabled until the user calls a function to increase it. But for our current purposes, the single-threading mode is good.

Agreed. I can't say I like OpenBLAS' default behavior, which often also leads to performance regressions on many-core CPUs. Making it a user-configurable option sounds sensible.
