
Octave micro benchmarks don't represent language characteristics #13042

Closed
oheim opened this issue Sep 9, 2015 · 18 comments

@oheim commented Sep 9, 2015

Let me risk another approach on #2412 and #5128.

The Octave timings look rather extreme in your benchmark table (julialang.org). I have looked at the benchmark code, and it simply does not represent how one would actually program in that language. You write that the micro benchmarks are meant to “give a sense how … numerical programming in that particular language is … all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom in each language”. The Octave micro benchmarks absolutely fail that stated goal.

Let me show you how I would reasonably implement the algorithms in Octave (it probably also works in proprietary Matlab, but I can't check that). I haven't spent more than a minute thinking about each algorithm.

  1. Fibonacci: Nobody would use a recursive implementation in Octave when you can preallocate the memory and compute the number in a loop.
  2. Parseint: Why use a loop when vectorization is trivial?
  3. Mandelbrot set: meshgrid is the typical function for creating a 2D grid. Again, vectorization is trivial if indexing is used; indexing is a key language concept in Octave.
  4. Quicksort: The algorithm is much shorter, easier, and faster if indexing and recursion are used instead of in-place computation with loops.
  5. Pisum and printfd: Again, vectorization is a key language concept that almost all functions support. Not using it shows that the programmer has no clue how Octave works.

Could you please clarify what the actual purpose of your benchmarks is? I would be fine with the benchmarks saying: Loops and recursion in Octave are slow. However, the benchmark table suggests that, e.g., computing the Mandelbrot set in Octave is much slower than it actually is.

diff --git a/test/perf/micro/perf.m b/test/perf/micro/perf.m
index 6c6a846..2a1fadb 100644
--- a/test/perf/micro/perf.m
+++ b/test/perf/micro/perf.m
@@ -32,9 +32,9 @@ function perf()
        assert(issorted(sortperf(5000)))
        timeit('quicksort', @sortperf, 5000)

-       s = pisum(true);
+       s = pisumvec(true);
        assert(abs(s-1.644834071848065) < 1e-12);
-       timeit('pi_sum',@pisum, true)
+       timeit('pi_sum',@pisumvec, true)

        %s = pisumvec(true);
        %assert(abs(s-1.644834071848065) < 1e-12);
@@ -80,25 +80,29 @@ function timeit(name, func, varargin)
 end

 %% recursive fib %%
+%% (the function definition is recursive, but the algorithm uses a loop) %%

-function f = fib(n)
-    if n < 2
-        f = n;
+function f = fib (n)
+    if n <= 0
+        f = 0;
         return
-    else
-        f = fib(n-1) + fib(n-2);
     end
+    
+    f = ones (1, n);
+    for k = 3 : n
+        f(k) = f(k-2) + f(k-1);
+    end
+    f = f(n);
 end

 %% parse int %%
+%% create t random uint32 numbers, convert them all to hex and parse them %%

-function n = parseintperf(t)
-    for i = 1:t
-        n = randi([0,2^32-1],1,'uint32');
-        s = dec2hex(n);
-        m = hex2dec(s);
-        assert(m == n);
-    end
+function n = parseintperf (t)
+    n = randi ([intmin('uint32'), intmax('uint32')], t, 1, 'uint32');
+    s = dec2hex (n);
+    m = hex2dec (s);
+    assert (m == n);
 end

 %% matmul and transpose %%
@@ -109,60 +113,44 @@ end

 %% mandelbrot set: complex arithmetic and comprehensions %%

-function n = mandel(z)
-    n = 0;
+function n = mandel (z)
+    n = zeros (size (z));
     c = z;
-    for n=0:79
-        if abs(z)>2
-            return
+    for k = 1 : 80
+        idx = abs (z) <= 2;
+        if any (any (idx))
+            n(idx) ++;
+            z(idx) = z(idx) .^ 2 + c(idx);
+        else
+            break
         end
-        z = z^2+c;
     end
-    n = 80;
 end

 function M = mandelperf(ignore)
-  M = zeros(length(-2.0:.1:0.5), length(-1:.1:1));
-  count = 1;
-  for r = -2:0.1:0.5
-    for i = -1:.1:1
-      M(count) = mandel(complex(r,i));
-      count = count + 1;
-    end
-  end
+  [r, i] = meshgrid (-2 : .1 : 0.5, -1 : .1 : 1);
+  M = mandel (complex (r, i));
 end

 %% numeric vector quicksort %%

-function b = qsort(a)
-    b = qsort_kernel(a, 1, length(a));
-end
-
-function a = qsort_kernel(a, lo, hi)
-    i = lo;
-    j = hi;
-    while i < hi
-        pivot = a(floor((lo+hi)/2));
-       while i <= j
-              while a(i) < pivot, i = i + 1; end
-              while a(j) > pivot, j = j - 1; end
-              if i <= j
-                t = a(i);
-                a(i) = a(j);
-                a(j) = t;
-                i = i + 1;
-                j = j - 1;
-                     end
-       end
-        if lo < j; a=qsort_kernel(a, lo, j); end
-        lo = i;
-           j = hi;
+function b = qsort (a)
+    if length (a) <= 1
+        b = a;
+        return
     end
+    
+    pivot = a(1);
+    lo = a <= pivot;
+    hi = a >= pivot;
+    % keep pivot out of lo and hi
+    lo(1) = hi(1) = false;
+    b = vertcat (qsort (a(lo)), pivot, qsort (a(hi)));
 end

-function v = sortperf(n)
-    v = rand(n,1);
-    v = qsort(v);
+function v = sortperf (n)
+    v = rand (n, 1);
+    v = qsort (v);
 end

 %% slow pi series %%
@@ -180,7 +168,7 @@ end
 %% slow pi series, vectorized %%

 function s = pisumvec(ignore)
-    a = [1:10000]
+    a = [1:10000];
     for j=1:500
         s = sum( 1./(a.^2));
     end
@@ -224,10 +212,8 @@ end

 %% printf %%

-function printfd(n)
-    f = fopen('/dev/null','w');
-    for i = 1:n
-        fprintf(f, '%d %d\n', i, i);
-    end
-    fclose(f);
+function printfd (n)
+    f = fopen ('/dev/null', 'w');
+    fprintf (f, '%d %d\n', 1:n, (1:n)+1);
+    fclose (f);
 end

@JeffBezanson (Sponsor Member)

Please reread #5128 (comment). This is well-trod ground.

@oheim (Author) commented Sep 9, 2015

As stated above I disagree with that comment for several reasons:

  • The algorithms are the same (see my changeset). Vectorization just groups the same batch of work together (in C the compiler does that for you; in other languages you can write shorter code). For example, the Fibonacci numbers are still computed according to their recursive definition, quicksort still uses the recursive algorithm with pivoting and partitioning, Mandelbrot still counts the steps it needs until |z| > 2, and so on (a short sketch follows this list).
  • The wording is misleading (see my first post) and must be changed if you want to leave the benchmarks as they are.
  • Using C paradigms in other languages where they don't apply and calling it the “natural way to code the algorithm” ?!?
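To make the first point concrete, here is a minimal sketch (not part of the changeset above; the function names pisum_loop and pisum_vec are mine, and the real benchmark's 500-repetition timing loop is omitted) showing that the vectorized pi_sum evaluates exactly the same terms as the loop, only batched into array operations:

% loop form, as in the current benchmark kernel
function s = pisum_loop ()
    s = 0.0;
    for k = 1:10000
        s = s + 1.0 / (k * k);
    end
end

% vectorized form: the same 10000 terms and the same sum,
% grouped into element-wise array operations
function s = pisum_vec ()
    k = 1:10000;
    s = sum (1 ./ (k .^ 2));
end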

@JuliaLang JuliaLang locked and limited conversation to collaborators Sep 9, 2015
@JuliaLang JuliaLang unlocked this conversation Sep 9, 2015
@JeffBezanson (Sponsor Member)

Actually, I do think it would be interesting to compare your implementations --- they have a nice balance of vectorizing but not just calling one builtin function to do all the work. It would be good to submit your code as a pull request.

@tkelman (Contributor) commented Sep 9, 2015

and the matlab jit apparently got quite a bit smarter with 2015b, so if we have that on the benchmark machine it'll also be interesting to look at - though we all have better things to do than deal with matlab's installer

@oheim (Author) commented Sep 10, 2015

I am going to prepare a pull request. May I remove the deprecated tests, which have been commented out, to clean up the file?

@ViralBShah (Member)

The same argument has been made about R as well, and about Matlab in the past. The purpose was mainly to be able to judge how Julia compares with other language implementations at writing simple loops and other structures. Clearly, in the case of Matlab, things have improved dramatically; as reported on julia-users, it is now competitive on these implementations.

Instead of changing the existing Octave benchmarks, I would suggest having a new file that implements these benchmarks the way a current Octave user would write them to get performance. We can have the same for other languages as well. It is possible that Octave will have an LLVM backend in the future (there was a GSoC project, I think), in which case it will also improve on these same benchmarks.

@StefanKarpinski (Sponsor Member)

Clearly, in the case of Matlab, things have improved dramatically; as reported on julia-users, it is now competitive on these implementations.

The fact that these "unfair" benchmarks have improved so much in Matlab is a clear market-based indicator that the benchmarks are actually fair – Matlab's customers care enough about the performance of this kind of code that MathWorks spent a lot of money to develop a JIT that can make it faster.

@StefanKarpinski (Sponsor Member)

That being said, it seems fine to have alternate vectorized implementations of the benchmarks.

@oheim (Author) commented Sep 15, 2015

Could you please make a decision (on the mailing list, for example) about what the micro benchmarks shall be for?

  1. Compare the JIT compiler performance (as the section header and the first two sentences suggest) or
  2. Compare the language performance on particular common problems/algorithms (as the rest of the text suggests).

From a language developer's point of view the first option is a valid one, and when this intent is clearly stated in the text, the benchmark results are actually fair.

A user should prefer the second option, because you program to solve problems—you don't program to write loops. (Although some Matlab customers might think differently, as @StefanKarpinski said.)

Maybe the only way out is to have both kinds of benchmarks, as suggested by @ViralBShah.

The current JIT compiler in Octave is so trivial that you can safely neglect its existence. On the one hand, it is a difficult task to make a decent JIT compiler (like the one in Julia); on the other hand, it is rarely needed when you use a high-level language where you don't want to struggle with low-level stuff like loops. OT: There are also algorithms where Octave is faster than Matlab, see Table 4 on page 9 in this Open Access journal.

[jiahao: fixed link]

@ViralBShah (Member)

The original purpose was 1., but over time people have started asking for 2. as well. It is certainly worth having both kinds. We have to figure out how to present the results.

@JeffBezanson (Sponsor Member)

Yes, I'm all in favor of expanded, more comprehensive benchmarks.

it is rarely needed when you use a high level language where you don't want to struggle with low level stuff like loops

Fully disagree. First, JIT is beside the point. You can also combine a good static compiler with array syntax, as in fortran 90 and various C++ libraries. Second, array programming does not even give you good performance. While I'm sure your implementations are much faster in octave than what we have, my guess is they're still slower than C or Julia. Third, is that a for loop I see in the mandel and fib code? What's that low-level nonsense doing in there? Also note the repetition of (idx) in mandel. Not terribly convenient. I might counter "you program to solve problems --- you don't program to figure out how to vectorize things".
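For reference, here is a condensed side-by-side sketch of the two styles being contrasted, lightly adapted from the diff earlier in this thread (the += form stands in for the patch's ++ increment); the masked form is from the proposed patch, the scalar form from the original benchmark:

% masked (vectorized) update from the proposed patch: each step repeats
% the (idx) mask on every array it touches
idx = abs (z) <= 2;
n(idx) += 1;
z(idx) = z(idx) .^ 2 + c(idx);

% scalar update from the original benchmark: the per-point loop simply
% stops once |z| exceeds 2, with no masking needed
if abs (z) > 2
    return
end
z = z ^ 2 + c;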

@tbreloff

I think the main point that should be conveyed from the documentation is that Julia vastly outperforms the other languages in minimizing a cost function: cost(programmerEffort, programRunTime).

Each problem has a different cost function, and there are many, many problems out there that no one wants to even attempt to implement because of the complexity... you can write clean, terse Julia code to do something brand new that has never been done in C or Fortran, and that clean, terse implementation can be roughly as fast as a highly optimized and messy implementation in [insert other language]. Julia opens up the boundaries of what people will work on because you don't need to depend on other people's libraries to write fast code in a high-level language (which is what happens in vectorization-heavy languages).


@ScottPJones (Contributor)

Yes, that's the big takeaway for me too. Julia optimizes run-time performance, yes, but these days programmer time is a lot more costly, so the fact that I can get good performance much more quickly in Julia than in C/C++ etc. when writing new code is the big win. I like @tbreloff's way of expressing that (minimizing a cost function); it will appeal to the technical/scientific community!

@StefanKarpinski (Sponsor Member)

I'm not sure how much clearer we can be about the purpose of these benchmarks:

These benchmarks, while not comprehensive, do test compiler performance on a range of common code patterns, such as function calls, string parsing, sorting, numerical loops, random number generation, and array operations. It is important to note that these benchmark implementations are not written for absolute maximal performance (the fastest code to compute fib(20) is the constant literal 6765). Rather, all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom in each language. In particular, all languages use the same algorithm: the Fibonacci benchmarks are all recursive while the pi summation benchmarks are all iterative; the “algorithm” for random matrix multiplication is to call LAPACK, except where that’s not possible, such as in JavaScript. The point of these benchmarks is to compare the performance of specific algorithms across language implementations, not to compare the fastest means of computing a result, which in most high-level languages relies on calling C code.

Your argument seems not to be that we're not measuring what we claim to be, but rather that you don't want us to measure and report it on Julia's home page because the results make Octave look bad. I made this somewhat relevant comment here:

Personally, I'm not at all interested in how fast one can compute Fibonacci numbers. Yet that is one of our benchmarks. Why? Because I am very interested in how well languages support recursion – and the doubly recursive algorithm happens to be a great test of recursion, precisely because it is such a terrible way to compute Fibonacci numbers. So what would be learned by comparing an intentionally slow, excessively recursive algorithm in C and Julia against a tricky, clever, vectorized algorithm in R? Nothing at all.

The major failing of these benchmarks seems to be that they compute things that can be computed with vectorized algorithms. There do exist problems that can't be conveniently vectorized, however, and with a little stretch of the imagination you can see that these benchmarks show that those problems will be very slow in Octave whereas they will be as fast as C when written in Julia. This fact has been borne out over and over again as people have ported their iterative or recursive codes from R/Python/Matlab/Octave to Julia and seen 10-1000x speedups.

@kmsquire (Member)

It just might be that the description is a little verbose. Maybe it would be good to turn the highlighted statements into bullet points, so they're easy to see, and then add the explanation below.

@oheim (Author) commented Sep 16, 2015

I'm not sure how much clearer we can be about the purpose of these benchmarks …

After your comments and after reading the text over and over again, I understand that the only purpose was to benchmark the (JIT) compiler performance between various languages. Under this premise, the Octave implementation is perfectly fair.

However, the wording of the text should be changed, because it is easily misinterpreted by me and others (see #2412 and #5128 for examples). Most prominent is the confusion between “algorithm” and “implementation” in the text. IMHO an “algorithm” is a high-level description of how to solve a particular problem, whereas the JIT compiler benchmarks compare particular “implementations” between languages. The difference is that a particular implementation of an algorithm could use shortcuts or patterns that a language has to offer; for these benchmarks it must not, and instead each implementation must use the very same sequence of operations.

These benchmarks … test … function calls, string parsing, sorting, numerical loops, random number generation, and array operations.

I'd say yes to function calls, array operations, and numerical loops, but testing the other topics would call for different benchmarks.

… specific algorithms, expressed in a reasonable idiom in each language

The particular implementation(!) of an algorithm uses loops and function declarations to the extent the language offers them, completely suppressing the language's own idioms.

all languages use the same algorithm

All languages use the same implementation

The point of these benchmarks is to compare the performance of specific algorithms across language implementations

… specific implementations across languages

The labels in the table also reinforce the confusion between the particular problem being solved and the implementation being used: fib should be called recursive_fib, parse_int should be called parse_int_loop, quicksort should be called in-place_quicksort, mandel should be called nested_loop_mandel, and pi_sum should be called pi_sum_loop. However, a column header “Implementation” could also serve the purpose.

@StefanKarpinski (Sponsor Member)

I see the confusion now. Unfortunately, as far as I understand it, that isn't correct usage of the word "implementation" and the word "algorithm" means what you consider "implementation" to mean. I do think that we should relabel the benchmarks to give them labels that express what they're testing. For some this is a little hard because they test a few things, but I'm sure we can come up with something.

@jiahao (Member) commented Sep 19, 2015

For the purposes of comparison, I've run the modified benchmarks on our benchmark machine.

Octave 3.8.2       This PR    Current
fib                   5.40    8867.51
parse_int            13.42    9023.60
quicksort            808.85   1805.14
mandel                24.82    474.33
pi_sum                 1.55    282.63
rand_mat_stat         44.15     36.84
rand_mat_mul           1.11      1.13
