
substring_search benchmark is timing out in nightly performance run #917

Closed · stress-tess opened this issue Sep 9, 2021 · 7 comments
Labels: performance (Performance needs improving)

stress-tess commented Sep 9, 2021

@ronawho found that the substring_search benchmark is timing out in the nightly performance runs. He's offered to help track down the performance bottleneck later next week. In the meantime, he recommends switching the default problem size from 10**8 to 10**6.

Here are the times for 10**6 on 16-node-cs-hdr:

>>> arkouda substring search
numLocales = 16, N = 1,000,000
non-regex with literal substring Average time = 0.1250 sec
regex with literal substring Average time = 57.8762 sec
regex with pattern Average time = 57.1415 sec
non-regex with literal substring Average rate = 0.1415 GiB/sec
regex with literal substring Average rate = 0.0003 GiB/sec
regex with pattern Average rate = 0.0003 GiB/sec

ronawho's original comment:
#912 (comment)


ronawho commented Sep 14, 2021

I haven't been able to dive too deep into this yet, but here's what I've found so far. I ran substring_search on chapcs (single node, comm=none) and I don't see dramatically better performance there, which makes me suspect this is more about local overheads than inherent comm limitations (though there could still be some comm overhead).

On chapcs here's the performance I see:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0463 sec
regex with literal substring Average time = 9.5469 sec
regex with pattern Average time = 9.4727 sec
non-regex with literal substring Average rate = 0.3823 GiB/sec
regex with literal substring Average rate = 0.0019 GiB/sec
regex with pattern Average rate = 0.0019 GiB/sec

Looking at the current code, there are a fair number of memory allocations, so I wonder if the overhead is mostly from lots of short-lived allocations, similar to what we saw in #266 (comment). The allocations I spotted right away are the ones used to slice the segmented array and to create the string from the slice.
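As a rough illustration (a hypothetical Python sketch, not the actual Arkouda code), here is the segmented-array layout being discussed: one flat byte buffer plus an offsets array, where pulling out element i slices out and copies its bytes:

```python
# Hypothetical sketch of a segmented string array: a flat buffer of
# null-terminated bytes ("values") plus per-string start offsets.
offsets = [0, 6, 12]
values = bytearray(b"hello\x00world\x00chapel\x00")

def get_string(i):
    start = offsets[i]
    end = offsets[i + 1] - 1 if i + 1 < len(offsets) else len(values) - 1
    # Both the slice and the decode allocate a fresh copy -- one short-lived
    # allocation per element, which is the kind of overhead suspected above.
    return values[start:end].decode()

print(get_string(1))  # world
```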

The patch below eliminates some of the allocations. It's a prototype, since it only works for single-node runs, but it should be good enough for some experiments:

diff --git a/src/CastMsg.chpl b/src/CastMsg.chpl
index 621f09f..54e40d8 100644
--- a/src/CastMsg.chpl
+++ b/src/CastMsg.chpl
@@ -203,7 +203,7 @@ module CastMsg {
               } else {
                    end = oa[i+1] - 1;
               }
-              e = interpretAsString(va[start..end]) : toType;
+              e = interpretAsString(va, start, end-start) : toType;
           }
       } catch e: IllegalArgumentError {
           var errorMsg = "bad value in cast from string to %s".format(toType:string);
diff --git a/src/SegmentedArray.chpl b/src/SegmentedArray.chpl
index ce12360..75ac24c 100644
--- a/src/SegmentedArray.chpl
+++ b/src/SegmentedArray.chpl
@@ -148,7 +148,7 @@ module SegmentedArray {
         end = offsets.a[idx+1] - 1;
       }
       // Take the slice of the bytearray and "cast" it to a chpl string
-      var s = interpretAsString(values.a[start..end]);
+      var s = interpretAsString(values.a, start, end-start);
       return s;
     }

@@ -509,18 +509,18 @@ module SegmentedArray {
         when SearchMode.contains {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
             // regexp.search searches the receiving string for matches at any offset
-            h = myRegex.search(interpretAsString(va[o..#l])).matched;
+            h = myRegex.search(interpretAsString(va, o, l)).matched;
           }
         }
         when SearchMode.startsWith {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
             // regexp.match only returns a match if the start of the string matches the pattern
-            h = myRegex.match(interpretAsString(va[o..#l])).matched;
+            h = myRegex.match(interpretAsString(va, o, l)).matched;
           }
         }
         when SearchMode.endsWith {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
-            var matches = myRegex.matches(interpretAsString(va[o..#l]));
+            var matches = myRegex.matches(interpretAsString(va, o, l));
             var lastMatch: reMatch = matches[matches.size-1][0];
             // h = true iff start(lastMatch) + len(lastMatch) == len(string) (-1 to account for null byte)
             h = lastMatch.offset + lastMatch.size == l-1;
@@ -531,7 +531,7 @@ module SegmentedArray {
             // regexp.match only returns a match if the start of the string matches the pattern
             // h = true iff len(match) == len(string) (-1 to account for null byte)
             // if no match is found reMatch.size returns -1
-            h = myRegex.match(interpretAsString(va[o..#l])).size == l-1;
+            h = myRegex.match(interpretAsString(va, o, l)).size == l-1;
           }
         }
       }
@@ -638,7 +638,7 @@ module SegmentedArray {
       var rightStart: [offsets.aD] int;

       forall (o, len, i) in zip(oa, lengths, offsets.aD) with (var myRegex = _unsafeCompileRegex(delimiter)) {
-        var matches = myRegex.matches(interpretAsString(va[o..#len]));
+        var matches = myRegex.matches(interpretAsString(va, o, len));
         if matches.size < times {
           // not enough occurances of delim, the entire string stays together, and the param args
           // determine whether it ends up on the left or right
@@ -1269,19 +1269,16 @@ module SegmentedArray {
   }

   /* Convert an array of raw bytes into a Chapel string. */
-  inline proc interpretAsString(bytearray: [?D] uint(8)): string {
+  inline proc interpretAsString(bytearray: [?D] uint(8), start, size): string {
     // Byte buffer must be local in order to make a C pointer
-    var localBytes: [{0..#D.size}] uint(8) = bytearray;
-    var cBytes = c_ptrTo(localBytes);
+    var cBytes = c_ptrTo(bytearray[start]);
     // Byte buffer is null-terminated, so length is buffer.size - 1
     // The contents of the buffer should be copied out because cBytes will go out of scope
     // var s = new string(cBytes, D.size-1, D.size, isowned=false, needToCopy=true);
-    var s: string;
     try {
-      s = createStringWithNewBuffer(cBytes, D.size-1, D.size);
+      return createStringWithBorrowedBuffer(cBytes, size-1, size);
     } catch {
-      s = "<error interpreting bytes as string>";
+      return  "<error interpreting bytes as string>";
     }
-    return s;
   }
 }

That basically tries to reinterpret pointers into the segmented array directly as strings, without copies (and will break if, for instance, the index into the segmented array is remote). Here's the performance I see with that on chapcs:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0464 sec
regex with literal substring Average time = 1.2099 sec
regex with pattern Average time = 1.2219 sec
non-regex with literal substring Average rate = 0.3816 GiB/sec
regex with literal substring Average rate = 0.0146 GiB/sec
regex with pattern Average rate = 0.0145 GiB/sec

This shows some promise, but there's probably another copy that I missed in there or something.
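The borrow-instead-of-copy idea in the patch can be sketched by analogy in Python (an analogy only, not the Chapel mechanism): a memoryview borrows the underlying buffer instead of copying it, and therefore also aliases it, which is the same lifetime/locality hazard noted above:

```python
# Analogy only: a memoryview "borrows" a buffer much as the patched
# interpretAsString borrows cBytes, while a bytes slice allocates a copy.
values = bytearray(b"hello\x00world\x00")

copy_slice = bytes(values[0:5])       # allocates a fresh 5-byte copy
view_slice = memoryview(values)[0:5]  # borrows the existing buffer, no copy

# The borrowed view aliases the original storage, so changes to the buffer
# show through it -- which is why borrowing is only safe while the buffer
# stays local and alive.
values[0] = ord("H")
print(bytes(view_slice))  # b'Hello'
print(copy_slice)         # b'hello' (the copy is unaffected)
```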


ronawho commented Sep 14, 2021

Hmm, the remaining allocation is coming from the regex module here. Using some hacks to eliminate that allocation, here's what I see:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0482 sec
regex with literal substring Average time = 0.0123 sec
regex with pattern Average time = 0.0130 sec
non-regex with literal substring Average rate = 0.3668 GiB/sec
regex with literal substring Average rate = 1.4349 GiB/sec
regex with pattern Average rate = 1.3637 GiB/sec

I don't see any extra allocations at this point. I'm not positive, but there may be some other micro-optimizations available, such as avoiding UTF-8 validation when creating the strings.


ronawho commented Sep 14, 2021

Misc related notes:

  • The benchmark takes a while to start; I believe almost all of this time is spent creating digit_substr and especially test_substring. With --size=$((10**7)), it takes 4s to create digit_substr and 100s to create test_substring.
  • I think the benchmark plot is wrong. In https://chapel-lang.org/perf/arkouda/chapcs/?startdate=2021/09/11&enddate=2021/09/14&graphs=substringsearchperformance it looks like non-regex with literal substring GiB/s and regex with literal substring GiB/s have the exact same value in the nightly graph, but in the results above the regex version is a lot slower. In the Chapel testing infrastructure code that plucks values out for specific keys, we're probably doing the wrong type of regex: the search for regex with literal substring GiB/s ends up matching the non-regex with literal substring GiB/s output, so we're probably grabbing that value twice. I'll look into that.
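The suspected key mix-up can be sketched with a small Python example (hypothetical key names and extraction logic; the real Chapel testing scripts differ): an unanchored search for the regex with literal substring key also matches the longer non-regex line, so the wrong value gets plucked:

```python
# Hypothetical sketch of mis-keyed value extraction from benchmark output.
output = """\
non-regex with literal substring Average rate = 0.3823 GiB/sec
regex with literal substring Average rate = 0.0019 GiB/sec
"""

def first_match(key):
    for line in output.splitlines():
        if key in line:            # unanchored: also matches mid-line
            return line.split("= ")[1]

def anchored_match(key):
    for line in output.splitlines():
        if line.startswith(key):   # anchored at line start
            return line.split("= ")[1]

# The unanchored search finds the key inside "non-regex ..." first,
# grabbing the wrong value; anchoring fixes it.
print(first_match("regex with literal substring"))     # 0.3823 GiB/sec (wrong line!)
print(anchored_match("regex with literal substring"))  # 0.0019 GiB/sec
```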

ronawho added the performance (Performance needs improving) label Sep 15, 2021

ronawho commented Sep 16, 2021

@pierce314159 @reuster986 Just FYI my high level plan for this is:

  • Fix the Chapel benchmarking scripts to correct the benchmark plot mentioned in #917 (comment), so that we collect the right regex with literal substring GiB/s values.
  • Minimize allocations in Arkouda by creating strings that point to existing buffers instead of copying. This extends #917 (comment) to include ideas discussed in #921 (comment). We'll have to localize strings if they're remote, but I'm hoping to reuse some of the code used for SipHash to do this.
  • Update the Chapel re2 module to minimize allocations as discussed in #917 (comment). Since these are changes to Chapel, we may want to drop a copy of the regex module into the Arkouda source so that you can take advantage of these optimizations before our 1.26 release in ~6 months. This is a little unfortunate, but I don't see an easier way at the moment.
  • After this work is done, increase the nightly problem size back to 10**8 and evaluate multi-locale performance to see if more work is needed there.

As mentioned before, we're in the middle of our 1.25 release, so I'll be a bit delayed getting to this. I'm hoping to tackle it in the next 1-2 weeks, but don't hold me to that.

@pierce314159 Could you take a look at this? It also looks like #920 is adding an interpretAsBytes, so we'll probably want to make similar optimizations there.

stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Sep 20, 2021
…_search` benchmark to allow use of `ak.stick` and speed up `test_substring` construction time
stress-tess (Member, Author) commented

@ronawho, I put up PR #926, which should take care of the slow setup for digit_substr and test_substring.


ronawho commented Nov 19, 2021

Allocations were minimized in chapel-lang/chapel#18733, and there are no allocations for things like contains now. At the moment we can't really increase the nightly problem size, since the version that tests against Chapel 1.25.0 will still be slow. We're releasing a 1.25.1 shortly after Thanksgiving, and I think once nightly testing is using that we can bump the problem size.


ronawho commented Jan 20, 2022

substring_search was heavily optimized across #935, chapel-lang/chapel#18465, #931, and chapel-lang/chapel#18733, so I'm bumping the problem size back up in #1038.

Here's performance on 16-node-cs-hdr with 10**8:

>>> arkouda substring search
numLocales = 16, N = 100,000,000
non-regex with literal substring Average time = 0.2461 sec
regex with literal substring Average time = 0.0772 sec
regex with pattern Average time = 0.0422 sec
non-regex with literal substring Average rate = 7.1897 GiB/sec
regex with literal substring Average rate = 22.9269 GiB/sec
regex with pattern Average rate = 41.9599 GiB/sec

@ronawho ronawho closed this as completed Jan 20, 2022