
substring_search benchmark is timing out in nightly performance run #917

Closed · stress-tess opened this issue Sep 9, 2021 · 7 comments
Labels: performance (Performance needs improving)

stress-tess commented Sep 9, 2021

@ronawho found that the substring_search benchmark is timing out in the nightly performance runs. He's offered to help track down the performance bottleneck later next week. In the meantime, he recommends switching the default problem size from 10**8 to 10**6.

Here are the times for 10**6 on 16-node-cs-hdr:

>>> arkouda substring search
numLocales = 16, N = 1,000,000
non-regex with literal substring Average time = 0.1250 sec
regex with literal substring Average time = 57.8762 sec
regex with pattern Average time = 57.1415 sec
non-regex with literal substring Average rate = 0.1415 GiB/sec
regex with literal substring Average rate = 0.0003 GiB/sec
regex with pattern Average rate = 0.0003 GiB/sec

ronawho's original comment:
#912 (comment)


ronawho commented Sep 14, 2021

I haven't been able to dive too deep into this yet, but here's what I've found so far. I ran substring_search on chapcs (single node, comm=none) and I don't see dramatically better performance there, which makes me suspect this is more about local overheads than inherent comm limitations (though there could still be some comm overhead).

On chapcs here's the performance I see:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0463 sec
regex with literal substring Average time = 9.5469 sec
regex with pattern Average time = 9.4727 sec
non-regex with literal substring Average rate = 0.3823 GiB/sec
regex with literal substring Average rate = 0.0019 GiB/sec
regex with pattern Average rate = 0.0019 GiB/sec

Looking at the current code, there are a fair number of memory allocations, so I wonder if the overhead is mostly from lots of short-lived allocations, similar to what we saw in #266 (comment). The allocations I spotted right away are the ones used to slice the segmented array and to create the string from the slice.
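As a rough illustration (a hypothetical Python sketch, not the actual Arkouda code), here is the segmented-array layout being discussed: one flat byte buffer plus an offsets array, where pulling out element i slices out and copies its bytes:

```python
# Hypothetical sketch of a segmented string array: a flat buffer of
# null-terminated bytes ("values") plus per-string start offsets.
offsets = [0, 6, 12]
values = bytearray(b"hello\x00world\x00chapel\x00")

def get_string(i):
    start = offsets[i]
    end = offsets[i + 1] - 1 if i + 1 < len(offsets) else len(values) - 1
    # Both the slice and the decode allocate a fresh copy -- one short-lived
    # allocation per element, which is the kind of overhead suspected above.
    return values[start:end].decode()

print(get_string(1))  # world
```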

The patch below eliminates some of the allocations. It's a prototype, since it only works for single-node runs, but it should be good enough for some experiments:

diff --git a/src/CastMsg.chpl b/src/CastMsg.chpl
index 621f09f..54e40d8 100644
--- a/src/CastMsg.chpl
+++ b/src/CastMsg.chpl
@@ -203,7 +203,7 @@ module CastMsg {
               } else {
                    end = oa[i+1] - 1;
               }
-              e = interpretAsString(va[start..end]) : toType;
+              e = interpretAsString(va, start, end-start) : toType;
           }
       } catch e: IllegalArgumentError {
           var errorMsg = "bad value in cast from string to %s".format(toType:string);
diff --git a/src/SegmentedArray.chpl b/src/SegmentedArray.chpl
index ce12360..75ac24c 100644
--- a/src/SegmentedArray.chpl
+++ b/src/SegmentedArray.chpl
@@ -148,7 +148,7 @@ module SegmentedArray {
         end = offsets.a[idx+1] - 1;
       }
       // Take the slice of the bytearray and "cast" it to a chpl string
-      var s = interpretAsString(values.a[start..end]);
+      var s = interpretAsString(values.a, start, end-start);
       return s;
     }

@@ -509,18 +509,18 @@ module SegmentedArray {
         when SearchMode.contains {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
             // regexp.search searches the receiving string for matches at any offset
-            h = myRegex.search(interpretAsString(va[o..#l])).matched;
+            h = myRegex.search(interpretAsString(va, o, l)).matched;
           }
         }
         when SearchMode.startsWith {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
             // regexp.match only returns a match if the start of the string matches the pattern
-            h = myRegex.match(interpretAsString(va[o..#l])).matched;
+            h = myRegex.match(interpretAsString(va, o, l)).matched;
           }
         }
         when SearchMode.endsWith {
           forall (o, l, h) in zip(oa, lengths, hits) with (var myRegex = _unsafeCompileRegex(pattern)) {
-            var matches = myRegex.matches(interpretAsString(va[o..#l]));
+            var matches = myRegex.matches(interpretAsString(va, o, l));
             var lastMatch: reMatch = matches[matches.size-1][0];
             // h = true iff start(lastMatch) + len(lastMatch) == len(string) (-1 to account for null byte)
             h = lastMatch.offset + lastMatch.size == l-1;
@@ -531,7 +531,7 @@ module SegmentedArray {
             // regexp.match only returns a match if the start of the string matches the pattern
             // h = true iff len(match) == len(string) (-1 to account for null byte)
             // if no match is found reMatch.size returns -1
-            h = myRegex.match(interpretAsString(va[o..#l])).size == l-1;
+            h = myRegex.match(interpretAsString(va, o, l)).size == l-1;
           }
         }
       }
@@ -638,7 +638,7 @@ module SegmentedArray {
       var rightStart: [offsets.aD] int;

       forall (o, len, i) in zip(oa, lengths, offsets.aD) with (var myRegex = _unsafeCompileRegex(delimiter)) {
-        var matches = myRegex.matches(interpretAsString(va[o..#len]));
+        var matches = myRegex.matches(interpretAsString(va, o, len));
         if matches.size < times {
           // not enough occurances of delim, the entire string stays together, and the param args
           // determine whether it ends up on the left or right
@@ -1269,19 +1269,16 @@ module SegmentedArray {
   }

   /* Convert an array of raw bytes into a Chapel string. */
-  inline proc interpretAsString(bytearray: [?D] uint(8)): string {
+  inline proc interpretAsString(bytearray: [?D] uint(8), start, size): string {
     // Byte buffer must be local in order to make a C pointer
-    var localBytes: [{0..#D.size}] uint(8) = bytearray;
-    var cBytes = c_ptrTo(localBytes);
+    var cBytes = c_ptrTo(bytearray[start]);
     // Byte buffer is null-terminated, so length is buffer.size - 1
     // The contents of the buffer should be copied out because cBytes will go out of scope
     // var s = new string(cBytes, D.size-1, D.size, isowned=false, needToCopy=true);
-    var s: string;
     try {
-      s = createStringWithNewBuffer(cBytes, D.size-1, D.size);
+      return createStringWithBorrowedBuffer(cBytes, size-1, size);
     } catch {
-      s = "<error interpreting bytes as string>";
+      return  "<error interpreting bytes as string>";
     }
-    return s;
   }
 }

That basically tries to reinterpret pointers into the segmented array directly as strings, without copies (and will break if, for instance, the index into the segmented array is remote). Here's the performance I see with that on chapcs:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0464 sec
regex with literal substring Average time = 1.2099 sec
regex with pattern Average time = 1.2219 sec
non-regex with literal substring Average rate = 0.3816 GiB/sec
regex with literal substring Average rate = 0.0146 GiB/sec
regex with pattern Average rate = 0.0145 GiB/sec

This shows some promise, but there's probably another copy that I missed in there or something.
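The borrow-instead-of-copy idea in the patch can be sketched by analogy in Python (an analogy only, not the Chapel mechanism): a memoryview borrows the underlying buffer instead of copying it, and therefore also aliases it, which is the same lifetime/locality hazard noted above:

```python
# Analogy only: a memoryview "borrows" a buffer much as the patched
# interpretAsString borrows cBytes, while a bytes slice allocates a copy.
values = bytearray(b"hello\x00world\x00")

copy_slice = bytes(values[0:5])       # allocates a fresh 5-byte copy
view_slice = memoryview(values)[0:5]  # borrows the existing buffer, no copy

# The borrowed view aliases the original storage, so changes to the buffer
# show through it -- which is why borrowing is only safe while the buffer
# stays local and alive.
values[0] = ord("H")
print(bytes(view_slice))  # b'Hello'
print(copy_slice)         # b'hello' (the copy is unaffected)
```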


ronawho commented Sep 14, 2021

Hmm, the remaining allocation is coming from the regex module here. Using some hacks to eliminate that allocation, here's what I see:

>>> arkouda substring search
numLocales = 1, N = 1,000,000
non-regex with literal substring Average time = 0.0482 sec
regex with literal substring Average time = 0.0123 sec
regex with pattern Average time = 0.0130 sec
non-regex with literal substring Average rate = 0.3668 GiB/sec
regex with literal substring Average rate = 1.4349 GiB/sec
regex with pattern Average rate = 1.3637 GiB/sec

I don't see any extra allocations at this point. I'm not positive, but there may be some other micro-optimizations available, such as avoiding UTF-8 validation when creating the strings.


ronawho commented Sep 14, 2021

Misc related notes:

  • The benchmark takes a while to start; I believe almost all of this time is spent creating digit_substr and especially test_substring. With --size=$((10**7)), it takes 4s to create digit_substr and 100s to create test_substring.
  • I think the benchmark plot is wrong. In https://chapel-lang.org/perf/arkouda/chapcs/?startdate=2021/09/11&enddate=2021/09/14&graphs=substringsearchperformance it looks like non-regex with literal substring GiB/s and regex with literal substring GiB/s have the exact same value in the nightly graph, but in the results above the regex version is a lot slower. In the Chapel testing infrastructure code that plucks values out for specific keys, we're probably doing the wrong type of regex: the search for regex with literal substring GiB/s ends up matching the non-regex with literal substring GiB/s output, so we're probably grabbing that value twice. I'll look into that.
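The suspected key mix-up can be sketched with a small Python example (hypothetical key names and extraction logic; the real Chapel testing scripts differ): an unanchored search for the regex with literal substring key also matches the longer non-regex line, so the wrong value gets plucked:

```python
# Hypothetical sketch of mis-keyed value extraction from benchmark output.
output = """\
non-regex with literal substring Average rate = 0.3823 GiB/sec
regex with literal substring Average rate = 0.0019 GiB/sec
"""

def first_match(key):
    for line in output.splitlines():
        if key in line:            # unanchored: also matches mid-line
            return line.split("= ")[1]

def anchored_match(key):
    for line in output.splitlines():
        if line.startswith(key):   # anchored at line start
            return line.split("= ")[1]

# The unanchored search finds the key inside "non-regex ..." first,
# grabbing the wrong value; anchoring fixes it.
print(first_match("regex with literal substring"))     # 0.3823 GiB/sec (wrong line!)
print(anchored_match("regex with literal substring"))  # 0.0019 GiB/sec
```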

ronawho added the performance (Performance needs improving) label Sep 15, 2021

ronawho commented Sep 16, 2021

@pierce314159 @reuster986 Just FYI my high level plan for this is:

  • Fix the Chapel benchmarking scripts to correct the benchmark plot mentioned in #917 (comment), so that we collect the right regex with literal substring GiB/s values.
  • Minimize allocations in Arkouda by creating strings that point to existing buffers instead of copying. This extends #917 (comment) to include ideas discussed in #921 (comment). We'll have to localize strings if they're remote, but I'm hoping to reuse some of the code used for SipHash to do this.
  • Update the Chapel re2 module to minimize allocations as discussed in #917 (comment). Since these are changes to Chapel, we may want to drop a copy of the regex module into the Arkouda source so that you can take advantage of these optimizations before our 1.26 release in ~6 months. This is a little unfortunate, but I don't see an easier way at the moment.
  • After this work is done, increase the nightly problem size back to 10**8 and evaluate multi-locale performance to see if more work is needed there.

As mentioned before, we're in the middle of our 1.25 release, so I'll be a bit delayed getting to this. I'm hoping to tackle it in the next 1-2 weeks, but don't hold me to that.

@pierce314159 Could you take a look at this? It also looks like #920 is adding an interpretAsBytes, so we'll probably want to make similar optimizations there.

stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Sep 20, 2021
…_search` benchmark to allow use of `ak.stick` and speed up `test_substring` construction time
stress-tess (Member, Author) commented

@ronawho, I put up PR #926, which should take care of the slow setup for digit_substr and test_substring.


ronawho commented Nov 19, 2021

Allocations were minimized in chapel-lang/chapel#18733, and there are no allocations for things like contains now. At the moment we can't really increase the nightly problem size, since the version that tests against Chapel 1.25.0 will still be slow. We're releasing a 1.25.1 shortly after Thanksgiving, and I think once nightly testing is using that we can bump the problem size.


ronawho commented Jan 20, 2022

substring_search was heavily optimized across #935, chapel-lang/chapel#18465, #931, and chapel-lang/chapel#18733, so I'm bumping the problem size back up in #1038.

Here's performance on 16-node-cs-hdr with 10**8:

>>> arkouda substring search
numLocales = 16, N = 100,000,000
non-regex with literal substring Average time = 0.2461 sec
regex with literal substring Average time = 0.0772 sec
regex with pattern Average time = 0.0422 sec
non-regex with literal substring Average rate = 7.1897 GiB/sec
regex with literal substring Average rate = 22.9269 GiB/sec
regex with pattern Average rate = 41.9599 GiB/sec

@ronawho ronawho closed this as completed Jan 20, 2022