
Review usage of Rayon & improve performance #420

Closed
Keats opened this issue Sep 11, 2018 · 74 comments

@Keats (Collaborator) commented Sep 11, 2018

It looks like the site loading is done in parallel but the rendering is not: threads are spawned but only one seems used.

TODOs:

There are some benches in site/benches, but you will need to run gen.py first to generate the test sites. Given the current speed, medium-blog and medium-kb are probably the best ones to run.

@Freaky (Contributor) commented Sep 11, 2018

Heh. Spent a few minutes on this. Before:

Building site...
load sections 7.057634ms
load pages 7.569582ms
register_early_global_fns 47.949µs
28 pages
render pages 1.235751584s
5 sections
render sections 114.522µs
render_markdown 1.236066107s
populate_sections 4.705176ms
populate_taxonomies 990.847µs
register_tera_global_fns 3.33598ms
-> Creating 28 pages (0 orphan), 4 sections, and processing 0 images
Done in 1.4s.

After:

Building site...
load sections 6.787299ms
load pages 7.347483ms
register_early_global_fns 123.304µs
28 pages
render pages 172.197069ms
5 sections
render sections 505.971µs
render_markdown 172.937422ms
populate_sections 5.612221ms
populate_taxonomies 2.353669ms
register_tera_global_fns 4.012975ms
-> Creating 28 pages (0 orphan), 4 sections, and processing 0 images
Done in 312ms.

This call to par_iter_mut() seems problematic, since this is pretty much all I did:

diff --git components/site/src/lib.rs components/site/src/lib.rs
index 775d534..2ee3f93 100644
--- components/site/src/lib.rs
+++ components/site/src/lib.rs
@@ -271,13 +271,12 @@ impl Site {
             pages_insert_anchors.insert(p.file.path.clone(), self.find_parent_section_insert_anchor(&p.file.parent.clone()));
         }
 
-        self.pages.par_iter_mut()
+        self.pages.iter_mut()
             .map(|(_, page)| {
                 let insert_anchor = pages_insert_anchors[&page.file.path];
                 page.render_markdown(permalinks, tera, config, base_path, insert_anchor)
             })
-            .fold(|| Ok(()), Result::and)
-            .reduce(|| Ok(()), Result::and)?;
+            .collect::<Result<()>>()?;
 
         self.sections.par_iter_mut()
             .map(|(_, section)| section.render_markdown(permalinks, tera, config, base_path))
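For context, collecting an iterator of Result into Result<()> short-circuits on the first Err, which is what the fold/reduce pair was emulating. A minimal standalone sketch (the types are made up for illustration):

    fn render_all(pages: &mut [String]) -> Result<(), String> {
        pages
            .iter_mut()
            .map(|page| {
                if page.is_empty() {
                    return Err("empty page".to_string());
                }
                page.push_str(" [rendered]");
                Ok(())
            })
            // FromIterator for Result stops at the first Err, so this runs
            // every render and propagates the first failure in one pass
            .collect::<Result<(), String>>()
    }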

@Keats (Collaborator, Author) commented Sep 12, 2018

That's interesting :o Just to be sure though: you're not running it in --release mode, right? That might change things.

Can you push the branch with timings on? Looks like a good base to experiment on. From what I've seen with more pages (1000-10000), ~95% of the time is spent in https://github.com/Keats/gutenberg/blob/ae7a65b51f3dda4d6789483e930574437c6651e6/components/site/src/lib.rs#L850-L855 writing the pages to disk, which should be pretty fast in theory.

@Freaky (Contributor) commented Sep 12, 2018

That is in release mode, and it's generating my own site. gutenberg build times on the blog and kb benchmarks are less impressive, but still no worse.

(The actual cargo bench benchmarks seem of dubious value. I stopped the rendering one after the first test ate 2 hours of CPU time and was still going.)

This is on a dual hex-core Westmere Xeon, so the default thread count is 24. A quick once-over of syscall activity suggests a lot of contention and yielding at higher thread counts, but cutting it down with RAYON_NUM_THREADS shows negative scaling after just two, and still provides less benefit than simply serializing render_page.

@Freaky (Contributor) commented Sep 12, 2018

I'll push a test branch in a bit, got soup to make and eat first :)

@Keats (Collaborator, Author) commented Sep 12, 2018

I think https://docs.rs/rayon/1.0.2/rayon/iter/trait.IndexedParallelIterator.html#method.with_min_len should help avoid wasting time parallelizing small workloads. Rendering 10k pages from a single section, however, should still be done concurrently.
I won't have time to look into that too much this week myself, hopefully next week.
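The shape would be something like this (a sketch; Page and the floor of 64 are placeholders, not actual gutenberg code):

    use rayon::prelude::*;

    struct Page {
        body: String,
    }

    fn render_all(pages: &mut [Page]) {
        pages
            .par_iter_mut()
            // never split below 64 pages per work unit, so small sites
            // aren't taxed with splitting/stealing overhead for no gain
            .with_min_len(64)
            .for_each(|page| page.body.push_str(" [rendered]"));
    }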

@Freaky (Contributor) commented Sep 12, 2018

medium-kb (1000 pages):

  • Baseline: 3.9s
  • Sequential pages: 3.9s
  • with_min_len(500): 3.7s
  • with_min_len(250): 3.3s
  • with_min_len(125): 3.6s

The scaling is just pants.

@Keats (Collaborator, Author) commented Sep 12, 2018

Something feels wrong; surely with n threads it should be faster than a single one, since they don't do any locking... Maybe I'm using Rayon wrongly.

Freaky added a commit to Freaky/zola that referenced this issue Sep 12, 2018
@Freaky (Contributor) commented Sep 12, 2018

Tried a quick hack with a crossbeam scope and channel and saw basically the same thing, with scaling for rendering stopping around 4 threads and going negative soon after. So whatever the problem is, it doesn't seem Rayon-specific.
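The hack was roughly this shape (a sketch against modern crossbeam, not the exact branch):

    use crossbeam::channel::unbounded;

    fn render_all(pages: Vec<String>, workers: usize) {
        let (tx, rx) = unbounded::<String>();
        crossbeam::scope(|scope| {
            for _ in 0..workers {
                let rx = rx.clone();
                scope.spawn(move |_| {
                    for page in rx.iter() {
                        // stand-in for the real markdown rendering
                        let _rendered = format!("{} [rendered]", page);
                    }
                });
            }
            for page in pages {
                tx.send(page).unwrap();
            }
            drop(tx); // close the channel so the workers drain and exit
        })
        .unwrap();
    }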

@Freaky (Contributor) commented Sep 12, 2018

Dear me.

components/site/src/lib.rs:165:   208.96µs - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:   692.05ms - get_all_orphan_pages for each page
-> Creating 1000 pages (0 orphan), 10 sections, and processing 0 images
components/site/src/lib.rs:588:   166.38ms - build clean
components/site/src/lib.rs:593:   349.64ms - build render_sections
components/site/src/lib.rs:165:   240.11µs - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:   707.50ms - get_all_orphan_pages for each page
components/site/src/lib.rs:596:   707.88ms - build render_orphan_pages

A build on medium-kb spends about half its time just checking orphans by repeatedly searching a vec. Twice. I dread to think how long it would take on a huge site :/

After replacing the vec with a HashSet:

components/site/src/lib.rs:165:     1.48ms - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:     2.92ms - get_all_orphan_pages for each page
-> Creating 1000 pages (0 orphan), 10 sections, and processing 0 images
components/site/src/lib.rs:588:   169.72ms - build clean
components/site/src/lib.rs:593:   357.97ms - build render_sections
components/site/src/lib.rs:165:     1.51ms - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:     3.05ms - get_all_orphan_pages for each page
components/site/src/lib.rs:596:     4.69ms - build render_orphan_pages

The improvement increases with larger sites.
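The fix is essentially this shape (names illustrative, not the exact zola code):

    use std::collections::HashSet;
    use std::path::{Path, PathBuf};

    fn get_all_orphan_pages<'a>(
        all_pages: &'a [PathBuf],
        pages_in_sections: &[&Path],
    ) -> Vec<&'a Path> {
        // O(1) average membership tests instead of scanning a Vec for
        // every page, which is O(n²) over the whole site
        let in_sections: HashSet<&Path> = pages_in_sections.iter().copied().collect();
        all_pages
            .iter()
            .map(|p| p.as_path())
            .filter(|p| !in_sections.contains(p))
            .collect()
    }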

@Freaky (Contributor) commented Sep 12, 2018

Heh, huge-kb goes from 150 seconds to 9 :)

@Freaky (Contributor) commented Sep 12, 2018

Pull request in #424.

@Keats (Collaborator, Author) commented Sep 13, 2018

I guess this can be closed now :o
I should probably still set some min_len to avoid spinning up threads for nothing though.

@Keats (Collaborator, Author) commented Sep 13, 2018

Just saw Freaky@986fda25e66a04a0b6751f305acee341bc0318c8. Do you want to do a PR with it as well?

@Freaky (Contributor) commented Sep 13, 2018

Might be worth using dedicated thread pools for IO and rendering, rather than just throwing everything on the global rayon pool.

From what I've seen the IO-bound stuff scales fairly well (at least with an SSD or from cache; HDDs might disagree), while rendering bottlenecks quite quickly, at least on my machine.

with_min_len is just going to band-aid specific cases: it'll leave available concurrency on the floor for small sites and will still end up using too much on larger ones. Similarly, low-core-count systems could well benefit from having more threads than cores if they're spending much of their time waiting for reads.
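Concretely, dedicated pools would look something like this (a sketch; the pool sizes are made up):

    use rayon::{ThreadPool, ThreadPoolBuilder};

    fn build_pools() -> (ThreadPool, ThreadPool) {
        // a wide pool for IO-bound loading, which scales fairly well...
        let io = ThreadPoolBuilder::new().num_threads(16).build().unwrap();
        // ...and a narrow one for rendering, which bottlenecks early
        let render = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
        (io, render)
    }

    // usage: io.install(|| load_site()); render.install(|| render_site());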

@Freaky (Contributor) commented Sep 13, 2018

PR #427 for the fold/reduce → collect tweak.

@Keats (Collaborator, Author) commented Sep 18, 2018

I did a few more tests and while huge-kb is now very fast to render, big-blog is still slow: 44s on my machine.
Loading the site (Site::load) is super fast even for big sites (maybe 1s of the 44s above) and the rest of the time is spent writing to disk.
43s for writing 1000 files to disk means something is definitely going wrong somewhere. Adding some measurements, it turns out that each pager page rendered by Site::render_paginated takes 800ms-2s on my machine. par_iter does help here; it takes 100s otherwise. I couldn't find a particular reason for paginator.render_pager to be slow though, individual parts seem fast enough. That's going to be for another day!

@Keats changed the title from "Review usage of Rayon" to "Review usage of Rayon & improve performances" Sep 18, 2018
@Keats (Collaborator, Author) commented Sep 18, 2018

I also had a look at replacing some clone() calls with Cow in Sections, but it wasn't conclusive. It doesn't seem to be a big bottleneck anyway, other than for rebuilds of big sites.

@Keats changed the title from "Review usage of Rayon & improve performances" to "Review usage of Rayon & improve performance" Sep 18, 2018
@Freaky (Contributor) commented Sep 19, 2018

So, erm. huge-blog.

55788 freaky          25  52    0 19367M 19325M uwait   9  39:21 753.21% gutenberg 

It's up and down like a yoyo, peaking at 24GB, dropping to 5GB, then peaking back at 24GB, over and over. 44 seconds? I killed it after 6 minutes, with nearly 2 hours of CPU burnt roughly equally between user and system.

@Freaky (Contributor) commented Sep 19, 2018

What on earth.

-> Creating 10000 pages (0 orphan), 0 sections, and processing 0 images
render_paginated...
template_name in 6.36s
build_paginator_context in 12.12ms
render_template in 1.36s
paginator.render_pager in 8.05s
built in 8.05s

It takes over 6 seconds just to work out what name the template should have?

@Freaky (Contributor) commented Sep 19, 2018

Right. So. This bit.

        let template_name = match self.root {
            PaginationRoot::Section(s) => {
                context.insert("section", &s);
                s.get_template_name()
            }

from components/pagination/src/lib.rs. That context.insert("section", &s) bit, specifically. That serializes the current section using serde_json. The section with thousands and thousands of pages in it.

If I comment that one line out, huge-blog builds in 39 seconds and peaks at 3.9GB instead of 24.

@Keats (Collaborator, Author) commented Sep 19, 2018

I somehow missed that line yesterday 😱
The section is only there for its metadata, so we don't need the pages in it at all. I'll add Section::clone_without_page to handle that.
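Sketch of the idea with a trimmed-down Section (the fields are illustrative):

    #[derive(Clone)]
    struct Section {
        title: String,
        pages: Vec<String>, // thousands of entries on big sites
    }

    impl Section {
        // clone just the metadata the template needs; dropping the pages
        // keeps the serde_json serialization of the context cheap
        fn clone_without_page(&self) -> Section {
            Section {
                title: self.title.clone(),
                pages: Vec::new(),
            }
        }
    }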

@Keats (Collaborator, Author) commented Sep 19, 2018

After doing that (6903975), big-blog goes from 44s to 2.9s, and huge-blog goes from crashing my computer to 43s. A good 10-15x improvement, but the odd thing is that your computer with tons of cores is only barely faster than my 5-year-old dual-core laptop.

@Keats (Collaborator, Author) commented Sep 19, 2018

Looking at https://forestry.io/blog/hugo-vs-jekyll-benchmark/ it still seems Gutenberg is about 5-10x slower than Hugo, but it is at least in the same ballpark now.

@Freaky (Contributor) commented Sep 19, 2018

There's a lot of noise, but this flame graph looks... interesting.

A lot of time seems to be going into generating backtraces.

@Keats (Collaborator, Author) commented Sep 30, 2018

I removed Paginator::pagers in the next branch, and huge-blog now builds in ~10s on my machine.
It looks like #436 isn't really needed though; even with 10k pages it doesn't make a significant difference in my runs.

@Freaky how fast is it on your beefy machine?
There are still things to fix imo, like the cloning of sections/pages (Arc most of the things, or a refactoring to use indices instead), but it seems fast enough for the first performance-oriented release, thanks to you!

@Freaky (Contributor) commented Oct 1, 2018

huge-blog vs RAYON_NUM_THREADS:

  • 1: 31.1s
  • 2: 21.7s
  • 4: 16.9s
  • 6: 15.8s
  • 8: 14.7s
  • 12: 14.4s
  • 16: 14.2s
  • 24: 14.3s

Still runs into diminishing returns long before I run out of cores, but the negative scaling seems to have mostly gone.

Peak memory use is now down to 2.7GB even with full 24-thread concurrency - still fairly high, but much better than when we started, which was more like 24GB.

@Freaky (Contributor) commented Oct 1, 2018

Still seeing negative scaling on my own site:

  • 1: 0.29s
  • 2: 0.28s
  • 4: 0.39s
  • 6: 0.52s
  • 8: 0.64s
  • 12: 0.93s
  • 16: 1.22s
  • 24: 1.36s

Might be worth grabbing other real-world examples and seeing if this is a common pattern.

@Keats (Collaborator, Author) commented Oct 1, 2018

Whoa, those are some big differences.
On the docs:

  • 1: 450ms
  • 2: 370ms
  • 3: 430ms
  • 4: 480ms

My blog shows roughly the same thing as the docs site: it is fastest at 2 threads and gets worse and worse after that.
It really looks like we still need to do some tuning with Rayon after all: your site is 5x slower and the docs 25% slower at high thread counts, and I'm guessing it gets worse the more cores you have.

@Keats (Collaborator, Author) commented Oct 1, 2018

On big-site and huge-kb, RAYON_NUM_THREADS=3 is the best for me. I don't really know what to do there.

@Freaky (Contributor) commented Oct 1, 2018

Docs:

  • 1: 0.93s
  • 2: 0.75s
  • 4: 0.74s
  • 6: 0.81s
  • 8: 0.90s
  • 12: 1.20s
  • 16: 1.49s
  • 24: 2.10s

@Freaky (Contributor) commented Oct 1, 2018

https://hur.st/flame/gutenberg-docs-rayon-24-99215.svg

syntect occurs in nearly 62% of the sampled stacks. Nearly 30% in get_highlighter().

@Freaky (Contributor) commented Oct 1, 2018

40% of the total samples are in syntect::dumps::from_binary().

Instrumenting the thread_local! with different thread counts gives me:

  • 1: 77ms
  • 2: 130ms
  • 4: 310-380ms
  • 6: 450-560ms
  • 8: 590-800ms
  • 12: 880ms-1.09s
  • 24: 1.61s-1.68s
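Which is consistent with every worker paying its own deserialization: the pattern in question is per-thread initialization, roughly like this (a sketch; the real dump path differs):

    use syntect::parsing::SyntaxSet;

    thread_local! {
        // each Rayon worker runs the full from_binary() deserialization,
        // so total startup cost grows linearly with the thread count
        static SYNTAX_SET: SyntaxSet =
            syntect::dumps::from_binary(include_bytes!("syntaxes.packdump"));
    }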

@Freaky (Contributor) commented Oct 1, 2018

39.2% in Scope::new():

    pub fn new(s: &str) -> Result<Scope, ParseScopeError> {
        // every scope parse funnels through this one global mutex,
        // which contends badly once many threads are highlighting
        let mut repo = SCOPE_REPO.lock().unwrap();
        repo.build(s.trim())
    }

@Freaky (Contributor) commented Oct 2, 2018

A quick hack using syntect master looks promising: SyntaxSet is now Send + Sync and a plain lazy_static! can be used. After rebuilding the dump files it uses, a docs build now looks much better:

  • 1: 0.91s
  • 2: 0.65s
  • 4: 0.53s
  • 6: 0.49s
  • 8: 0.48s
  • 12: 0.47s
  • 16: 0.48s
  • 24: 0.50s

My personal site is much better too. Before:

  • 1: 0.29s
  • 2: 0.29s
  • 4: 0.37s
  • 6: 0.48s
  • 8: 0.61s
  • 12: 0.91s
  • 16: 1.21s
  • 24: 1.38s

After:

  • 1: 0.27s
  • 2: 0.22s
  • 4: 0.19s
  • 6: 0.19s
  • 8: 0.19s
  • 12: 0.20s
  • 16: 0.20s
  • 24: 0.21s
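The change boils down to this (a sketch; the dump path is illustrative):

    use lazy_static::lazy_static;
    use syntect::parsing::SyntaxSet;

    lazy_static! {
        // SyntaxSet is Send + Sync on syntect master, so one instance can
        // be deserialized once and shared by every worker thread
        static ref SYNTAX_SET: SyntaxSet =
            syntect::dumps::from_binary(include_bytes!("syntaxes.packdump"));
    }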

Freaky added a commit to Freaky/zola that referenced this issue Oct 2, 2018
SyntaxSet is now Send + Sync, allowing for a single instance to be
shared across all threads.  Since initialization was serialized and
per-thread previously, this is a significant win on systems with more
than a few cores, especially on smaller sites.

The regenerated syntax packdump is lacking Elm support due to a regexp
compile error.  Additional syntax support is also not yet hooked up.

This is part of getzola#420's focus on improving performance.
@Freaky (Contributor) commented Oct 2, 2018

Updated flame graph for comparison purposes: https://hur.st/flame/gutenberg-docs-rayon-24-syntect3.svg

@Keats (Collaborator, Author) commented Oct 2, 2018

That's some great news! I have been following the work on syntect, and v3 should be released soon-ish.

@caemor (Contributor) commented Oct 2, 2018

Wow these speedups look amazing <3

@Keats (Collaborator, Author) commented Oct 3, 2018

So I tried to remove the clones by using slotmap on a WIP branch: #459 (gutenberg serve is probably broken on it if content changes).

It is... actually slower than next right now. Loading is faster, but rendering clones way too much: I can't even build huge-blog anymore. I must have messed up something pretty badly, as I would have expected the number of Value clones to be roughly identical to the current Page/Section clone and serialization...

I would appreciate some pairs of eyes on that PR to spot what I am doing wrong.

The code is still very raw and has some pretty bad parts, but it builds sites correctly and passes all the tests except the ones from the rebuild crate, which might need to change quite a bit if loading becomes very cheap.
The idea is that there is a Library that holds pages and sections, which are referred to via a Key generated by slotmap. Once all the pages/sections are populated with keys, it caches the Value for each page/section, meaning that rendering will just clone it without going through serialization again.

I believe it would be possible to remove a good chunk of the cloning by passing borrowed Values to Tera::render, but that requires checking whether it's possible first and then adding the code for it.
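The rough shape of the Library (illustrative types, not the PR's exact code):

    use slotmap::{DefaultKey, SlotMap};

    struct Page {
        title: String,
    }

    // pages live in one arena and sections refer to them by Key, so
    // populate_sections shuffles small keys instead of cloning pages
    struct Library {
        pages: SlotMap<DefaultKey, Page>,
        sections: Vec<Vec<DefaultKey>>,
    }

    impl Library {
        fn add_page(&mut self, page: Page) -> DefaultKey {
            self.pages.insert(page)
        }

        fn page(&self, key: DefaultKey) -> Option<&Page> {
            self.pages.get(key)
        }
    }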

@Freaky (Contributor) commented Oct 3, 2018

Ooh, neat. huge-blog builds here, but with nearly 2x the memory use and 2x the runtime.

Flamegraph: https://hur.st/flame/gutenberg-huge-blog-slotmap-10aba2.svg

Nearly half of the runtime is in serde.

@Keats (Collaborator, Author) commented Oct 4, 2018

Yep, I get the same results with valgrind: it's all spent cloning Values. I don't really understand why it is copying twice as much as before though, I expected it to be the same :/

@Freaky (Contributor) commented Oct 4, 2018

Well, now instead of going from {Page,Section} -> Value, it's going {Page,Section} -> Value -> Value, no?

Without Tera being able to borrow it, I don't think it's really going to help.

@Keats (Collaborator, Author) commented Oct 5, 2018

@Freaky I've pushed a commit that uses Keats/tera@efb8af8 and we're back to reasonable speed o/
An additional speedup would be to have some way of passing Vec<&Value> into Tera, but that looks very messy. There might still be some silly things in the current code though!

Next step is to clean up/document the code and rewrite the rebuild component, as calling populate_sections, for example, is now ~30x faster.

@Freaky (Contributor) commented Oct 5, 2018

Yes, that's much better: half the runtime and nearly half the memory use, though the latter is slightly higher than the previous baseline (2.7GB -> 5GB -> 2.9GB).

That's only 1.5% of the memory on my main machine, but it's probably worth considering that it's also about 75% of a lot of systems, particularly cheap VPSes.

@Keats (Collaborator, Author) commented Oct 5, 2018

> That's only 1.5% of the memory on my main machine, but it's probably worth considering that it's also about 75% of a lot of systems, particularly cheap VPSes.

To be fair, you probably don't build a 10k-page site on a VPS. Smashing Magazine moved to Hugo and only had 7500 pages: https://discourse.gohugo.io/t/smashing-magazine-s-redesign-powered-by-hugo-jamstack/5826/8

@Keats (Collaborator, Author) commented Oct 5, 2018

Turns out the caching layer is actually completely useless, and therefore so is Tera's render_with_borrowed, which I should probably remove.

Caching would only help if it were somehow possible to pass Vec<&Value> for taxonomies. It might be possible to cache the SerializedSection to avoid some looping, but I'm not sure it would make a big difference.

@Keats (Collaborator, Author) commented Oct 6, 2018

#459 is ready for reviews if anyone has time!

@Keats (Collaborator, Author) commented Oct 9, 2018

Slotmap + syntect 3 are in the next branch.


@Freaky (Contributor) commented Oct 9, 2018

Excellent. I'm seeing maybe a ~5% performance bump on huge-blog, but also a ~13% reduction in memory use.

The syntect savings are as seen in my experimental branch, which is probably the biggest thing for me: 1.4s to 0.2s is quite noticeable :)

@Keats (Collaborator, Author) commented Oct 11, 2018

Latest idea: #480

This would avoid some of the repeated serialization we're doing (and very often not even using), but it's slightly worse UX, so I'm a bit conflicted.

@Keats (Collaborator, Author) commented Nov 16, 2018

Memory usage is still way too high, but that's good enough for the 0.5.0 release tomorrow (hopefully).

@tshepang (Contributor) commented

What figures are you seeing?

@Keats (Collaborator, Author) commented Nov 16, 2018

To render a blog with 10k pages, with taxonomies + pagination + syntax highlighting: around 15s and 3GB of RAM.

@Keats closed this as completed Nov 17, 2018