
Review usage of Rayon & improve performance #420

Closed
Keats opened this issue Sep 11, 2018 · 74 comments

@Keats (Collaborator) commented Sep 11, 2018

It looks like the site loading is done in parallel but the rendering is not: threads are spawned but only one seems used.

TODOs:

There are some benches in site/benches, but you will need to run gen.py first to generate the test sites. Given the current speed, medium-blog and medium-kb are probably the best ones to run.

@Freaky (Contributor) commented Sep 11, 2018

Heh. Spent a few minutes on this. Before:

Building site...
load sections 7.057634ms
load pages 7.569582ms
register_early_global_fns 47.949µs
28 pages
render pages 1.235751584s
5 sections
render sections 114.522µs
render_markdown 1.236066107s
populate_sections 4.705176ms
populate_taxonomies 990.847µs
register_tera_global_fns 3.33598ms
-> Creating 28 pages (0 orphan), 4 sections, and processing 0 images
Done in 1.4s.

After:

Building site...
load sections 6.787299ms
load pages 7.347483ms
register_early_global_fns 123.304µs
28 pages
render pages 172.197069ms
5 sections
render sections 505.971µs
render_markdown 172.937422ms
populate_sections 5.612221ms
populate_taxonomies 2.353669ms
register_tera_global_fns 4.012975ms
-> Creating 28 pages (0 orphan), 4 sections, and processing 0 images
Done in 312ms.

This call to par_iter_mut() seems problematic, since this is pretty much all I did:

diff --git components/site/src/lib.rs components/site/src/lib.rs
index 775d534..2ee3f93 100644
--- components/site/src/lib.rs
+++ components/site/src/lib.rs
@@ -271,13 +271,12 @@ impl Site {
             pages_insert_anchors.insert(p.file.path.clone(), self.find_parent_section_insert_anchor(&p.file.parent.clone()));
         }
 
-        self.pages.par_iter_mut()
+        self.pages.iter_mut()
             .map(|(_, page)| {
                 let insert_anchor = pages_insert_anchors[&page.file.path];
                 page.render_markdown(permalinks, tera, config, base_path, insert_anchor)
             })
-            .fold(|| Ok(()), Result::and)
-            .reduce(|| Ok(()), Result::and)?;
+            .collect::<Result<()>>()?;
 
         self.sections.par_iter_mut()
             .map(|(_, section)| section.render_markdown(permalinks, tera, config, base_path))
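For context, collecting an iterator of Result into Result<()> short-circuits on the first Err, which is what the fold/reduce pair was emulating. A minimal standalone sketch (the types are made up for illustration):

    fn render_all(pages: &mut [String]) -> Result<(), String> {
        pages
            .iter_mut()
            .map(|page| {
                if page.is_empty() {
                    return Err("empty page".to_string());
                }
                page.push_str(" [rendered]");
                Ok(())
            })
            // FromIterator for Result stops at the first Err, so this runs
            // every render and propagates the first failure in one pass
            .collect::<Result<(), String>>()
    }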

@Keats (Collaborator, Author) commented Sep 12, 2018

That's interesting :o Just to be sure though: you're not running it in --release mode, right? That might change things.

Can you push the branch with timings on? Looks like a good base to experiment on. From what I've seen with more pages (1000-10000), ~95% of the time is spent in https://github.com/Keats/gutenberg/blob/ae7a65b51f3dda4d6789483e930574437c6651e6/components/site/src/lib.rs#L850-L855 writing the pages to disk, which should be pretty fast in theory.

@Freaky (Contributor) commented Sep 12, 2018

That is in release mode, and it's generating my own site. gutenberg build times on the blog and kb benchmarks are less impressive, but still no worse.

(The actual cargo bench benchmarks seem of dubious value. I stopped the rendering one after the first test ate 2 hours of CPU time and was still going.)

This is on a dual hex-core Westmere Xeon, so the default thread count is 24. A quick once-over of syscall activity suggests a lot of contention and yielding at higher thread counts, but cutting it down with RAYON_NUM_THREADS shows negative scaling after just two, and still provides less benefit than simply serializing render_page.

@Freaky (Contributor) commented Sep 12, 2018

I'll push a test branch in a bit, got soup to make and eat first :)

@Keats (Collaborator, Author) commented Sep 12, 2018

I think https://docs.rs/rayon/1.0.2/rayon/iter/trait.IndexedParallelIterator.html#method.with_min_len should help avoid wasting time parallelizing small workloads. Rendering 10k pages from a single section, however, should still be done concurrently.
I won't have time to look into that too much this week myself, hopefully next week.
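The shape would be something like this (a sketch; Page and the floor of 64 are placeholders, not actual gutenberg code):

    use rayon::prelude::*;

    struct Page {
        body: String,
    }

    fn render_all(pages: &mut [Page]) {
        pages
            .par_iter_mut()
            // never split below 64 pages per work unit, so small sites
            // aren't taxed with splitting/stealing overhead for no gain
            .with_min_len(64)
            .for_each(|page| page.body.push_str(" [rendered]"));
    }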

@Freaky (Contributor) commented Sep 12, 2018

medium-kb (1000 pages):

  • Baseline: 3.9s
  • Sequential pages: 3.9s
  • with_min_len(500): 3.7s
  • with_min_len(250): 3.3s
  • with_min_len(125): 3.6s

The scaling is just pants.

@Keats (Collaborator, Author) commented Sep 12, 2018

Something feels wrong; surely with n threads it should be faster than a single one, since they don't do any locking... Maybe I'm using Rayon wrongly.

Freaky added a commit to Freaky/zola that referenced this issue Sep 12, 2018
@Freaky (Contributor) commented Sep 12, 2018

Tried a quick hack with a crossbeam scope and channel and saw basically the same thing, with scaling for rendering stopping around 4 threads and going negative soon after. So whatever the problem is, it doesn't seem Rayon-specific.
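The hack was roughly this shape (a sketch against modern crossbeam, not the exact branch):

    use crossbeam::channel::unbounded;

    fn render_all(pages: Vec<String>, workers: usize) {
        let (tx, rx) = unbounded::<String>();
        crossbeam::scope(|scope| {
            for _ in 0..workers {
                let rx = rx.clone();
                scope.spawn(move |_| {
                    for page in rx.iter() {
                        // stand-in for the real markdown rendering
                        let _rendered = format!("{} [rendered]", page);
                    }
                });
            }
            for page in pages {
                tx.send(page).unwrap();
            }
            drop(tx); // close the channel so the workers drain and exit
        })
        .unwrap();
    }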

@Freaky (Contributor) commented Sep 12, 2018

Dear me.

components/site/src/lib.rs:165:   208.96µs - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:   692.05ms - get_all_orphan_pages for each page
-> Creating 1000 pages (0 orphan), 10 sections, and processing 0 images
components/site/src/lib.rs:588:   166.38ms - build clean
components/site/src/lib.rs:593:   349.64ms - build render_sections
components/site/src/lib.rs:165:   240.11µs - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:   707.50ms - get_all_orphan_pages for each page
components/site/src/lib.rs:596:   707.88ms - build render_orphan_pages

A build on medium-kb spends about half its time just checking orphans by repeatedly searching a vec. Twice. I dread to think how long it would take on a huge site :/

After replacing the vec with a HashSet:

components/site/src/lib.rs:165:     1.48ms - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:     2.92ms - get_all_orphan_pages for each page
-> Creating 1000 pages (0 orphan), 10 sections, and processing 0 images
components/site/src/lib.rs:588:   169.72ms - build clean
components/site/src/lib.rs:593:   357.97ms - build render_sections
components/site/src/lib.rs:165:     1.51ms - get_all_orphan_pages pages_in_sections
components/site/src/lib.rs:172:     3.05ms - get_all_orphan_pages for each page
components/site/src/lib.rs:596:     4.69ms - build render_orphan_pages

The improvement increases with larger sites.
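The fix is essentially this shape (names illustrative, not the exact zola code):

    use std::collections::HashSet;
    use std::path::{Path, PathBuf};

    fn get_all_orphan_pages<'a>(
        all_pages: &'a [PathBuf],
        pages_in_sections: &[&Path],
    ) -> Vec<&'a Path> {
        // O(1) average membership tests instead of scanning a Vec for
        // every page, which is O(n²) over the whole site
        let in_sections: HashSet<&Path> = pages_in_sections.iter().copied().collect();
        all_pages
            .iter()
            .map(|p| p.as_path())
            .filter(|p| !in_sections.contains(p))
            .collect()
    }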

@Freaky (Contributor) commented Sep 12, 2018

Heh, huge-kb goes from 150 seconds to 9 :)

@Freaky (Contributor) commented Sep 12, 2018

Pull request in #424.

@Keats (Collaborator, Author) commented Sep 13, 2018

I guess this can be closed now :o
I should probably still set some min_len to avoid spinning up threads for nothing though.

@Keats (Collaborator, Author) commented Sep 13, 2018

Just saw Freaky@986fda25e66a04a0b6751f305acee341bc0318c8. Do you want to do a PR with it as well?

@Freaky (Contributor) commented Sep 13, 2018

Might be worth using dedicated thread pools for IO and rendering, rather than just throwing everything on the global rayon pool.

From what I've seen the IO-bound stuff scales fairly well (at least with an SSD or from cache; HDDs might disagree), while rendering bottlenecks quite quickly, at least on my machine.

with_min_len is just going to band-aid specific cases: it'll leave available concurrency on the floor for small sites and will still end up using too much on larger ones. Similarly, low-core-count systems could well benefit from having more threads than cores if they're spending much of their time waiting for reads.
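Concretely, dedicated pools would look something like this (a sketch; the pool sizes are made up):

    use rayon::{ThreadPool, ThreadPoolBuilder};

    fn build_pools() -> (ThreadPool, ThreadPool) {
        // a wide pool for IO-bound loading, which scales fairly well...
        let io = ThreadPoolBuilder::new().num_threads(16).build().unwrap();
        // ...and a narrow one for rendering, which bottlenecks early
        let render = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
        (io, render)
    }

    // usage: io.install(|| load_site()); render.install(|| render_site());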

@Freaky (Contributor) commented Sep 13, 2018

PR #427 for the fold/reduce → collect tweak.

@Keats (Collaborator, Author) commented Sep 18, 2018

I did a few more tests and while huge-kb is now very fast to render, big-blog is still slow: 44s on my machine.
Loading the site (Site::load) is super fast even for big sites (maybe 1s of the 44s above) and the rest of the time is spent writing to disk.
43s for writing 1000 files to disk means something is definitely going wrong somewhere. Adding some measurements, it turns out that each pager page rendered by Site::render_paginated takes 800ms-2s on my machine. par_iter does help here; it takes 100s otherwise. I couldn't find a particular reason for paginator.render_pager to be slow though, individual parts seem fast enough. That's going to be for another day!

@Keats changed the title from "Review usage of Rayon" to "Review usage of Rayon & improve performances" Sep 18, 2018
@Keats (Collaborator, Author) commented Sep 18, 2018

I also had a look at replacing some clone() calls with Cow in Sections, but it wasn't conclusive. It doesn't seem to be a big bottleneck anyway, other than for rebuilds of big sites.

@Keats changed the title from "Review usage of Rayon & improve performances" to "Review usage of Rayon & improve performance" Sep 18, 2018
@Freaky (Contributor) commented Sep 19, 2018

So, erm. huge-blog.

55788 freaky          25  52    0 19367M 19325M uwait   9  39:21 753.21% gutenberg 

It's up and down like a yoyo, peaking at 24GB, dropping to 5GB, then peaking back at 24GB, over and over. 44 seconds? I killed it after 6 minutes, with nearly 2 hours of CPU burnt roughly equally between user and system.

@Freaky (Contributor) commented Sep 19, 2018

What on earth.

-> Creating 10000 pages (0 orphan), 0 sections, and processing 0 images
render_paginated...
template_name in 6.36s
build_paginator_context in 12.12ms
render_template in 1.36s
paginator.render_pager in 8.05s
built in 8.05s

It takes over 6 seconds just to work out what name the template should have?

@Freaky (Contributor) commented Sep 19, 2018

Right. So. This bit.

        let template_name = match self.root {
            PaginationRoot::Section(s) => {
                context.insert("section", &s);
                s.get_template_name()
            }

from components/pagination/src/lib.rs. That context.insert("section", &s) bit, specifically. That serializes the current section using serde_json. The section with thousands and thousands of pages in it.

If I comment that one line out, huge-blog builds in 39 seconds and peaks at 3.9GB instead of 24.

@Keats (Collaborator, Author) commented Sep 19, 2018

I somehow missed that line yesterday 😱
The section is only there for its metadata, so we don't need the pages in it at all. I'll add Section::clone_without_page to handle that.
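Sketch of the idea with a trimmed-down Section (the fields are illustrative):

    #[derive(Clone)]
    struct Section {
        title: String,
        pages: Vec<String>, // thousands of entries on big sites
    }

    impl Section {
        // clone just the metadata the template needs; dropping the pages
        // keeps the serde_json serialization of the context cheap
        fn clone_without_page(&self) -> Section {
            Section {
                title: self.title.clone(),
                pages: Vec::new(),
            }
        }
    }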

@Keats (Collaborator, Author) commented Sep 19, 2018

After doing that (6903975), big-blog goes from 44s to 2.9s, and huge-blog goes from crashing my computer to 43s. A good 10-15x improvement, but the odd thing is that your computer with tons of cores is only barely faster than my 5-year-old dual-core laptop.

@Keats (Collaborator, Author) commented Sep 19, 2018

Looking at https://forestry.io/blog/hugo-vs-jekyll-benchmark/ it still seems Gutenberg is about 5-10x slower than Hugo, but it is at least in the same ballpark now.

@Freaky (Contributor) commented Sep 19, 2018

There's a lot of noise, but this flame graph looks... interesting.

A lot of time seems to be going into generating backtraces.

@Keats (Collaborator, Author) commented Sep 30, 2018

I removed Paginator::pagers in the next branch, and huge-blog now builds in ~10s on my machine.
It looks like #436 isn't really needed though; even with 10k pages it doesn't make a significant difference in my runs.

@Freaky how fast is it on your beefy machine?
There are still things to fix imo, like the cloning of sections/pages (Arc most of the things, or a refactoring to use indices instead), but it seems fast enough for the first performance-oriented release, thanks to you!

@Freaky (Contributor) commented Oct 1, 2018

huge-blog vs RAYON_NUM_THREADS:

  • 1: 31.1s
  • 2: 21.7s
  • 4: 16.9s
  • 6: 15.8s
  • 8: 14.7s
  • 12: 14.4s
  • 16: 14.2s
  • 24: 14.3s

Still runs into diminishing returns long before I run out of cores, but the negative scaling seems to have mostly gone.

Peak memory use is now down to 2.7GB even with full 24-thread concurrency - still fairly high, but much better than when we started, which was more like 24GB.

@Freaky (Contributor) commented Oct 1, 2018

Still seeing negative scaling on my own site:

  • 1: 0.29s
  • 2: 0.28s
  • 4: 0.39s
  • 6: 0.52s
  • 8: 0.64s
  • 12: 0.93s
  • 16: 1.22s
  • 24: 1.36s

Might be worth grabbing other real-world examples and seeing if this is a common pattern.

@Keats (Collaborator, Author) commented Oct 1, 2018

Whoa, those are some big differences.
On the docs:

  • 1: 450ms
  • 2: 370ms
  • 3: 430ms
  • 4: 480ms

My blog shows roughly the same thing as the docs site: it is fastest at 2 threads and gets worse and worse after that.
It really looks like we still need to do some tuning with Rayon after all: your site is 5x slower and the docs 25% slower at high thread counts, and I'm guessing it gets worse the more cores you have.

@Keats (Collaborator, Author) commented Oct 1, 2018

On big-site and huge-kb, RAYON_NUM_THREADS=3 is the best for me. I don't really know what to do there.

@Freaky (Contributor) commented Oct 1, 2018

Docs:

  • 1: 0.93s
  • 2: 0.75s
  • 4: 0.74s
  • 6: 0.81s
  • 8: 0.90s
  • 12: 1.20s
  • 16: 1.49s
  • 24: 2.10s

@Freaky (Contributor) commented Oct 1, 2018

https://hur.st/flame/gutenberg-docs-rayon-24-99215.svg

syntect occurs in nearly 62% of the sampled stacks. Nearly 30% in get_highlighter().

@Freaky (Contributor) commented Oct 1, 2018

40% of the total samples are in syntect::dumps::from_binary().

Instrumenting the thread_local! with different thread counts gives me:

  • 1: 77ms
  • 2: 130ms
  • 4: 310-380ms
  • 6: 450-560ms
  • 8: 590-800ms
  • 12: 880ms-1.09s
  • 24: 1.61s-1.68s
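Which is consistent with every worker paying its own deserialization: the pattern in question is per-thread initialization, roughly like this (a sketch; the real dump path differs):

    use syntect::parsing::SyntaxSet;

    thread_local! {
        // each Rayon worker runs the full from_binary() deserialization,
        // so total startup cost grows linearly with the thread count
        static SYNTAX_SET: SyntaxSet =
            syntect::dumps::from_binary(include_bytes!("syntaxes.packdump"));
    }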

@Freaky (Contributor) commented Oct 1, 2018

39.2% in Scope::new():

    pub fn new(s: &str) -> Result<Scope, ParseScopeError> {
        // every scope parse funnels through this one global mutex,
        // which contends badly once many threads are highlighting
        let mut repo = SCOPE_REPO.lock().unwrap();
        repo.build(s.trim())
    }

@Freaky (Contributor) commented Oct 2, 2018

A quick hack using syntect master looks promising: SyntaxSet is now Send + Sync and a plain lazy_static! can be used. After rebuilding the dump files it uses, a docs build now looks much better:

  • 1: 0.91s
  • 2: 0.65s
  • 4: 0.53s
  • 6: 0.49s
  • 8: 0.48s
  • 12: 0.47s
  • 16: 0.48s
  • 24: 0.50s

My personal site is much better too. Before:

  • 1: 0.29s
  • 2: 0.29s
  • 4: 0.37s
  • 6: 0.48s
  • 8: 0.61s
  • 12: 0.91s
  • 16: 1.21s
  • 24: 1.38s

After:

  • 1: 0.27s
  • 2: 0.22s
  • 4: 0.19s
  • 6: 0.19s
  • 8: 0.19s
  • 12: 0.20s
  • 16: 0.20s
  • 24: 0.21s
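The change boils down to this (a sketch; the dump path is illustrative):

    use lazy_static::lazy_static;
    use syntect::parsing::SyntaxSet;

    lazy_static! {
        // SyntaxSet is Send + Sync on syntect master, so one instance can
        // be deserialized once and shared by every worker thread
        static ref SYNTAX_SET: SyntaxSet =
            syntect::dumps::from_binary(include_bytes!("syntaxes.packdump"));
    }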

Freaky added a commit to Freaky/zola that referenced this issue Oct 2, 2018
SyntaxSet is now Send + Sync, allowing for a single instance to be
shared across all threads.  Since initialization was serialized and
per-thread previously, this is a significant win on systems with more
than a few cores, especially on smaller sites.

The regenerated syntax packdump is lacking Elm support due to a regexp
compile error.  Additional syntax support is also not yet hooked up.

This is part of getzola#420's focus on improving performance.
@Freaky (Contributor) commented Oct 2, 2018

Updated flame graph for comparison purposes: https://hur.st/flame/gutenberg-docs-rayon-24-syntect3.svg

@Keats (Collaborator, Author) commented Oct 2, 2018

That's some great news! I have been following the work on syntect, and v3 should be released soon-ish.

@caemor (Contributor) commented Oct 2, 2018

Wow these speedups look amazing <3

@Keats (Collaborator, Author) commented Oct 3, 2018

So I tried to remove the clones by using slotmap on a WIP branch: #459 (gutenberg serve is probably broken on it if content changes).

It is... actually slower than next right now. Loading is faster, but rendering clones way too much: I can't even build huge-blog anymore. I must have messed up something pretty badly, as I would have expected the number of Value clones to be roughly identical to the current Page/Section clone and serialization...

I would appreciate some pairs of eyes on that PR to spot what I am doing wrong.

The code is still very raw and has some pretty bad parts, but it builds sites correctly and passes all the tests except the ones from the rebuild crate, which might need to change quite a bit if loading becomes very cheap.
The idea is that there is a Library that holds pages and sections, which are referred to via a Key generated by slotmap. Once all the pages/sections are populated with keys, it caches the Value for each page/section, meaning that rendering will just clone it without going through serialization again.

I believe it would be possible to remove a good chunk of the cloning by passing borrowed Values to Tera::render, but that requires checking whether it's possible first and then adding the code for it.
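The rough shape of the Library (illustrative types, not the PR's exact code):

    use slotmap::{DefaultKey, SlotMap};

    struct Page {
        title: String,
    }

    // pages live in one arena and sections refer to them by Key, so
    // populate_sections shuffles small keys instead of cloning pages
    struct Library {
        pages: SlotMap<DefaultKey, Page>,
        sections: Vec<Vec<DefaultKey>>,
    }

    impl Library {
        fn add_page(&mut self, page: Page) -> DefaultKey {
            self.pages.insert(page)
        }

        fn page(&self, key: DefaultKey) -> Option<&Page> {
            self.pages.get(key)
        }
    }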

@Freaky (Contributor) commented Oct 3, 2018

Ooh, neat. huge-blog builds here, but with nearly 2x the memory use and 2x the runtime.

Flamegraph: https://hur.st/flame/gutenberg-huge-blog-slotmap-10aba2.svg

Nearly half of the runtime is in serde.

@Keats (Collaborator, Author) commented Oct 4, 2018

Yep, I get the same results with valgrind: it's all spent cloning Values. I don't really understand why it is copying twice as much as before though, I expected it to be the same :/

@Freaky (Contributor) commented Oct 4, 2018

Well, now instead of going from {Page,Section} -> Value, it's going {Page,Section} -> Value -> Value, no?

Without Tera being able to borrow it, I don't think it's really going to help.

@Keats (Collaborator, Author) commented Oct 5, 2018

@Freaky I've pushed a commit that uses Keats/tera@efb8af8 and we're back to reasonable speed o/
An additional speedup would be to have some way of passing Vec<&Value> into Tera, but that looks very messy. There might still be some silly things in the current code though!

Next step is to clean up/document the code and rewrite the rebuild component, as calling populate_sections, for example, is now ~30x faster.

@Freaky (Contributor) commented Oct 5, 2018

Yes, that's much better: half the runtime and nearly half the memory use, though the latter is slightly higher than the previous baseline (2.7GB -> 5GB -> 2.9GB).

That's only 1.5% of the memory on my main machine, but it's probably worth considering that it's also about 75% of a lot of systems, particularly cheap VPSes.

@Keats (Collaborator, Author) commented Oct 5, 2018

> That's only 1.5% of the memory on my main machine, but it's probably worth considering that it's also about 75% of a lot of systems, particularly cheap VPSes.

To be fair, you probably don't build a 10k-page site on a VPS. Smashing Magazine moved to Hugo and only had 7500 pages: https://discourse.gohugo.io/t/smashing-magazine-s-redesign-powered-by-hugo-jamstack/5826/8

@Keats (Collaborator, Author) commented Oct 5, 2018

Turns out the caching layer is actually completely useless, and therefore so is Tera's render_with_borrowed, which I should probably remove.

Caching would only help if it were somehow possible to pass Vec<&Value> for taxonomies. It might be possible to cache the SerializedSection to avoid some looping, but I'm not sure it would make a big difference.

@Keats (Collaborator, Author) commented Oct 6, 2018

#459 is ready for reviews if anyone has time!

@Keats (Collaborator, Author) commented Oct 9, 2018

Slotmap + syntect 3 are in the next branch.


@Freaky (Contributor) commented Oct 9, 2018

Excellent. I'm seeing maybe a ~5% performance bump on huge-blog, but also a ~13% reduction in memory use.

The syntect savings are as seen in my experimental branch, which is probably the biggest thing for me: 1.4s to 0.2s is quite noticeable :)

@Keats (Collaborator, Author) commented Oct 11, 2018

Latest idea: #480

This would avoid some of the repeated serialization we're doing (and very often not even using), but it's slightly worse UX, so I'm a bit conflicted.

@Keats (Collaborator, Author) commented Nov 16, 2018

Memory usage is still way too high, but that's good enough for the 0.5.0 release tomorrow (hopefully).

@tshepang (Contributor) commented

What figures are you seeing?

@Keats (Collaborator, Author) commented Nov 16, 2018

To render a blog with 10k pages, with taxonomies + pagination + syntax highlighting: around 15s and 3GB of RAM.

@Keats closed this as completed Nov 17, 2018