Review usage of Rayon & improve performance #420
Comments
Heh. Spent a few minutes on this. Before:
After:
This call to
That's interesting :o Just to be sure though: you are not running it in debug mode? Can you push the branch with timings on? Looks like a good base to experiment on. From what I've seen with more pages (1000-10000), ~95% of the time is spent in https://github.com/Keats/gutenberg/blob/ae7a65b51f3dda4d6789483e930574437c6651e6/components/site/src/lib.rs#L850-L855 writing the pages to disk, which should be pretty fast in theory.
That is release mode, and it's generating my own site. This is on a dual hex-core Westmere Xeon, so the default thread count is 24. A quick once-over of syscall activity suggests a lot of contention and yielding with higher thread counts, but cutting it with RAYON_NUM_THREADS helps.
I'll push a test branch in a bit, got soup to make and eat first :)
I think https://docs.rs/rayon/1.0.2/rayon/iter/trait.IndexedParallelIterator.html#method.with_min_len should help to avoid wasting time parallelizing small things. Rendering 10k pages from a single section, however, should be done concurrently.
medium-kb (1000 pages):
The scaling is just pants.
Something feels wrong; surely with n threads it should be faster than a single one since they don't do any locking... Maybe I'm using rayon wrongly.
Tried a quick hack with crossbeam scope and channel and see basically the same thing, with scaling for rendering stopping around 4 threads and going negative soon after. So whatever the problem is, it doesn't seem rayon-specific.
Dear me.
A build on medium-kb spends about half its time just checking orphans by repeatedly searching a vec. Twice. I dread to think how long it would take on a huge site :/ After replacing with a HashSet:
The improvement increases with larger sites.
Heh, huge-kb goes from 150 seconds to 9 :)
Pull request in #424.
I guess this can be closed now :o
Just saw Freaky@986fda25e66a04a0b6751f305acee341bc0318c8. Do you want to do a PR with it as well?
Might be worth using dedicated thread pools for IO and rendering, rather than just throwing everything on the global rayon pool. From what I've seen the IO-bound stuff scales fairly well (at least with SSD/from cache, HDD's might disagree), while rendering bottlenecks quite quickly, at least on my machine.
PR #427 for the fold/reduce → collect tweak.
I did a few more tests and while huge-kb is now very fast to render, big-blog is still slow: 44s on my machine.
I also had a look at replacing some clone() by using
So, erm. huge-blog.
It's up and down like a yoyo, peaking at 24GB, dropping to 5GB, then peaking back at 24GB. Over and over. 44 seconds? I killed it after 6 minutes and nearly 2 hours of CPU burnt roughly equally between user and system.
What on earth.
It takes over 6 seconds just to work out what name the template should have?
Right. So. This bit:

```rust
let template_name = match self.root {
    PaginationRoot::Section(s) => {
        context.insert("section", &s);
        s.get_template_name()
    }
```

If I comment that one line out, huge-blog builds in 39 seconds and peaks at 3.9GB instead of 24.
I somehow missed that line yesterday 😱 |
After doing that (6903975),
Looking at https://forestry.io/blog/hugo-vs-jekyll-benchmark/ it still seems Gutenberg is about 5-10x slower than Hugo but it is at least in the same ballpark now. |
There's a lot of noise, but this flame graph looks... interesting. A lot of time seems to be going into generating backtraces.
I removed Paginator::pagers in the next branch. @Freaky, how fast is it to run on your beefy machine?
huge-blog vs RAYON_NUM_THREADS:
Still runs into diminishing returns long before I run out of cores, but the negative scaling seems to have mostly gone. Peak memory use is now down to 2.7GB even with full 24-thread concurrency - still fairly high, but much better than when we started, which was more like 24GB.
Still seeing negative scaling on my own site:
Might be worth grabbing other real-world examples and seeing if this is a common pattern.
Whoa, those are some big differences.
My blog shows roughly the same thing as the docs site: it is fastest at 2 threads and gets worse and worse after that.
On big-site and huge-kb, RAYON_NUM_THREADS=3 is the best for me. I don't really know what to do there.
Docs:
https://hur.st/flame/gutenberg-docs-rayon-24-99215.svg
syntect occurs in nearly 62% of the sampled stacks. Nearly 30% in
40% of the total samples are in Instrumenting the
39.2% in:

```rust
pub fn new(s: &str) -> Result<Scope, ParseScopeError> {
    let mut repo = SCOPE_REPO.lock().unwrap();
    repo.build(s.trim())
}
```
A quick hack using syntect master looks promising:
My personal site is much better too. Before:
After:
SyntaxSet is now Send + Sync, allowing for a single instance to be shared across all threads. Since initialization was serialized and per-thread previously, this is a significant win on systems with more than a few cores, especially on smaller sites. The regenerated syntax packdump is lacking Elm support due to a regexp compile error. Additional syntax support is also not yet hooked up. This is part of getzola#420's focus on improving performance.
Updated flame graph for comparison purposes: https://hur.st/flame/gutenberg-docs-rayon-24-syntect3.svg
That's some great news! I have been following the work on syntect and v3 should be released soon-ish.
Wow these speedups look amazing <3
So I tried to remove the clones. It is .... actually slower than before. I would appreciate some pairs of eyes on that PR to spot what I am doing wrong. The code is still very raw and has some pretty bad parts, but it builds sites correctly and passes all the tests except the rebuild ones. I believe it would be possible to remove a good chunk of the clones by passing borrowed values.
Ooh, neat. huge-blog builds here, but with nearly 2x the memory use and 2x the runtime. Flamegraph: https://hur.st/flame/gutenberg-huge-blog-slotmap-10aba2.svg Nearly half of the runtime's in serde.
Yep, I get the same results with valgrind: it's all spent cloning Values. I don't really understand why it is copying twice as much as before though, I expected it to be the same :/
Well, now instead of going from Without Tera being able to borrow it, I don't think it's really going to help.
@Freaky I've pushed a commit that uses this commit Keats/tera@efb8af8 and we're back to reasonable speed o/ Next step is to clean/document the code and rewrite the rebuild component, as it is now ~30x faster to call
Yes, that's much better: half the runtime, nearly half the memory use, though the latter's slightly higher than the previous baseline: 2.7GB -> 5GB -> 2.9GB. That's only 1.5% of my main machine, but it's probably worth considering that it's also about 75% of a lot of systems, particularly cheap VPSes.
To be fair, you probably don't build a 10k-page site on a VPS. Smashing Magazine moved to Hugo and only had 7500 pages: https://discourse.gohugo.io/t/smashing-magazine-s-redesign-powered-by-hugo-jamstack/5826/8
Turns out the caching layer is actually completely useless, and therefore so is Tera's caching. It would only help if it was somehow possible to pass
#459 is ready for reviews if anyone has time!
Excellent. I'm seeing maybe ~5% performance bump on huge-blog, but also a ~13% reduction in memory use. Syntect savings are as seen in my experimental branch, which is probably the biggest thing for me - 1.4s to 0.2s is quite noticeable :)
Latest idea: #480. This would solve some of the repeated serializations we're doing (and very often not even using), but it's a slightly worse UX, so I'm a bit conflicted.
Memory usage is still way too high, but that's good enough for the 0.5.0 release tomorrow (hopefully).
What figures are you seeing?
To render a blog with 10k pages with taxonomies + pagination + syntax highlighting: around 15s and 3GB of RAM.
It looks like the site loading is done in parallel but the rendering is not: threads are spawned but only one seems used.

TODOs:

- `Site::render_section`: ensure it uses as many threads as possible efficiently; might be a case of using https://docs.rs/rayon/1.0.2/rayon/iter/trait.IndexedParallelIterator.html#method.with_min_len
- There are some benches in `site/benches`, but you will need to run `gen.py` first to generate some. Given the current speed, `medium-blog` and `medium-kb` are probably the best ones to run.