For performance, consider accessing user supplied context by reference #298

patefacio · 2018-06-04T12:05:13Z

In this branch (https://github.com/patefacio/tera/tree/less_clone) I've updated the Renderer code to, where possible, not do deep copies (ie clones). It required some refactor. The strategy was the Renderer keeps a borrow of user supplied context and passes that on to an AstProcessor. This way an instance of a class is not required to have reference to another field in same struct - which I think is impossible in safe rust. I added a RefOrOwned class to treat an immutable ref and a owned type transparently. Then the actual processor can clone those RefOrOwned instances and it is a large pointer copy. I have not finished all features yet so I have more work to do, but the initial results are promising.

Looking for comments, suggestions.

Perhaps my usage is pathological; I have a large context to generate this html: (https://patefacio.github.io/plusauri.github.io/html/forecast/this_just_in_bogle_sample.plusauri.as_modeled.uai_true.html)

For my particular template the performance change is stark: (durations in milliseconds).

7 matches for "took" in buffer: plus_release_server.out
2400:Render (less_clone) for template took 67: For template (forecast) with parents ([])
4867:Render (less_clone) for template took 62: For template (forecast) with parents ([])
7332:Render (less_clone) for template took 56: For template (forecast) with parents ([])
9797:Render (less_clone) for template took 59: For template (forecast) with parents ([])
12262:Render (less_clone) for template took 65: For template (forecast) with parents ([])
14727:Render (less_clone) for template took 59: For template (forecast) with parents ([])
17192:Render (less_clone) for template took 58: For template (forecast) with parents ([])

8 matches for "took" in buffer: plus_release_server.out
2:Render (tera_ref) for template took 1909: For template (forecast) with parents ([])
70:Render (tera_ref) for template took 2105: For template (forecast) with parents ([])
137:Render (tera_ref) for template took 1999: For template (forecast) with parents ([])
204:Render (tera_ref) for template took 2003: For template (forecast) with parents ([])
271:Render (tera_ref) for template took 1931: For template (forecast) with parents ([])
338:Render (tera_ref) for template took 2062: For template (forecast) with parents ([])
405:Render (tera_ref) for template took 2073: For template (forecast) with parents ([])
472:Render (tera_ref) for template took 2034: For template (forecast) with parents ([])

Keats · 2018-06-06T08:33:22Z

Can you make a PR for the benches you've added (although I would keep the current huge_loop one as is since it was an example from handlebars and create another one for loop nesting)?

That's a pretty huge rewrite of the renderer though and it would need to pass all the tests (currently 24 failing) before I take a good look at it though. Is rewriting the whole renderer necessary or is mostly related to for loops?

patefacio · 2018-06-06T12:09:05Z

Maybe a more skillful simple replacement could be done. I tried first to simply replace Value with &'a Value and just could not make it work. I think RefOrOwned concept, which may exist elsewhere in rust as standard, is required to stop deep cloning and that required an extra class. That is, the work could not be done in Renderer::render because you need that main function to hold ref to user context to ensure lifetime is long enough struct doing for real work, which I called AstProcessor.

I understand your reservations and have them also. I'll be glad to try to get those tests working - I just wanted some thoughts before continuing. I will also make a PR for the benches leaving original huge_loop.

patefacio · 2018-06-14T17:17:42Z

I've got almost all the tests working. I believe the only ones not working are related to error processing where the tests look for specific error text. For some, I updated the required text to match slightly new output. I imagine the text of errors is not considered part of the api? Anyway, I never really acclimated to rust error_chain; Honestly I find it difficult to use/understand.

I believe the updates will be faster for many/most renderings that involve any significant context data. For very small contexts in tight loops it might be worse. Here are new benchmarks vs current as run on my macbook. For my specific use case it takes from 2,000ms down to under 100ms - but I may be doing something atypical in my templates. @Keats - Let me know what you think:

less_clone

running 2 tests
test bench_big_loop_big_object ... bench:     375,215 ns/iter (+/- 67,384)
test bench_macro_big_object    ... bench:     704,519 ns/iter (+/- 64,984)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/tera-6527d5d84c556326

running 13 tests
test access_deep_object                        ... bench:       6,721 ns/iter (+/- 1,537)
test access_deep_object_with_literal           ... bench:       7,856 ns/iter (+/- 550)
test bench_build_inheritance_chains            ... bench:       3,185 ns/iter (+/- 225)
test bench_escape_html                         ... bench:          98 ns/iter (+/- 5)
test bench_huge_loop                           ... bench:     769,062 ns/iter (+/- 90,359)
test bench_parsing_basic_template              ... bench:      22,946 ns/iter (+/- 3,289)
test bench_parsing_with_inheritance_and_macros ... bench:      44,286 ns/iter (+/- 4,726)
test bench_rendering_basic_template            ... bench:       7,290 ns/iter (+/- 606)
test bench_rendering_inheritance_and_macros    ... bench:       8,220 ns/iter (+/- 489)
test bench_rendering_only_inheritance          ... bench:       3,074 ns/iter (+/- 617)
test bench_rendering_only_macro_call           ... bench:       1,127 ns/iter (+/- 103)
test bench_rendering_only_parent               ... bench:       1,932 ns/iter (+/- 182)
test bench_rendering_only_variable             ... bench:       2,248 ns/iter (+/- 136)

Original

running 2 tests
test bench_big_loop_big_object ... bench:   3,027,435 ns/iter (+/- 1,185,983)
test bench_macro_big_object    ... bench:   1,817,381 ns/iter (+/- 123,192)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running /Users/dbdavidson/dev/open_source/tera_bench/target/release/deps/tera-dd530086e3e4cc51

running 13 tests
test access_deep_object                        ... bench:       8,280 ns/iter (+/- 852)
test access_deep_object_with_literal           ... bench:      11,206 ns/iter (+/- 1,858)
test bench_build_inheritance_chains            ... bench:       3,355 ns/iter (+/- 448)
test bench_escape_html                         ... bench:          92 ns/iter (+/- 5)
test bench_huge_loop                           ... bench:   2,265,449 ns/iter (+/- 700,556)
test bench_parsing_basic_template              ... bench:      21,018 ns/iter (+/- 3,146)
test bench_parsing_with_inheritance_and_macros ... bench:      44,102 ns/iter (+/- 3,045)
test bench_rendering_basic_template            ... bench:       6,980 ns/iter (+/- 894)
test bench_rendering_inheritance_and_macros    ... bench:       9,637 ns/iter (+/- 2,183)
test bench_rendering_only_inheritance          ... bench:       3,022 ns/iter (+/- 287)
test bench_rendering_only_macro_call           ... bench:       1,098 ns/iter (+/- 88)
test bench_rendering_only_parent               ... bench:       1,699 ns/iter (+/- 208)
test bench_rendering_only_variable             ... bench:       1,835 ns/iter (+/- 113)

test result: ok. 0 passed; 0 failed; 0 ignored; 13 measured; 0 filtered out

Keats · 2018-06-15T09:36:17Z

It's interesting that looking at the benchmarks it only really affects the loop so that's really the bottleneck, nothing else. I don't think I will have time this week but I should be able to try having a look on my end next week.

I imagine the text of errors is not considered part of the api?

It is not.

I'll probably switch from error-chain to basic errors but that's a breaking change afaik so that will have to wait for a major version. Also waiting to see where failure is going before doing any error refactoring.

patefacio · 2018-06-17T00:35:08Z

I think it is more than the loop. Maybe an example of "correlation is not causation"? I believe the issue really is the combination of (1) storing by value and (2) using Value throughout as return on all eval_.... Because of that macro invocations, set assignments and loops have no recourse but to clone. For example, I've added two new benchmarks in tera_less_clone () with no loops. One with a set ... statement, which a client could assume is like a reference but is really a deep copy and another with a macro invocation.

Also, originally I had assumed it was Rocket that was the culprit. The reason is I was timing my rocket route handler and it was fast (millis) yet it took two seconds to make it to browser - so I assumed rocket middleware on transit. It turns out that the request to render in rocket really only deferred the call to tera rendering until after return of the route. Anyway, I was discussing some with Rocket community and in the process I created this gist: https://gist.github.com/patefacio/d861650e7297b4d27c6e0c591a43cb12

If you look at the last file you'll see an image in the call stack and when I rummaged through it a very large percentage of time was in clone. That is why I think it is more than looping. Of course, with looping it will show big because it does clone there as well.

tera_less_clone

running 2 tests
test bench_macro_big_object_no_loop_macro_call ... bench:   6,393,256 ns/iter (+/- 563,981)
test bench_macro_big_object_no_loop_with_set   ... bench:   6,702,669 ns/iter (+/- 626,915)

original

running 2 tests
test bench_macro_big_object_no_loop_macro_call ... bench:  13,463,490 ns/iter (+/- 1,026,651)
test bench_macro_big_object_no_loop_with_set   ... bench:  13,553,011 ns/iter (+/- 1,198,475)

Keats · 2018-06-25T18:39:56Z

Let me know when everything is working and i'll have a look, that does look promising

patefacio · 2018-08-17T15:15:39Z

Sorry for long delay - vacation and stuff. All tests pass and although some tests I changed slightly, I believe they capture the same intent. I look forward to comments/suggestions.

Keats · 2018-08-21T14:22:31Z

no worries.
Can you open a PR for easy review?

Keats · 2018-09-02T20:07:05Z

The code has been released with 0.11.13, thanks a lot!

Keats closed this as completed Sep 2, 2018

epage mentioned this issue Feb 27, 2021

Is the "huge loop" benchmark measuring what it intends? #606

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For performance, consider accessing user supplied context by reference #298

For performance, consider accessing user supplied context by reference #298

patefacio commented Jun 4, 2018 •

edited

Loading

Keats commented Jun 6, 2018

patefacio commented Jun 6, 2018

patefacio commented Jun 14, 2018

Keats commented Jun 15, 2018

patefacio commented Jun 17, 2018

Keats commented Jun 25, 2018

patefacio commented Aug 17, 2018

Keats commented Aug 21, 2018

Keats commented Sep 2, 2018

For performance, consider accessing user supplied context by reference #298

For performance, consider accessing user supplied context by reference #298

Comments

patefacio commented Jun 4, 2018 • edited Loading

Keats commented Jun 6, 2018

patefacio commented Jun 6, 2018

patefacio commented Jun 14, 2018

Keats commented Jun 15, 2018

patefacio commented Jun 17, 2018

Keats commented Jun 25, 2018

patefacio commented Aug 17, 2018

Keats commented Aug 21, 2018

Keats commented Sep 2, 2018

patefacio commented Jun 4, 2018 •

edited

Loading