Streamable templates #211

Open
Riamse opened this Issue Aug 8, 2017 · 40 comments

Riamse commented Aug 8, 2017

Currently, Tera::render returns a string, which comes from Renderer::render, which builds the string by rendering each component and then using String::push_str.

Unfortunately, for extremely large templates this uses a lot of memory and can take a lot of time, and the biggest problem is that rendering is a blocking operation. It might be smarter to have a different render method that returns a struct implementing std::io::Read, so the output can be streamed rather than storing the entire response on the heap before sending it over to the server.

(I'm obviously talking about using tera in conjunction with a web application framework as opposed to tera alone, but let's be honest, that's 99% of tera's use cases)


Riamse commented Aug 10, 2017

I think a potential solution would be to map |node| { &self.render_node(node).chain_err(|| self.get_error_location())? } over the ast and then have some other struct that implements the std::io::Read trait for an arbitrary iterator of strings or characters.
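Such an adapter over an arbitrary iterator of strings can be sketched in plain Rust; the `IterRead` name and structure below are hypothetical, not part of Tera:

```rust
use std::io::Read;

/// Adapts any iterator of strings into a `std::io::Read` source,
/// handing bytes to the caller one chunk at a time.
struct IterRead<I: Iterator<Item = String>> {
    iter: I,
    buf: Vec<u8>, // bytes of the current chunk not yet consumed
    pos: usize,   // read position inside `buf`
}

impl<I: Iterator<Item = String>> IterRead<I> {
    fn new(iter: I) -> Self {
        IterRead { iter, buf: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = String>> Read for IterRead<I> {
    fn read(&mut self, out: &mut [u8]) -> std::io::Result<usize> {
        // Refill the internal buffer from the iterator when it runs dry.
        while self.pos == self.buf.len() {
            match self.iter.next() {
                Some(s) => {
                    self.buf = s.into_bytes();
                    self.pos = 0;
                }
                None => return Ok(0), // iterator exhausted -> EOF
            }
        }
        let n = (self.buf.len() - self.pos).min(out.len());
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() {
    let chunks = vec!["<div>".to_string(), "hi".to_string(), "</div>".to_string()];
    let mut reader = IterRead::new(chunks.into_iter());
    let mut s = String::new();
    reader.read_to_string(&mut s).unwrap();
    assert_eq!(s, "<div>hi</div>");
}
```

A web framework could then hand the `IterRead` to its response body machinery instead of a fully materialized `String`.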


Keats commented Aug 11, 2017

(I'm obviously talking about using tera in conjunction with a web application framework as opposed to tera alone, but let's be honest, that's 99% of tera's use cases)

Ah I'm not actually using Tera in a webserver :p

How big are the templates using lots of memory? I would consider that a bug if it's too memory hungry/slow. Can you add such a template to the benchmarks folder?


Riamse commented Aug 11, 2017

It's not so much the templates themselves being too big, but the data fed into the template's context. My specific use case just looks like {% for x in stuff %}<div>{{x}} example idk</div>{% endfor %}, where stuff is just a huge iterator. Memory spikes as high as gigabytes when using the current rendering algorithm, even though the final output is about half the size of the spike. And it's a blocking operation too; the majority (>90%) of the time that the request takes is spent on rendering, not processing the stuff. It's pretty scary.

I haven't investigated this closely, but I'm unsure why there would be any problem rendering the nodes once the AST is built (i.e. my template is syntactically well-formed) and the values from the context are semantically compatible with my template (e.g. if I said {% for a, b in X %}, then Tera already knows that X is a hashmap or whatever). In other words, I'm not sure why there's a ? in &self.render_node(node).chain_err(|| self.get_error_location())?. I think there may be a way to lazily evaluate each node, almost as if we were in-order-traversing a tree, returning one rendered portion at a time in the style of an iterator. Does this sound possible?
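A minimal sketch of that lazy in-order traversal, assuming a toy Node type rather than Tera's real AST:

```rust
/// A toy AST: either renderable text or a node with children.
enum Node {
    Text(String),
    Seq(Vec<Node>),
}

/// Lazily walks the tree with an explicit stack, yielding one rendered
/// leaf at a time instead of building the whole output string up front.
struct Leaves {
    stack: Vec<Node>,
}

impl Iterator for Leaves {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        while let Some(node) = self.stack.pop() {
            match node {
                Node::Text(s) => return Some(s),
                Node::Seq(children) => {
                    // Push children in reverse so the leftmost is popped first.
                    for c in children.into_iter().rev() {
                        self.stack.push(c);
                    }
                }
            }
        }
        None
    }
}

fn leaves(root: Node) -> Leaves {
    Leaves { stack: vec![root] }
}

fn main() {
    let ast = Node::Seq(vec![
        Node::Text("a".into()),
        Node::Seq(vec![Node::Text("b".into()), Node::Text("c".into())]),
        Node::Text("d".into()),
    ]);
    let out: Vec<String> = leaves(ast).collect();
    assert_eq!(out, vec!["a", "b", "c", "d"]);
}
```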


mexus commented Aug 11, 2017

I beg your pardon, Riamse, but what are you trying to achieve by "streaming" the stuff?
I mean, in the end you will have to post everything to the webpage and wait until it's rendered.
Everything's done within a single thread, so how do you expect to gain any performance with this move? How will lazy evaluation make things run faster? Maybe I'm missing something.

Also, I would suggest looking into dynamic loading, i.e. you render a page with no data and then use some asynchronous mechanism to fetch the data on demand.

Another thing I can tell you is that serialization is not an instant process either (and you want your data to be serialized in order to be rendered), so load-on-demand will help here as well. It's a more complex design to implement, but a more scalable one.


Riamse commented Aug 14, 2017

With all due respect, @mexus , I think you may have misunderstood the issue. I'm not concerned about speed of rendering, per se, and dynamic loading will not solve the problem.

My problem is that the tera templating engine stores the entire evaluated template as a string on the heap, and then returns the string. This can potentially use a lot of RAM. I propose, in effect, that when you call Renderer::render, instead of the method evaluating every single node and putting it all into a string, the method returns a T: Iterator<Item=String>, and doesn't bother to render a single node. Rather, each item in the iterator is one node, rendered only when needed.

Now, I don't know if you have any experience with Python, but here's an analogy to illustrate. What Tera does right now is:

def render(self, *args_idk):
    output = []
    for node in self.ast:
        output.append(node.render())
    return "".join(output)

The feature I propose is roughly equivalent to this:

def render(self, *more_args):
    for node in self.ast:
        yield node.render()

The second one only renders nodes as needed, while the former gives me all of the nodes at once. If the web server transmits data to the client like so:

server = the_server
client = my_laptop
for piece in tera.render():
    server.send(client, piece)  # send piece to the specified client

with the current implementation, at any given moment, every single node will be stored in RAM at the same time. With my proposed implementation, at any given moment, only one node will be stored in RAM.

Here is a concrete example to illustrate what I'm saying, because what I propose is similar to the strategy used when serving static files. Ubuntu hosts a 1.5-gigabyte .iso of the operating system on their own servers. I am 100% sure that they don't transmit it to my computer by loading all 1.5 GB into RAM at once and then iterating over it. What they do, I am sure, is:

  1. x=1
  2. read byte x and store it in RAM
  3. send byte x to the client
  4. delete byte x from RAM
  5. x++
  6. repeat steps 2-5 as needed

Thus, at any given time there is only one small portion of the file in RAM throughout the lifetime of the program. In fact, there's even an HTTP header just for this purpose to make it easier.
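The fixed-buffer loop described above looks like this in Rust (essentially what std::io::copy does), with in-memory buffers standing in for the file and the client connection:

```rust
use std::io::{Read, Write};

/// Copies `src` to `dst` through a small fixed buffer, so peak memory
/// stays at `buf.len()` bytes no matter how large the source is.
fn stream_copy<R: Read, W: Write>(src: &mut R, dst: &mut W) -> std::io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() {
    // Simulate a large source and a client connection with in-memory buffers.
    let data = vec![b'x'; 100_000];
    let mut src = &data[..];
    let mut dst: Vec<u8> = Vec::new();
    let sent = stream_copy(&mut src, &mut dst).unwrap();
    assert_eq!(sent, 100_000);
    assert_eq!(dst, data);
}
```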

Forgive me if this sounds patronising; I want to ensure my communications are 100% understood. Is my explanation to your satisfaction?


Keats commented Aug 17, 2017

@Riamse I would be open to adding that in Tera but as another method than render, something like render_chunks or whatever.


Riamse commented Aug 18, 2017

That's fair.

Looking at the code, it seems difficult to implement using the current structure, due to how render_node, render_for, render_if, etc. all recursively call each other. I think what we need is a way to flatten the AST into an iterator that lazily evaluates each node to give us only leaf Nodes (Text, Int, Float, Bool, Raw, and the like; these are the easy-to-render ones), which should solve the problem at its core. All that's left is mapping |node| -> String { node.render() } over it and then optionally wrapping iter-read around it for the std::io::Read trait.


Keats commented Aug 25, 2017

I'm planning to rewrite all the parsing with pest 1.0 for the next version soon, but I'm not entirely sure what a flat AST would look like for the rendering.
I'll go through #206 first + rewrite the lexing/parsing, and then we can see about flattening it.


Riamse commented Aug 26, 2017

Great, sounds like a plan


Keats commented Sep 24, 2017

@Riamse if you want to have a look, the next version is in #218
It's not 100% done (still missing error reporting and docs update mostly) but the parser/renderer are done


Riamse commented Sep 24, 2017

I took a look but I'm not sure how it works. Could you give me a quick rundown of what happened, how to use it, differences, etc.?


Keats commented Sep 25, 2017

Sorry I should have explained more:

  • parser: has been rewritten with pest 1.0, and I've added whitespace control like Liquid's. Each node is now a struct, so it's easier to pass around than anonymous enum members

  • renderer: has been cleaned up and moved into the renderer folder; that's probably where you'll want to tinker. The code is roughly the same as the current version, just a bit neater.

Other than that, the public API of Tera doesn't change; you can use it as before. I might make a few more changes, but nothing dramatic.
The changelog has a bit more detail: https://github.com/Keats/tera/pull/218/files#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR3


Riamse commented Oct 9, 2017

Okay, I looked at it and I think I have a better idea of what to do. I think what this issue is really about is how for loops are rendered. Currently render_for just evaluates the whole loop (which involves recursion - this is the biggest issue) and returns a giant string.

This is a problem and I think I might know how to fix it, but I need to understand how render_for works, how it relates to self.for_loops and self.macro_context, etc.


Keats commented Oct 9, 2017

So the idea of render_for is to create a new context based on the context of whatever is currently being rendered + the loop variable + the forloop-specific variables.
Each time it encounters a forloop, it creates an object (ForLoop) that stores all the data for the loop + any extra values only present in the forloop body, such as a {% set %} in it.
To know which context to use when resolving variables, we need to keep track of whether the forloop is in a macro (self.macro_context) or in the template being rendered (self.for_loops). This is used in lookup_ident when needed.
I don't think there is anything more to render_for, but I'm not sure where you see the recursion in there?


Riamse commented Nov 21, 2017

Got it.

I need to make the AST mutable. Specifically, in a line like this:

        let ast = match self.template.parents.last() {
            // this unwrap is safe; Tera will have errored already if the template doesn't exist
            Some(parent_tpl_name) => &self.tera.get_template(parent_tpl_name).unwrap().ast,
            None => &self.template.ast,
        };
        ast.push(Node::Text("hello".to_owned()));  // what I want to be able to do

I believe I have a solution to this problem but it requires manipulating the AST while rendering occurs.
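One way to avoid fighting the borrow checker over a shared &Vec<Node> is for the renderer to own its work list, cloning the top-level nodes into a VecDeque it is free to mutate. A toy sketch (this Node type and the queue-based render are illustrative, not Tera's actual types):

```rust
use std::collections::VecDeque;

#[derive(Clone)]
enum Node {
    Text(String),
    If(bool, Vec<Node>), // condition pre-evaluated here for brevity
}

/// Renders by consuming an owned work queue instead of mutating the
/// shared AST: expanding a node just pushes its body onto the front.
fn render(ast: &[Node]) -> String {
    let mut queue: VecDeque<Node> = ast.iter().cloned().collect();
    let mut out = String::new();
    while let Some(node) = queue.pop_front() {
        match node {
            Node::Text(s) => out.push_str(&s),
            Node::If(cond, body) => {
                if cond {
                    // Prepend the body so it renders before later siblings.
                    for n in body.into_iter().rev() {
                        queue.push_front(n);
                    }
                }
            }
        }
    }
    out
}

fn main() {
    let ast = vec![
        Node::If(true, vec![Node::Text("hello".into())]),
        Node::Text(" world".into()),
    ];
    assert_eq!(render(&ast), "hello world");
}
```

The cost is cloning the nodes that get expanded, which a real implementation would want to minimize.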


Riamse commented Nov 21, 2017

I mean to say that I can't make the AST mutable right now - trying to make the match block return &mut self.tera.get_template(parent_tpl_name).unwrap().ast doesn't work, and leads me down a rabbit hole of making lots of things mutable, which is painful.


Riamse commented Nov 21, 2017

I'm doing a proof of concept


Riamse commented Nov 21, 2017

Also, I think the way that for loops work is kind of bad. Instead of keeping track of variables in the for loops themselves, I think it'd make more sense to have a Context that can store variables in scopes: something like struct Context { data: VecDeque<BTreeMap<String, Value>> }, where data[0] is the global scope and data[n] holds the variables declared at n levels of nesting. Entering a for loop appends a new BTreeMap, and exiting a for loop pops a BTreeMap off the end of the list. And if you want to look up a variable hello, you check data[n], then data[n-1], then data[n-2], etc.

Am I making sense? (https://github.com/Riamse/chom-c/blob/master/grammar.py#L83 and https://github.com/Riamse/chom-c/blob/master/c.py are examples of what I'm talking about)
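A sketch of that scoped lookup, with Value simplified to i64 and a Vec used for the scope stack (pushing and popping at the back makes Vec and VecDeque equivalent here):

```rust
use std::collections::BTreeMap;

/// A context whose variables live in a stack of scopes: index 0 is the
/// global scope, the last entry is the innermost loop body.
struct Context {
    data: Vec<BTreeMap<String, i64>>,
}

impl Context {
    fn new() -> Self {
        Context { data: vec![BTreeMap::new()] }
    }
    fn push_scope(&mut self) {
        self.data.push(BTreeMap::new()); // entering a for loop
    }
    fn pop_scope(&mut self) {
        self.data.pop(); // leaving the for loop
    }
    fn set(&mut self, name: &str, value: i64) {
        self.data.last_mut().unwrap().insert(name.to_string(), value);
    }
    /// Walks scopes innermost-first, so loop variables shadow globals.
    fn lookup(&self, name: &str) -> Option<i64> {
        self.data.iter().rev().find_map(|scope| scope.get(name).copied())
    }
}

fn main() {
    let mut ctx = Context::new();
    ctx.set("hello", 1);
    ctx.push_scope();
    ctx.set("hello", 2); // shadows the global
    assert_eq!(ctx.lookup("hello"), Some(2));
    ctx.pop_scope();
    assert_eq!(ctx.lookup("hello"), Some(1));
    assert_eq!(ctx.lookup("missing"), None);
}
```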


Riamse commented Nov 21, 2017

To know which context to use when using variables, we need to keep track on whether the forloop is in a macro (self.macro_context) or the template being rendered (self.for_loops).

Oh. Forget I said anything then.


Riamse commented Nov 23, 2017

but I'm not sure where you see the recursion in there?

basically, render_for calls render_body, which might call render_for, etc etc. that's what I meant by recursion; but fear not, I have an idea in the works. Watch this space.

Riamse added a commit to Riamse/tera that referenced this issue Nov 24, 2017

Semi-fix to issue Keats#211 upstream
Basically, the current model of doing things is done recursively.
Rendering a node recursively renders the node's children, the childrens'
children, and so on and so on. This is bad because that means you get
the whole result all at once, which is stored into RAM.

Our solution involves rendering it piece-by-piece, starting from the
left. We do this by moving out of the recursive model and rendering
nodes as we encounter them; basically, we only ever interact with the
leftmost child of the root node. Our AST is a VecDeque of nodes that
contain nodes, i.e. the root node is an array, but the contents of that
array are pointers containing pointers.

So our goal is to lazily turn our AST into a giant list of immediately
renderable "leaves" - so basically raw text. We do this by lazily
evaluating nodes, one step at a time; the semi-evaluated result is
prepended to the beginning of the AST, like a stack - this is the key
insight.

So basically, we're only ever interacting with ast[0]. If ast[0] is an
if-else block `if cond { x } else { y }`, delete the if-else, evaluate
`cond` and prepend x to the AST if cond, otherwise prepend y to the
AST. If ast[0] is a for-loop `for i in 0..3 { i + 1 }`, it's a bit
trickier. First, we delete the for-loop. Then we prepend a new node,
for-loop' (pronounced "for loop prime"), which indicates how far we are
in the for loop, to the AST. In this case, it's for-loop'(2, {i + 1}),
so we have 2 iterations left and the body of the for loop is `i+1`.
Next, we prepend the body of the loop to the AST. (Now it should be
clear why for-loop' indicates there's 2 iterations remaining - we are
preloading the first iteration.) Then we tell Tera that we've just
initiated a for-loop. At this point, ast[0] is now pointing to the loop
body, so all the variables should be set. When we consume the loop body,
we'll hit the for-loop'(n=2, {i + 1}). Then we'll do a similar process
to what was previously described: delete for-loop' and prepend a new
for-loop''(n-1, {i + 1}), and prepend the loop body again. When n=0 we
just delete the for-loop''''' and continue on to the next node.

That's basically the main principle we follow for doing the rest of the
nodes, where possible. Some of them haven't actually been implemented
yet though so for those we use the boring call-functions-recursively
technique. But hey it's something.
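The loop-marker trick described above can be boiled down to a few dozen lines; the Node variants and ForPrime name here are illustrative, not the branch's actual types:

```rust
use std::collections::VecDeque;

#[derive(Clone)]
enum Node {
    Text(String),
    For(u32, Vec<Node>),      // a for-loop: iteration count + body
    ForPrime(u32, Vec<Node>), // "for-loop'": iterations remaining + body
}

/// One evaluation step: rewrite the front of the AST in place.
/// Loops never recurse; they leave a ForPrime marker behind the body.
fn step(ast: &mut VecDeque<Node>, out: &mut String) -> bool {
    match ast.pop_front() {
        None => false,
        Some(Node::Text(s)) => { out.push_str(&s); true }
        Some(Node::For(n, body)) if n > 0 => {
            // Preload the first iteration: marker goes in first, then the
            // body in front of it, so the body is consumed before the marker.
            ast.push_front(Node::ForPrime(n - 1, body.clone()));
            for node in body.into_iter().rev() {
                ast.push_front(node);
            }
            true
        }
        Some(Node::For(_, _)) => true, // zero iterations: just drop it
        Some(Node::ForPrime(n, body)) if n > 0 => {
            ast.push_front(Node::ForPrime(n - 1, body.clone()));
            for node in body.into_iter().rev() {
                ast.push_front(node);
            }
            true
        }
        Some(Node::ForPrime(_, _)) => true, // n == 0: loop finished
    }
}

fn main() {
    let mut ast: VecDeque<Node> = VecDeque::new();
    ast.push_back(Node::For(3, vec![Node::Text("x".into())]));
    ast.push_back(Node::Text("!".into()));
    let mut out = String::new();
    while step(&mut ast, &mut out) {}
    assert_eq!(out, "xxx!");
}
```

Because each step produces at most one piece of output, the caller can drain `out` between steps instead of accumulating the whole string.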

Riamse commented Nov 24, 2017

@Keats please look at the commit I just pushed - it passes almost all the tests and fixes the thing, and I am proud.


Riamse commented Nov 24, 2017

renderer::tests::inheritance::render_super_in_top_block_errors does in fact fail, but that's because of my decision to use Option<T> instead of Result<T, E> resulting in ambiguity: when it fails, it returns None, but None is also used for no-ops, so step_once() can't distinguish between the two. This is totally fixable though.

What do you think?


Riamse commented Nov 24, 2017

Riamse@4be7b1a this one too


Keats commented Nov 24, 2017

I'll have a look at the code tomorrow - any benchmarks for memory usage between the render and iter_render?


Keats commented Nov 24, 2017

And for renderer::tests::inheritance::render_super_in_top_block_errors I would still want it to fail; otherwise it's hard to provide a good error message there.

I'm also curious if you can run the current benchmark to see if there's any speed difference too.


Riamse commented Nov 25, 2017

I ran the benchmarks: https://pastebin.com/W6E2NApQ for the current method of rendering and https://pastebin.com/sEvEQxRH for iter_render. I am using an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz.

I'm not sure how to measure memory usage. But basically, if your goal is to print a rendered template and not store it anywhere, iter_render should win by a landslide, because you can print the output as a series of small chunks, as opposed to moving the whole thing into memory and printing it at once.


Keats commented Nov 26, 2017

I asked on Twitter (https://twitter.com/20100Prouillet/status/934705940758253568) about measuring the memory usage.
I have to say I'm not entirely convinced though: it is significantly slower, duplicates a good chunk of the renderer, and adds some not-really-existing nodes to the AST. I'll wait to see if someone replies on how to measure the benchmark's memory so we have all the numbers to make a decision.


Riamse commented Nov 27, 2017

I'm not sure why it's slower. The only overhead should be the cost of allocating a VecDeque and skipping over no-ops.

It also seems there's a small improvement from using a Vec instead.


Riamse commented Nov 27, 2017

Ah, one more bit of overhead - cloning body nodes (e.g. ForloopPrime). I'm not sure if there's a way around that other than reference counting.

I take minor issue with "not-really existing nodes" - they do, in fact, exist already, represented by the nesting that comes from nodes having bodies. By flattening the nested bodies piece by piece, we still need to preserve that relation. Think of X' nodes as various kinds of closing brackets.


seanlinsley commented Dec 9, 2017

Even if template streaming is marginally slower, from my perspective it's worth the cost, because web application response times are most commonly slowed down by expensive / poorly optimized database queries, not by the web application itself misusing CPU or memory.

While template streaming doesn't lower the amount of time it takes for the server to render its response, it does allow the browser to eagerly load render-critical resources, and possibly render the navigation UI. Especially on mobile, having to wait for render-critical assets to load after the HTML can easily add 5 seconds to the total render time (especially if the page requires JS).


Keats commented Dec 11, 2017

That's not the same kind of streaming referenced in this issue, though. The PoC in this issue basically makes the loops non-recursive to save some memory, but it's not streaming: you still end up with a String.
What I believe you want, @seanlinsley, will need generators so you can implement something similar to https://github.com/pallets/jinja/blob/master/jinja2/environment.py#L1191


Riamse commented Dec 11, 2017

No, it is the same kind of streaming. The PoC does in fact return a String, because that's what the tests expect as output. If you really wanted, you could duct-tape together something that returns an Iterator<Item=String> from the Renderer's AST and step_once() (step_once is where the real magic happens).


Keats commented Dec 12, 2017

That is not specific to your branch, though; you could apply this kind of streaming to the current render. The difference would be that an entire forloop would result in a single value instead of one per iteration.


Riamse commented Dec 14, 2017

The current render wouldn't be able to do this for more than one level. Suppose that my template is the following block:

{% if 1==1 %}
    {% if 1 == 1 %}
        hello
    {% endif %}
    {% if 1 == 1 %}
        world
    {% endif %}
{% endif %}

The AST looks like this: [If(condition: 1==1, body: [If(condition: 1==1, body: ["hello"]), If(condition: 1==1, body: ["world"])])]

I can't figure out a way to stream ["hello", "world"] without introducing generators into the current implementation.
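By contrast, a hand-rolled state machine like the PoC's step_once() - an explicit stack inside an Iterator impl - can stream them. A toy sketch, with conditions pre-evaluated to booleans for brevity:

```rust
enum Node {
    Text(String),
    If(bool, Vec<Node>), // condition pre-evaluated for brevity
}

/// An explicit state machine standing in for a generator: each `next`
/// unfolds nodes until it can yield one rendered piece.
struct Stream {
    stack: Vec<Node>,
}

impl Iterator for Stream {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        while let Some(node) = self.stack.pop() {
            match node {
                Node::Text(s) => return Some(s),
                Node::If(cond, body) => {
                    if cond {
                        // Push in reverse so the leftmost child pops first.
                        for n in body.into_iter().rev() {
                            self.stack.push(n);
                        }
                    }
                }
            }
        }
        None
    }
}

fn main() {
    // [If(1==1, [If(1==1, ["hello"]), If(1==1, ["world"])])]
    let ast = Node::If(true, vec![
        Node::If(true, vec![Node::Text("hello".into())]),
        Node::If(true, vec![Node::Text("world".into())]),
    ]);
    let pieces: Vec<String> = Stream { stack: vec![ast] }.collect();
    assert_eq!(pieces, vec!["hello", "world"]);
}
```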


Keats commented Sep 3, 2018

The latest release of Tera is much more memory efficient: it's still not streaming, but it shouldn't allocate the memory for the string now - no more cloning around.


Riamse commented Sep 19, 2018

I took a quick look at the code - the rendering work has been moved to https://github.com/Keats/tera/blob/master/src/renderer/processor.rs, it seems? But I'm not entirely sure what happened, because it looks to me like it's still allocating strings at each nested level of rendering, except now we copy them into the output = String::with_capacity(10000) and then destroy them? Could you explain a bit what happened?


Keats commented Sep 19, 2018

My last message was confusing even for me now, sorry about that.

So it doesn't do anything to help with the string allocations; the main change was using references to the original context instead of cloning lots of things. For example, in previous versions a forloop would clone all the values used, and potentially clone some parts again later, while now it uses Cow. In practice that means allocations only happen on filter/set calls now, so it is quite a bit faster and memory usage is lower.
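Roughly, the pattern is this (a toy lookup over a list of pairs, not the actual lookup_ident signature):

```rust
use std::borrow::Cow;

/// Looks up a value, borrowing from the context when possible and
/// allocating only when a transformation (like a filter) is applied.
fn lookup<'a>(ctx: &'a [(String, String)], name: &str, upper: bool) -> Option<Cow<'a, str>> {
    let (_, v) = ctx.iter().find(|(k, _)| k == name)?;
    if upper {
        Some(Cow::Owned(v.to_uppercase())) // filter call: must allocate
    } else {
        Some(Cow::Borrowed(v.as_str()))    // plain variable: zero-copy
    }
}

fn main() {
    let ctx = vec![("name".to_string(), "tera".to_string())];
    let plain = lookup(&ctx, "name", false).unwrap();
    assert!(matches!(plain, Cow::Borrowed(_)));
    assert_eq!(plain, "tera");
    let filtered = lookup(&ctx, "name", true).unwrap();
    assert!(matches!(filtered, Cow::Owned(_)));
    assert_eq!(filtered, "TERA");
}
```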


Riamse commented Sep 23, 2018

Can we look into adding template streaming to the new renderer architecture?


Keats commented Sep 24, 2018

That would depend on #340 and whether it requires duplicating the codebase or not
