Add support for whitespace control character allowing for ERB-like syntax #743
evulse wants to merge 8 commits into Shopify:master from evulse:whitespace-dash-feature
Conversation
```ruby
    source = source.gsub(white_space[0], white_space[1])
  end
end
```
For sake of readability, your block parameter might be better as:

```ruby
WhiteSpaceTrimTags.each do |(regexp, string)|
  source = source.gsub(regexp, string)
end
```
Hmm, not sure why Travis CI didn't run your tests here.
```ruby
if source =~ WhiteSpaceTrimCheck
  WhiteSpaceTrimTags.each do |(regexp, string)|
    source = source.gsub(regexp, string)
```
Hmm. I'm a bit concerned that this is essentially doing 4 passes (once for each regexp) over the entire source. I can't imagine that not being a considerable performance hit.
Like I said in the other ticket, we really need the whitespace work to be included in the existing parser tokenizer. Liquid is used in many super high performance situations; a 1% slowdown can easily lead to hundreds of thousands of dollars in hardware costs for Shopify alone. Performance matters a lot, and multiple passes are the death of performance.
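To make the multi-pass concern concrete, here is a rough benchmark sketch. The patterns, replacement logic, and sample source are stand-ins I made up for illustration; they only approximate the shape of the PR's `WhiteSpaceTrimTags`, not its actual definition.

```ruby
require 'benchmark'

# Stand-in trim patterns, roughly in the shape of the PR's WhiteSpaceTrimTags.
PATTERNS = [
  [/\s*\{\{-/, '{{'],
  [/-\}\}\s*/, '}}'],
  [/\s*\{%-/,  '{%'],
  [/-%\}\s*/,  '%}']
].freeze

# One combined alternation covering the same four cases in a single pass.
COMBINED = /\s*\{\{-|-\}\}\s*|\s*\{%-|-%\}\s*/

source = "plain text {{- 'x' -}} more {%- if true -%} tail " * 5_000

Benchmark.bm(10) do |bm|
  bm.report('4 passes') do
    s = source
    PATTERNS.each { |(regexp, string)| s = s.gsub(regexp, string) }
  end
  bm.report('1 pass') do
    # Dropping the dash and surrounding whitespace from whatever matched
    # gives the same result as the four targeted replacements.
    source.gsub(COMBINED) { |m| m.delete('-').strip }
  end
end
```

Both variants produce identical output; the interesting part is whether one scan with a bigger alternation beats four scans with simple patterns on your real templates.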
New changes bring it down to the tokenizer plus a whitespace pass, but you're right, there's no reason it can't be part of the tokenizer and just skip the whitespace during that step.
The overhead of handling the whitespace in the parser tokenizer creates more performance issues than the current proposed solution of just clearing it out in one pass first.
I think the only way to handle whitespace as part of the parser tokenizer without a massive performance hit is to move away from the current regex solution to a stream parser that hands off tokens as they are found. If the parser remembers the previous token and uses logic based on the next token to decide how to process it, then the stream only ever needs to move in one direction and only ever needs to store a single previous token in memory at any one stage. I've attempted regex look-aheads and look-behinds as part of the parser, and it just adds way too much performance loss.
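A minimal sketch of that one-direction stream idea, under heavy assumptions: the class name, token regex, and overall structure here are made up for illustration and are much simpler than Liquid's real tokenizer. The point is that the tokenizer only ever looks back at the single previous text token (to rstrip it) and carries one flag forward (to lstrip the next).

```ruby
# Illustrative one-pass trimmer: not the PR's code, just the shape of it.
class StreamingTrimmer
  # Simplified tag/output matcher; assumes no raw tags or nested braces.
  TAG = /(\{\{-?.*?-?\}\}|\{%-?.*?-?%\})/m

  def tokenize(source)
    tokens = []
    lstrip_next = false
    # split with a capture group yields text and tag tokens alternately.
    source.split(TAG).each do |tok|
      if tok.start_with?('{{-', '{%-') || tok.end_with?('-}}', '-%}')
        # Look back exactly one token to trim whitespace before this tag.
        tokens[-1] = tokens[-1].rstrip if tok.start_with?('{{-', '{%-') && !tokens.empty?
        # Remember to trim whitespace from the next text token.
        lstrip_next = tok.end_with?('-}}', '-%}')
        # Drop the dash markers from the tag itself.
        tok = tok.sub(/\A(\{\{|\{%)-/, '\1').sub(/-(\}\}|%\})\z/, '\1')
      elsif lstrip_next
        tok = tok.lstrip
        lstrip_next = false
      end
      tokens << tok
    end
    tokens
  end
end
```

The stream moves strictly forward; the only state is the last emitted token and one boolean, which matches the single-previous-token constraint described above.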
I've come to the conclusion that this solution is in the 95th percentile of all possible solutions. I went down the path of changing the tokenizer, but to be honest it was starting to feel like this was no longer the same project. The simplicity of the tokenizer is core to the library, and since so many people use it, changing the tokenizer this heavily would no doubt introduce breaking changes for people who run custom regexes for syntax changes. Honestly, I think the double regex solution above is the perfect balance of simplicity and performance. The regex passes have better performance separated than combined. For any Liquid not using the new syntax, there is a very performant regex check to avoid the full regex scan.
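The cheap check mentioned in the last sentence can be sketched like this. The regex bodies below are stand-ins I wrote for illustration; the PR's actual `WhiteSpaceTrimCheck` and `WhiteSpaceTrimTags` definitions may differ.

```ruby
# Stand-in definitions; the real PR constants may differ.
WhiteSpaceTrimCheck = /\{\{-|-\}\}|\{%-|-%\}/
WhiteSpaceTrimTags  = /\s*(\{\{|\{%)-|-(\}\}|%\})\s*/

def trim_whitespace(source)
  # Cheap membership check: templates without any dash markers skip the
  # full gsub scan entirely and are returned untouched.
  return source unless source =~ WhiteSpaceTrimCheck

  # Full pass: drop the dash and the adjacent whitespace in one gsub.
  source.gsub(WhiteSpaceTrimTags) { |m| m.delete('-').strip }
end
```

The `=~` guard only has to find the first dash marker (or prove there is none), so the common no-new-syntax case pays for one scan with a simple alternation instead of a scan plus replacement work.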
One of the proposed solutions that crops up in the comments of these whitespace issues is to allow newlines inside of Liquid tags and outputs. I had tried this in the past but always got an error. However, I just tried the following on the master branch, and it works:

```ruby
str = "
a{{
'b'
}}c{%
if true
%}d{%
endif
%}e
".strip
Liquid::Template.parse(str).render
#=> "abcde"
```

That's of course a contrived example, but it shows that whitespace can be placed inside Liquid tags/outputs where it will be swallowed automatically. Here's a real-world example:

```liquid
<ul>{%
for product in products %}
<li>{{ product.title }}</li>{%
endfor %}
</ul>
```

which, with two products, will render as:

```html
<ul>
<li>Shirt</li>
<li>Pants</li>
</ul>
```

Here's the code to run in a console if you want to try it out:

```ruby
html = "
<ul>{%
for product in products %}
<li>{{ product.title }}</li>{%
endfor %}
</ul>
".strip
data = { 'products' => [{ 'title' => 'Shirt' }, { 'title' => 'Pants' }] }
puts Liquid::Template.parse(html).render(data)
```

The syntax isn't as pretty as the proposed dash syntax, but is this behavior something that "just works," or is this something Liquid can commit to supporting officially? Official support might be the best option because there's no performance hit (no code changes needed). The only thing to add, I think, would be tests.
@nickpearson Yes, newlines in tags are officially supported. They were intentionally added in #324 along with tests. I would still like Liquid to support the ERB-like syntax, since I think it makes Liquid code more readable for template writers. I think we can address Shopify's performance concerns with this feature in liquid-c.
```ruby
return [] if source.to_s.empty?

if source =~ WhiteSpaceTrimCheck
  source = source.gsub(WhiteSpaceTrimTags, WhiteSpaceTrimReplace)
```
This will affect the content of raw tags as well
I think this will also affect line numbers, which will be confusing
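To see the raw-tag concern concretely, here's a small demonstration with a stand-in trim regex (not Liquid's actual constant): a whole-source `gsub` cannot distinguish literal dashes inside `{% raw %}` from real whitespace-control markers.

```ruby
# Stand-in for the PR's trim regex, for illustration only.
trim = /\s*\{\{-|-\}\}\s*/

source = "{% raw %}show {{- verbatim -}} markers{% endraw %}"
trimmed = source.gsub(trim) { |m| m.delete('-').strip }

puts trimmed
# The dashes inside the raw block get stripped, even though raw content
# should be rendered literally.
```

This is why the trimming logic needs to run somewhere that knows it is inside a raw tag, rather than as a blind preprocessing pass over the whole source.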
I'll add some tests for this. This really means the logic needs to be in the parser.
Yes, I think Liquid::BlockBody#parse is the right place for this logic.
Need to work something out. Currently the whitespace would be in the previous token, so we either need to bring the whitespace into the current token, which kills the performance of the tokenizer regex split, or create logic that looks back and modifies the stack if it finds a whitespace control symbol.
I just had a look at the liquid-c parser. Based on that, I think it's better to change the tokenizer first, as honestly this is where it should be handled to get the outer whitespace into the same token. I'll just tweak the performance the best I can in Ruby; looking at it, the C side won't have an issue.
> Currently the whitespace would be in the previous token so either need to bring the whitespace in to the current token which kills the performance of the tokenizer regex split or create logic that looks back and modifies the stack if it finds a whitespace control symbol.
In the ruby implementation, I was thinking that BlockBody#parse can modify the last item in the nodelist to rstrip whitespace to trim the previous token.
Trimming the following token might be simplest to handle with the help of the tokenizer by setting a flag on the tokenizer to tell it to lstrip the next token. That avoids the complication of interacting with child tags which may or may not be blocks.
liquid-c is able to handle things in the tokenizer more easily, since it lazily tokenizes in a way that allows trimming to be turned off while parsing the raw tag. When matching a TOKEN_RAW token it actually needs to read ahead to find where it ends, so it could easily check for an additional dash and trim whitespace from the end. When matching a tag or variable in tokenizer_next, it could easily check for a dash preceding the end of the tag, then set a flag on the tokenizer to lstrip whitespace on the next token.
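A rough sketch of the two-sided approach described above, using hypothetical, heavily simplified `Tokenizer` and `BlockBody` classes (not Liquid's real ones): the parser rstrips the last node it already holds when it sees a leading dash marker, and a flag on the tokenizer lstrips the next token when it saw a trailing one. For brevity the sketch leaves the dash markers inside the tag token itself; real code would strip those too.

```ruby
# Hypothetical minimal tokenizer: hands out pre-split tokens one at a
# time and can be told to lstrip the next one it yields.
class Tokenizer
  attr_accessor :lstrip_next

  def initialize(tokens)
    @tokens = tokens
    @lstrip_next = false
  end

  def shift
    tok = @tokens.shift
    if tok && @lstrip_next
      tok = tok.lstrip
      @lstrip_next = false
    end
    tok
  end
end

# Hypothetical minimal block body: collects nodes, trimming on the fly.
class BlockBody
  def initialize
    @nodelist = []
  end

  def parse(tokenizer)
    while (token = tokenizer.shift)
      if token.start_with?('{%-', '{{-')
        # Leading dash: trim whitespace off the previously parsed node.
        @nodelist[-1] = @nodelist[-1].rstrip if @nodelist.last.is_a?(String)
      end
      # Trailing dash: tell the tokenizer to lstrip the next token.
      tokenizer.lstrip_next = true if token.end_with?('-%}', '-}}')
      @nodelist << token
    end
    @nodelist
  end
end
```

This keeps the look-back to a single nodelist entry and avoids any interaction with child tags: the flag lives on the tokenizer, so whoever pulls the next token gets it already trimmed.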
The new change is slightly slower than the previous approach; however, I think there are still some optimisations I can do. It does pass all tests and handles raw tags correctly. Also, I don't generally write in Ruby, so if anything can be written with better syntax, could you point it out in code review comments?
Also, by slower I mean marginally; it could actually end up being faster once I do some more tweaks.
```ruby
module Liquid
  class Block < Tag
    IsTag = /^#{TagStart}/o
```
You seem to be working off a really old commit. See how different the file is now: https://github.com/Shopify/liquid/blob/master/lib/liquid/block.rb
That explains the merge conflict message. Updating my fork now, as you're right, it's from 2013.
Closing due to old commit |
This is a continuation of #215 and #216. It contains all the PR changes that were needed for @nickpearson to complete his PR. @fw42