Add support for whitespace control character allowing for ERB-like syntax #743
evulse wants to merge 8 commits into Shopify:master from evulse:whitespace-dash-feature
Conversation
```ruby
    source = source.gsub(white_space[0], white_space[1])
  end
end
```
For sake of readability, your block parameter might be better as:

```ruby
WhiteSpaceTrimTags.each do |(regexp, string)|
  source = source.gsub(regexp, string)
end
```
Hmm, not sure why Travis CI didn't run your tests here.
```ruby
if source =~ WhiteSpaceTrimCheck
  WhiteSpaceTrimTags.each do |(regexp, string)|
    source = source.gsub(regexp, string)
```
Hmm. I'm a bit concerned that this is essentially doing 4 passes (once for each regexp) over the entire source. I can't imagine that not being a considerable performance hit.
Like I said in the other ticket, we really need the whitespace work to be included in the existing parser tokenizer. Liquid is used in many super high performance situations; a 1% slowdown can easily lead to hundreds of thousands of dollars in hardware costs for Shopify alone. Performance matters a lot, and multiple passes are the death of performance.
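To make the multi-pass concern concrete, here is a rough benchmark sketch. The patterns, replacement logic, and sample source are stand-ins I made up for illustration; they only approximate the shape of the PR's `WhiteSpaceTrimTags`, not its actual definition.

```ruby
require 'benchmark'

# Stand-in trim patterns, roughly in the shape of the PR's WhiteSpaceTrimTags.
PATTERNS = [
  [/\s*\{\{-/, '{{'],
  [/-\}\}\s*/, '}}'],
  [/\s*\{%-/,  '{%'],
  [/-%\}\s*/,  '%}']
].freeze

# One combined alternation covering the same four cases in a single pass.
COMBINED = /\s*\{\{-|-\}\}\s*|\s*\{%-|-%\}\s*/

source = "plain text {{- 'x' -}} more {%- if true -%} tail " * 5_000

Benchmark.bm(10) do |bm|
  bm.report('4 passes') do
    s = source
    PATTERNS.each { |(regexp, string)| s = s.gsub(regexp, string) }
  end
  bm.report('1 pass') do
    # Dropping the dash and surrounding whitespace from whatever matched
    # gives the same result as the four targeted replacements.
    source.gsub(COMBINED) { |m| m.delete('-').strip }
  end
end
```

Both variants produce identical output; the interesting part is whether one scan with a bigger alternation beats four scans with simple patterns on your real templates.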
New changes bring it down to the tokenizer plus a whitespace pass, but you're right, there's no reason it can't be part of the tokenizer and just skip the whitespace during that step.
The overhead of handling the whitespace in the parser tokenizer creates more performance issues than the current proposed solution of just clearing it out in one pass first.
I think the only way to handle whitespace as part of the parser tokenizer without a massive performance hit is to move away from the current regex solution to a stream parser that hands off tokens as they are found. If the parser remembers the previous token and uses logic based on the next token to decide how to process it, then the stream only ever needs to move in one direction and only ever needs to store a single previous token in memory at any one stage. I've attempted regex look-aheads and look-behinds as part of the parser, and it just adds way too much performance loss.
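A minimal sketch of that one-direction stream idea, under heavy assumptions: the class name, token regex, and overall structure here are made up for illustration and are much simpler than Liquid's real tokenizer. The point is that the tokenizer only ever looks back at the single previous text token (to rstrip it) and carries one flag forward (to lstrip the next).

```ruby
# Illustrative one-pass trimmer: not the PR's code, just the shape of it.
class StreamingTrimmer
  # Simplified tag/output matcher; assumes no raw tags or nested braces.
  TAG = /(\{\{-?.*?-?\}\}|\{%-?.*?-?%\})/m

  def tokenize(source)
    tokens = []
    lstrip_next = false
    # split with a capture group yields text and tag tokens alternately.
    source.split(TAG).each do |tok|
      if tok.start_with?('{{-', '{%-') || tok.end_with?('-}}', '-%}')
        # Look back exactly one token to trim whitespace before this tag.
        tokens[-1] = tokens[-1].rstrip if tok.start_with?('{{-', '{%-') && !tokens.empty?
        # Remember to trim whitespace from the next text token.
        lstrip_next = tok.end_with?('-}}', '-%}')
        # Drop the dash markers from the tag itself.
        tok = tok.sub(/\A(\{\{|\{%)-/, '\1').sub(/-(\}\}|%\})\z/, '\1')
      elsif lstrip_next
        tok = tok.lstrip
        lstrip_next = false
      end
      tokens << tok
    end
    tokens
  end
end
```

The stream moves strictly forward; the only state is the last emitted token and one boolean, which matches the single-previous-token constraint described above.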
I've come to the conclusion that this solution is in the 95th percentile of all possible solutions. I went down the path of changing the tokenizer, but to be honest it was starting to feel like this was no longer the same project. The simplicity of the tokenizer is core to the library, and since so many people use it, changing the tokenizer this heavily would no doubt introduce breaking changes for people who run custom regexes for syntax changes. Honestly, I think the double regex solution above is the perfect balance of simplicity and performance. The regex passes have better performance separated than combined. For any Liquid not using the new syntax, there is a very performant regex check to avoid the full regex scan.
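The cheap check mentioned in the last sentence can be sketched like this. The regex bodies below are stand-ins I wrote for illustration; the PR's actual `WhiteSpaceTrimCheck` and `WhiteSpaceTrimTags` definitions may differ.

```ruby
# Stand-in definitions; the real PR constants may differ.
WhiteSpaceTrimCheck = /\{\{-|-\}\}|\{%-|-%\}/
WhiteSpaceTrimTags  = /\s*(\{\{|\{%)-|-(\}\}|%\})\s*/

def trim_whitespace(source)
  # Cheap membership check: templates without any dash markers skip the
  # full gsub scan entirely and are returned untouched.
  return source unless source =~ WhiteSpaceTrimCheck

  # Full pass: drop the dash and the adjacent whitespace in one gsub.
  source.gsub(WhiteSpaceTrimTags) { |m| m.delete('-').strip }
end
```

The `=~` guard only has to find the first dash marker (or prove there is none), so the common no-new-syntax case pays for one scan with a simple alternation instead of a scan plus replacement work.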
One of the proposed solutions that crops up in the comments of these whitespace issues is to allow newlines inside of Liquid tags and outputs. I had tried this in the past but always got an error. However, I just tried the following on the master branch, and it works:

```ruby
str = "
a{{
'b'
}}c{%
if true
%}d{%
endif
%}e
".strip
Liquid::Template.parse(str).render
#=> "abcde"
```

That's of course a contrived example, but it shows that whitespace can be placed inside Liquid tags/outputs where it will be swallowed automatically. Here's a real-world example:

```liquid
<ul>{%
for product in products %}
<li>{{ product.title }}</li>{%
endfor %}
</ul>
```

which, with two products, will render as:

```html
<ul>
<li>Shirt</li>
<li>Pants</li>
</ul>
```

Here's the code to run in a console if you want to try it out:

```ruby
html = "
<ul>{%
for product in products %}
<li>{{ product.title }}</li>{%
endfor %}
</ul>
".strip
data = { 'products' => [{ 'title' => 'Shirt' }, { 'title' => 'Pants' }] }
puts Liquid::Template.parse(html).render(data)
```

The syntax isn't as pretty as the proposed dash syntax, but is this behavior something that "just works," or is this something Liquid can commit to supporting officially? Official support might be the best option because there's no performance hit (no code changes needed). The only thing to add, I think, would be tests.
@nickpearson Yes, newlines in tags are officially supported. They were intentionally added in #324 along with tests. I would still like Liquid to support the ERB-like syntax, since I think it makes Liquid code more readable for template writers. I think we can address Shopify's performance concerns with this feature in liquid-c.
```ruby
return [] if source.to_s.empty?

if source =~ WhiteSpaceTrimCheck
  source = source.gsub(WhiteSpaceTrimTags, WhiteSpaceTrimReplace)
```
This will affect the content of raw tags as well
I think this will also affect line numbers, which will be confusing
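To see the raw-tag concern concretely, here's a small demonstration with a stand-in trim regex (not Liquid's actual constant): a whole-source `gsub` cannot distinguish literal dashes inside `{% raw %}` from real whitespace-control markers.

```ruby
# Stand-in for the PR's trim regex, for illustration only.
trim = /\s*\{\{-|-\}\}\s*/

source = "{% raw %}show {{- verbatim -}} markers{% endraw %}"
trimmed = source.gsub(trim) { |m| m.delete('-').strip }

puts trimmed
# The dashes inside the raw block get stripped, even though raw content
# should be rendered literally.
```

This is why the trimming logic needs to run somewhere that knows it is inside a raw tag, rather than as a blind preprocessing pass over the whole source.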
I'll add some tests for this. This really means the logic needs to be in the parser.
Yes, I think Liquid::BlockBody#parse is the right place for this logic.
Need to work something out. Currently the whitespace would be in the previous token, so we either need to bring the whitespace into the current token, which kills the performance of the tokenizer regex split, or create logic that looks back and modifies the stack if it finds a whitespace control symbol.
I just had a look at the liquid-c parser. Based on that, I think it's better to change the tokenizer first, as honestly this is where it should be handled to get the outer whitespace into the same token. I'll just tweak the performance the best I can in Ruby; looking at it, the C side won't have an issue.
> Currently the whitespace would be in the previous token so either need to bring the whitespace in to the current token which kills the performance of the tokenizer regex split or create logic that looks back and modifies the stack if it finds a whitespace control symbol.
In the ruby implementation, I was thinking that BlockBody#parse can modify the last item in the nodelist to rstrip whitespace to trim the previous token.
Trimming the following token might be simplest to handle with the help of the tokenizer by setting a flag on the tokenizer to tell it to lstrip the next token. That avoids the complication of interacting with child tags which may or may not be blocks.
liquid-c is able to handle things in the tokenizer more easily, since it lazily tokenizes in a way that allows trimming to be turned off while parsing the raw tag. When matching a TOKEN_RAW token it actually needs to read ahead to find where it ends, so it could easily check for an additional dash and trim whitespace from the end. When matching a tag or variable in tokenizer_next, it could easily check for a dash preceding the end of the tag, then set a flag on the tokenizer to lstrip whitespace on the next token.
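A rough sketch of the two-sided approach described above, using hypothetical, heavily simplified `Tokenizer` and `BlockBody` classes (not Liquid's real ones): the parser rstrips the last node it already holds when it sees a leading dash marker, and a flag on the tokenizer lstrips the next token when it saw a trailing one. For brevity the sketch leaves the dash markers inside the tag token itself; real code would strip those too.

```ruby
# Hypothetical minimal tokenizer: hands out pre-split tokens one at a
# time and can be told to lstrip the next one it yields.
class Tokenizer
  attr_accessor :lstrip_next

  def initialize(tokens)
    @tokens = tokens
    @lstrip_next = false
  end

  def shift
    tok = @tokens.shift
    if tok && @lstrip_next
      tok = tok.lstrip
      @lstrip_next = false
    end
    tok
  end
end

# Hypothetical minimal block body: collects nodes, trimming on the fly.
class BlockBody
  def initialize
    @nodelist = []
  end

  def parse(tokenizer)
    while (token = tokenizer.shift)
      if token.start_with?('{%-', '{{-')
        # Leading dash: trim whitespace off the previously parsed node.
        @nodelist[-1] = @nodelist[-1].rstrip if @nodelist.last.is_a?(String)
      end
      # Trailing dash: tell the tokenizer to lstrip the next token.
      tokenizer.lstrip_next = true if token.end_with?('-%}', '-}}')
      @nodelist << token
    end
    @nodelist
  end
end
```

This keeps the look-back to a single nodelist entry and avoids any interaction with child tags: the flag lives on the tokenizer, so whoever pulls the next token gets it already trimmed.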
The new change is slightly slower than the previous approach; however, I think there are still some optimisations I can do. It does pass all tests and handles raw tags correctly. Also, I don't generally write in Ruby, so if anything can be written with better syntax, could you point it out in code review comments?
Also, by slower I mean marginally; it could actually end up being faster once I do some more tweaks.
```ruby
module Liquid
  class Block < Tag
    IsTag = /^#{TagStart}/o
```
You seem to be working off a really old commit. See how different the file is now: https://github.com/Shopify/liquid/blob/master/lib/liquid/block.rb
That explains the merge conflict message. Updating my fork now, as you're right, it's from 2013.
Closing due to old commit |
This is a continuation of #215 and #216. It contains all the PR changes that were needed for @nickpearson to complete his PR. @fw42