New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for whitespace control character. #30

Merged
merged 8 commits into from Aug 11, 2016

Conversation

Projects
None yet
5 participants
@evulse

evulse commented Jun 27, 2016

This matches up with Shopify/liquid#746 to resolve issues Shopify/liquid#216, Shopify/liquid#215, Shopify/liquid#214, Shopify/liquid#194, Shopify/liquid#171, Shopify/liquid#162

This pull request is designed to be standalone and will pass all tests without its counterpart Shopify/liquid#746 however Shopify/liquid#746 requires this pull request and will be updated with this dependancy once merged.

@fw42

This comment has been minimized.

Member

fw42 commented Jun 27, 2016

@evulse

This comment has been minimized.

evulse commented Jun 27, 2016

This adds support for {{- and {%- syntax which will lstrip! and -}} and -%} which will rstrip!

@evulse

This comment has been minimized.

evulse commented Jun 27, 2016

Now matches up to Shopify/liquid#773

@tobi

This comment has been minimized.

Member

tobi commented Jun 27, 2016

I'd love to avoid the cycle through the Ruby VM to call rstrip!/lstrip!.

Ideally we should parse white-space as a special token, then modify the token to be marked for render-skipping if {%- or -%} are encountered.

@tobi

This comment has been minimized.

Member

tobi commented Jun 27, 2016

I see now that this would make it harder to match with Shopify/liquid#746 - alright, let's not overthink it. Thanks!

@evulse

This comment has been minimized.

evulse commented Jun 27, 2016

Yeah I would have liked to avoid it but the trade off isn't worth it at this stage. To keep the functions inline on both sides makes maintenance easier. Also as I'm pulling back the last token if needed it requires conversion from a ruby object then back again where as this approach keeps it clean and only a few lines

@@ -56,6 +56,10 @@ static VALUE rb_block_parse(VALUE self, VALUE tokens, VALUE options)
case TOKEN_RAW:
{
VALUE str = rb_enc_str_new(token.str, token.length, utf8_encoding);
if(token.trim_whitespace)
rb_funcall(str, rb_intern("lstrip!"), 0);

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

Why not find the offset after any whitespace (see lstrip_offset) and create the string with that offset? Avoids a memmove.

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

I'm now thinking that it could be a good idea to make read_while respect the codepoint length (see other usages of read_while for what it does).

if (token.str[2] == '-') {
VALUE last_node = rb_ary_pop(nodelist);
rb_funcall(last_node, rb_intern("rstrip!"), 0);
rb_ary_push(nodelist, last_node);

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

Just use rb_ary_last instead of popping and pushing.

This comment has been minimized.

@evulse

evulse Jun 29, 2016

Agreed this is a much better approach, implementing now

This comment has been minimized.

@evulse

evulse Jun 29, 2016

It looks like rb_ary_last isn't exposed so have used some of its logic directly

if (token.str[2] == '-') {
VALUE last_node = rb_ary_pop(nodelist);
rb_funcall(last_node, rb_intern("rstrip!"), 0);

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

I don't mind rstrip! because it just sets the length – no memmove required. I wish they exported the string functions to C, but oh well.

if (token.str[2] == '-') {
VALUE last_node = rb_ary_pop(nodelist);
rb_funcall(last_node, rb_intern("rstrip!"), 0);
rb_ary_push(nodelist, last_node);

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

ditto for rb_ary_last

@@ -68,7 +72,22 @@ static VALUE rb_block_parse(VALUE self, VALUE tokens, VALUE options)
}
case TOKEN_VARIABLE:
{
VALUE args[2] = {rb_enc_str_new(token.str + 2, token.length - 4, utf8_encoding), options};
const char *start = token.str + 2;
long end = token.length - 4;

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

Call this length instead.

@@ -127,6 +129,13 @@ void tokenizer_next(tokenizer_t *tokenizer, token_t *token)
cursor++;
}
}
if(token->type == TOKEN_RAW && tokenizer->trim_whitespace ) {

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

spacing

@@ -78,6 +97,17 @@ static VALUE rb_block_parse(VALUE self, VALUE tokens, VALUE options)
{
const char *start = token.str + 2, *end = token.str + token.length - 2;
if (token.str[2] == '-') {
VALUE last_node = rb_ary_pop(nodelist);
rb_funcall(last_node, rb_intern("rstrip!"), 0);

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

We've been typically caching the result of rb_intern

@@ -56,6 +56,10 @@ static VALUE rb_block_parse(VALUE self, VALUE tokens, VALUE options)
case TOKEN_RAW:
{
VALUE str = rb_enc_str_new(token.str, token.length, utf8_encoding);
if(token.trim_whitespace)

This comment has been minimized.

@pushrax

pushrax Jun 28, 2016

Member

spacing

@pushrax

This comment has been minimized.

Member

pushrax commented Jun 28, 2016

Makes sense and looks pretty good overall.

const char *start = token.str + 2;
long length = token.length - 4;
if (token.str[2] == '-') {

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

You can use start[0] or *start instead

length--;
}
if (token.str[token.length - 3] == '-') {

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

I would use if (length > 0 && start[length - 1] == '-') { because I am concerned about a case like {{-}} would cause length to go negative, which is a scary state to be in C code.

start++;
}
if (token.str[token.length - 3] == '-') {

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

Same concern as above, where {%-%} can cause the end pointer to come before the start pointer.

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

I also think we should avoid this code duplication using an inline function so we can solve this problem once. This function could also be defined with the tokenizer code so whitespace trimming can be handled in one place. There could be one function to get the tag token contents (i.e. without the tag start, end or trim characters) and another to get the raw token contents (i.e. after trimming).

if(token->type == TOKEN_RAW && tokenizer->trim_whitespace) {
token->trim_whitespace = 1;
}
tokenizer->trim_whitespace = ((token->type == TOKEN_VARIABLE || token->type == TOKEN_TAG) && token->str[token->length - 3] == '-');

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

I don't see why you are doing these checks down here when you could do them in context where token->type is set.

That would also allow you to easily detect if the next token should be trimmed, since the function finds the end of a raw string by finding the start of the next tag or variable. That means you could avoid having to construct a ruby string that needs to be right stripped. You could store that as two flags on the token for whether the raw tag should be left and/or right trimmed.

This comment has been minimized.

@evulse

evulse Jul 1, 2016

I'm actually refactoring this entire pull request around this as your right its much better to extend the tokenizer to look forward 3 characters rather than its current 2 so it can see the white space control character and remove any need for the code to ever look backwards.

@@ -13,13 +13,15 @@ typedef struct token {
enum token_type type;
const char *str;
long length;
unsigned int trim_whitespace;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jun 30, 2016

Member

If you move this field after enum token_type type; then it would avoid unnecessary padding on x64, since right now it stores 4 bytes for the enum followed by 4 bytes of padding, then two 8 byte fields for str and length.

@evulse

This comment has been minimized.

evulse commented Jul 7, 2016

Alright we are passing again. Detection of the whitespace control characters has now been pushed back in to the tokenizer so there is no need to push and pop any existing values.

@tobi

This comment has been minimized.

Member

tobi commented Jul 7, 2016

really nice work Mike.

  • tobi
    CEO Shopify

On Thu, Jul 7, 2016 at 8:18 AM, Mike Angell notifications@github.com
wrote:

Alright we are passing again. Detection of the whitespace control
characters has now been pushed back in to the tokenizer so there is no need
to push and pop any existing values.


You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
#30 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAABW7m4s55YijtSESQGVQYhVRGXmrgdks5qTO6AgaJpZM4I_Yjh
.

VALUE str = rb_enc_str_new(token_start, end - token_start, utf8_encoding);
if(token.rstrip)
rb_funcall(str, intern_rstrip, 0);

This comment has been minimized.

@pushrax

pushrax Jul 7, 2016

Member

What do you think about adding a reverse_read_while if the lstrip part is using read_while anyway?

This comment has been minimized.

@evulse

evulse Jul 8, 2016

Yeah I actually was thinking to do exactly that. I really only stuck to this while refactoring so I could confirm i didn't introduce anything new. Should have some time tonight to pull this out and bring it up to where the lstrip sits.

@pushrax

This comment has been minimized.

Member

pushrax commented Jul 7, 2016

Definitely the better solution, nice.

@evulse

This comment has been minimized.

evulse commented Jul 8, 2016

Ok new change from using ruby rstrip is done which makes this pretty final. As such I've just run the final performance benchmark

Current

Calculating -------------------------------------
          parse:     80.921  (± 3.7%) i/s -      4.851k in  60.028478s
    parse & run:     32.196  (± 3.1%) i/s -      1.932k in  60.069380s

This feature

Calculating -------------------------------------
          parse:     80.791  (± 3.7%) i/s -      4.844k in  60.031186s
    parse & run:     32.109  (± 3.1%) i/s -      1.926k in  60.046058s

The results show the performance impact of this feature is very minimal.

@evulse

This comment has been minimized.

evulse commented Jul 12, 2016

@pushrax @tobi @dylanahsmith Where to from here? We would love to be able to start using this

while (cursor < last) {
if (*cursor++ != '{')
continue;
char c = *cursor++;
char w = *cursor++;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

This can read past the end of the string. E.g. tokenizing "{{" results in a token with "{{\u0000". The while (cursor < last) check only allows us to read two characters before having to do another bounds check to see if we can read another byte.

token->type = TOKEN_RAW;
cursor -= 2;
cursor -= 3;
token->rstrip = (w == '-');

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

I would keep the lstrip and rstrip initialization, then replace the other above changes to this function with

if (cursor <= last && *cursor == '-') {
  cursor++;
  token->rstrip = 1;
}
@@ -90,10 +99,13 @@ void tokenizer_next(tokenizer_t *tokenizer, token_t *token)
if (c != '}')
continue;
token->type = TOKEN_TAG;
if(token->str[cursor - tokenizer->cursor - 3] == '-')

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

Why not check cursor[-3] instead of token->str[cursor - tokenizer->cursor - 3]

goto found;
}
// unterminated tag
cursor = tokenizer->cursor + 2;
cursor = tokenizer->cursor + 3;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

Why are you taking three characters for an unterminated tag. E.g. this means '{{ab' would get tokenized into ['{{a', 'b'] instead of ['{{', 'ab']. If the intention was to include the whitespace control character in the unterminated token, then I don't see the point in doing so.

cursor -= 3;
token->rstrip = (w == '-');
if(tokenizer->lstrip_flag)
token->lstrip = 1;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

You can remove the conditional and just set token->lstrip = tokenizer->lstrip_flag;

goto found;
}
}
cursor = last + 1;
if(tokenizer->lstrip_flag)
token->lstrip = 1;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

This can also unconditionally do token->lstrip = tokenizer->lstrip_flag;

You should follow that with tokenizer->lstrip_flag = 0; as well, like you did above

VALUE var = rb_class_new_instance(2, args, cLiquidVariable);
rb_ary_push(nodelist, var);
rb_ivar_set(self, intern_blank, Qfalse);
break;
}
case TOKEN_TAG:
{
const char *start = token.str + 2, *end = token.str + token.length - 2;
const char *start = token.str + (2 + token.lstrip), *end = token.str + (token.length - 2 - token.rstrip);

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 12, 2016

Member

The brackets are unnecessary the way you are using them

@dylanahsmith

This comment has been minimized.

Member

dylanahsmith commented Jul 12, 2016

There were some corner cases that you missed as noted above. But overall this is looking very good now. Nice work

@evulse

This comment has been minimized.

evulse commented Jul 13, 2016

@dylanahsmith is this what you were thinking?

while (cursor < last) {
if (*cursor++ != '{')
continue;
char c = *cursor++;
if (cursor <= last && *cursor == '-') {
cursor++;
token->rstrip = 1;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 13, 2016

Member

If the following if (c != '%' && c != '{') branch is taken, then this change won't be reversed.

goto found;
}
token->type = TOKEN_INVALID;
token->lstrip = token->rstrip;

This comment has been minimized.

@dylanahsmith

dylanahsmith Jul 13, 2016

Member

This will leave token->rstrip set to 1, even though it is just the left whitespace control character that has been found. Below you only conditionally set token->rstrip = 1, so left whitespace control character will always result in the rstrip flag being set on the token.

@evulse

This comment has been minimized.

evulse commented Jul 14, 2016

I think just shifting this removes this issue. So flowing through this previously this would have allowed {.- where . being any character. However i think shifting this only allows {%- and {{- as the tag. So the only flows are

  • These 3 characters are at the end of a string and become a raw token
  • These 3 characters are at the start and become a token or variable and run their own loops
  • This is the last token and as such there will be no other tag following that will trigger the lstrip
@evulse

This comment has been minimized.

evulse commented Jul 18, 2016

Is there anything else needed for this to progress?

@dylanahsmith

This comment has been minimized.

Member

dylanahsmith commented Jul 18, 2016

It looks like https://github.com/Shopify/liquid-c/pull/30/files#r70563502 still hasn't been addressed. That was my last concern.

reset rstrip after deciding this is not a raw token as tags and varia…
…bles set these based on their behaviour to raw tags
@evulse

This comment has been minimized.

evulse commented Jul 20, 2016

Ok i've added a reset after this line so it will only be set if it is found.

@evulse

This comment has been minimized.

evulse commented Aug 1, 2016

Ok this and the ruby version look like they are ready to go. What do we need to do to get this moving.

@fw42 fw42 merged commit 1fa04f1 into Shopify:master Aug 11, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment