
nonsensical code in Whitespace->__TOKENIZER__on_char #10

Closed
wchristian opened this issue Dec 18, 2013 · 6 comments
@wchristian
Member

There is a bit there that seems to try to determine whether a character outside the ASCII range is a word or whitespace character; however, instead of actually looking at the current character, it looks at the stringified tokenizer, which is just a Perl memory address. I'm unclear on whether the tokenizer is supposed to stringify, or whether this was just a piece where the meaning of $t changed without the code adapting. Anything but the obvious change to chr($char) you'd like done here?

@adamkennedy
Collaborator

The tokenizer doesn't stringify; it's presumably a bug.

Adam

@wchristian
Member Author

Cool, I'll fix that then.

@ghost ghost assigned wchristian Dec 19, 2013
@wchristian
Member Author

Quick note: The specific code that's nonsensical here is:

https://metacpan.org/source/ADAMK/PPI-1.215/lib/PPI/Token/Whitespace.pm#L407

That's still not fixed and does need a fix.

@adamkennedy
Collaborator

That looks like it might have been a very early attempt to do something at least remotely useful with Unicode or accented Latin-1 characters. Probably the latter...

@wchristian
Member Author

Alright, I worked out what's going on there.

The code tried to figure out whether a non-ASCII char is whitespace or a word char, so it could proceed accordingly; otherwise it throws an error explaining that it doesn't know what to do with the char. With the original version of the code it ALWAYS recognized the char as a word char, which caused weird crashes in Token::Word.

After fixing it to inspect the character itself, input that utf8 had transformed into unhandleable mojibake now produces an "unexpected character" error instead.

This still needs further addressing by way of better input handling, as mentioned in #26, but for now this is an improvement.
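To illustrate the failure mode: a blessed reference interpolated into a regex match stringifies to something like `PPI::Tokenizer=HASH(0x55ab...)`, which always contains word characters, so a `/\w/` test against `$t` succeeds for every input. The sketch below demonstrates this with a hypothetical `Fake::Tokenizer` standing in for the real tokenizer object; it is not the verbatim Whitespace.pm source, just a minimal reconstruction of the buggy check versus the `chr($char)` fix.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # so \w and \s use Unicode rules on chr() results

# Hypothetical stand-in for PPI::Tokenizer (assumption, for illustration only)
package Fake::Tokenizer;
sub new { bless {}, shift }

package main;

my $t    = Fake::Tokenizer->new;    # in __TOKENIZER__on_char, $t is the tokenizer object
my $char = ord("\x{E9}");           # current character as an ordinal: "é", outside ASCII

# Buggy check: matches against the stringified object,
# e.g. "Fake::Tokenizer=HASH(0x55ab...)", which always contains
# \w characters, so every non-ASCII char was classified as a word char.
my $buggy_is_word = ( "$t" =~ /\w/ ) ? 1 : 0;

# Fixed check: inspect the actual character instead.
my $is_word  = ( chr($char) =~ /\w/ ) ? 1 : 0;
my $is_space = ( chr($char) =~ /\s/ ) ? 1 : 0;

print "buggy=$buggy_is_word word=$is_word space=$is_space\n";
```

With the fix, "é" is correctly classified as a word character and whitespace such as a tab would be classified as whitespace, while truly unrecognized characters fall through to the "unexpected character" error path.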

@wchristian wchristian added this to the 1.222 milestone Nov 13, 2014
@adamkennedy
Collaborator

Agreed
