Unicode whitespace support #86

Closed
grammarware opened this issue Dec 23, 2012 · 6 comments

@grammarware
Member

aka "fix utf plz"

In particular, the nasty \uC2A0 (non-breaking space) character just made me waste quite a lot of time and even made me believe for a second that I had a problem in my grammar.

Details: if the layout/whitespace definition is the usual [\ \n\t\r]+ !>> [\ \n\t\r], and this character is in the input stream, a parse error occurs. However, if we add it to the layout definition, the same error occurs (my guess is that it is somehow normalised to the normal space). In fact, even replaceAll(inputString,"\uC2A0"," ") does not solve the problem (again, my guess is that some normalisations occur somewhere).
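A possible reading of the report above, offered as an assumption rather than something the thread confirms: the non-breaking space is the code point U+00A0, and C2 A0 is its two-byte UTF-8 encoding, whereas the escape \uC2A0 names a different code point. The Python sketch below (not Rascal; the sample input string is hypothetical) shows why a replacement keyed on U+C2A0 would leave a non-breaking space untouched.

```python
import unicodedata

# Assumption: the character meant in the report is NO-BREAK SPACE, U+00A0.
NBSP = "\u00a0"

# Its UTF-8 encoding is the two bytes C2 A0, which may be where "C2A0" comes from.
utf8_bytes = NBSP.encode("utf-8")

# The escape \uC2A0 names a different code point (a Hangul syllable).
other = "\uc2a0"
other_name = unicodedata.name(other)

# Hypothetical input containing a no-break space between two tokens.
text = "begin" + NBSP + "end"

# Replacing U+C2A0 leaves the input unchanged; replacing U+00A0 removes the NBSP.
unchanged = text.replace(other, " ")
fixed = text.replace(NBSP, " ")

print(utf8_bytes.hex(), "|", other_name, "|", unchanged == text, "|", repr(fixed))
```

Whether this is what happened in the Rascal case is not established here; the later comments point to a fix in the character handling itself.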

I know enough Perl to overcome this for now, but I love Rascal much more ;)

@DavyLandman
Member

Are you running an old version of Rascal? I seem to remember fixing this about 2 months ago.

@grammarware
Member Author

Rascal plugin 0.5.2 on Juno

@DavyLandman
Member

Okay, so reading this again:

  1. Rascal used to be sensitive to this issue too. This was fixed in c23269a (2 months after the release of 0.5.2).
  2. @jurgenvinju seems to have added it to lang::std::Whitespace but not to the lang::std::Layout follow restriction.
  3. The normalization of the char in the class is perhaps a different bug?

So, as a workaround: add the Unicode ranges from the new lang::std::Whitespace, or update and use lang::std::Whitespace, and perhaps fix lang::std::Layout.

Regarding 3, check whether this is still an issue with the current release, since 0.5.2 is about 4 months old.

@grammarware
Member Author

kthx, works now!

btw, it turned out I already ran 0.5.4. At least my Eclipse says so, while the website said 0.5.2 was the last one.

@anyahelene
Contributor


Do we have \s as in Perl? Unless we have an (old-fashioned) language definition that says otherwise, treating any Unicode space as whitespace is probably better than enumerating the characters. I see Vadim has forgotten form feed, for example, which is commonly accepted as whitespace (heavily used in GNU code).

-anya
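To illustrate the difference between enumerating characters and using a generic whitespace class, here is a small sketch in Python rather than Rascal. It compares the character class quoted in the report with Python's `\s`, which for text patterns also covers Unicode whitespace. It says nothing about what Rascal itself provides.

```python
import re

# The character class from the report: space, newline, tab, carriage return.
enumerated = re.compile(r"[ \n\t\r]+")
# Python's \s on str patterns covers Unicode whitespace as well.
any_space = re.compile(r"\s+")

samples = {
    "space": " ",
    "form feed": "\f",
    "no-break space": "\u00a0",
}

# For each sample: (matched by the enumerated class, matched by \s).
results = {
    name: (bool(enumerated.fullmatch(ch)), bool(any_space.fullmatch(ch)))
    for name, ch in samples.items()
}

for name, (in_class, in_s) in results.items():
    print(f"{name}: enumerated={in_class} any_space={in_s}")
```

The form feed and the non-breaking space both fall outside the enumerated class but inside the generic one, which is the gap the comment above describes.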


@grammarware
Member Author

Well, using Unicode ranges should do the trick, since I would expect every possible whitespace character to be in the whitespace category of the Unicode standard.
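One caveat worth checking when deriving such ranges: the Unicode general category "Zs" (space separator) contains the ordinary space and the non-breaking space, but tab, newline and form feed are control characters (category "Cc"), so they would still need to be listed separately. The Python sketch below derives the Zs ranges from whatever Unicode version the interpreter ships with. The helper name `category_ranges` is made up for this example, and the exact list can differ between Unicode versions.

```python
import sys
import unicodedata

def category_ranges(category):
    """Return inclusive (start, end) code point ranges for one general category."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))
    return ranges

zs_ranges = category_ranges("Zs")
for lo, hi in zs_ranges:
    print(f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-U+{hi:04X}")

# Common layout characters that are NOT in Zs: they are control characters.
control_categories = {
    name: unicodedata.category(ch)
    for name, ch in {"tab": "\t", "newline": "\n", "form feed": "\f"}.items()
}
print(control_categories)
```

Ranges produced this way could then be pasted into a layout definition alongside the control characters, assuming the grammar formalism accepts Unicode ranges in character classes.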
