Lexer positions vs multi-byte file encodings #5163
Comments
We should look at this: https://github.com/whitequark/ulex, maybe it's not hard at all to just plug it in instead of standard lexing.
I was kind of hoping to slowly move away from camlp4o, not towards it...
Meh, what's the alternative?
PPX: http://whitequark.org/blog/2014/04/16/a-guide-to-extension-points-in-ocaml/ As it happens, the lexer you mentioned has been ported to that anyway: https://github.com/alainfrisch/sedlex
Oh, nice. And we can all switch to OCaml 4 for that \o/ :-)
Yes, we'll have to do that eventually anyway. I just hope this is more or less easily doable for dirty Windows peasants.
Well, I was using OCaml 4 with this http://protz.github.io/ocaml-installer/ from day one without issues.
Alright, this was fairly easy: https://github.com/nadako/haxe/tree/sedlex I'd love this to be merged someday, but we need to figure out our build flow. Right now, I just hacked it in. As for this issue, I'd love someone to check it out. For me it reports proper expression positions with Cyrillic letters in UTF-8, and I'm very happy about it!
We discussed it before with @Simn and decided that it's probably Haxe 4 material. Still, I'm creating this issue so we don't forget about it.
Because our lexer isn't encoding-aware, it reports byte positions within lines instead of character positions, which creates all kinds of issues for IDEs when code contains non-ASCII strings and comments.
Sometimes the IDE can correct this by reading the file again, splitting it into lines, getting the specific line, and decoding it, as is done for example in haxe-languageserver; this is ugly and inefficient enough already.
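The decoding step above can be sketched in OCaml (a hypothetical helper, not the actual haxe-languageserver code; the function name is made up). It walks a UTF-8 line up to a byte offset reported by the lexer and counts code points instead of bytes, assuming valid UTF-8 and a byte offset that lands on a character boundary:

```ocaml
(* Hypothetical sketch: convert a byte offset within a UTF-8 line
   (what the lexer currently reports) to a character offset (what
   editors expect). Assumes valid UTF-8 input. *)
let char_col_of_byte_col (line : string) (byte_col : int) : int =
  let rec loop byte_pos char_pos =
    if byte_pos >= byte_col then char_pos
    else
      let b = Char.code line.[byte_pos] in
      (* A UTF-8 sequence's length is determined by its lead byte. *)
      let width =
        if b < 0x80 then 1        (* ASCII *)
        else if b < 0xE0 then 2   (* 110xxxxx *)
        else if b < 0xF0 then 3   (* 1110xxxx *)
        else 4                    (* 11110xxx *)
      in
      loop (byte_pos + width) (char_pos + 1)
  in
  loop 0 0

let () =
  (* "привет" is 6 characters but 12 bytes in UTF-8. *)
  assert (char_col_of_byte_col "привет" 12 = 6);
  assert (char_col_of_byte_col "hello" 5 = 5)
```

Note this requires re-reading and scanning every line an error points into, which is exactly the inefficiency described above; doing it once in the lexer would be cheaper.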
However, editors often don't even provide a way to execute that kind of code. For example, both Sublime Text and VS Code extract compilation errors from the output using regex groups that capture the line and character numbers, and then use that info to jump to or underline the problem location. If that line contains multi-byte strings or comments, we end up with a completely off position.
I think in 2016 it's safe and fair to require source files to be UTF-8 encoded and report correct positions based on that.
PS While we're at it, we should also think about making reported character offsets 1-based instead of 0-based, to be consistent with our line numbers. This is also an issue with both ST and VS Code: they currently underline positions shifted to the left by one character.
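The off-by-one can be illustrated with a minimal sketch (the record type and function name are made up for illustration): lines are already 1-based, so only the column needs shifting to match the convention the editors' error regexes assume.

```ocaml
(* Hypothetical position record: lines are reported 1-based but
   character offsets 0-based, which is what ST and VS Code trip over. *)
type pos = { line : int; col : int }  (* col is 0-based *)

(* Shift the column so both components are 1-based. *)
let to_one_based (p : pos) : int * int = (p.line, p.col + 1)

let () =
  (* An error at the very first character of line 3. *)
  assert (to_one_based { line = 3; col = 0 } = (3, 1))
```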