Lexer positions vs multi-byte file encodings #5163

Closed
nadako opened this issue Apr 25, 2016 · 8 comments
@nadako
Member

nadako commented Apr 25, 2016

We discussed this before with @Simn and decided that it's probably Haxe 4 material. Still, I'm creating this issue so we don't forget about it.

Because our lexer isn't encoding-aware, it reports byte positions within lines instead of character positions, which creates all kinds of issues for IDEs when code contains non-ASCII strings and comments.

class Main {
static function main() {
/* Privet */ var someVar1; // main.hx:3: characters 17-25
/* Привет */ var someVar2; // main.hx:4: characters 23-31
}
}
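
To make the discrepancy concrete, here is a small sketch (in Python, chosen only for illustration; it is not part of the compiler or of any IDE plugin) that compares where the variable name starts when the line is measured in UTF-8 bytes versus in characters. The helper name `offsets` is made up for this example.

```python
def offsets(line: str, needle: str) -> tuple[int, int]:
    """Return (byte_offset, char_offset) of needle within line, both 0-based."""
    char_offset = line.index(needle)
    # Same search, but on the UTF-8 encoded bytes of the line.
    byte_offset = line.encode("utf-8").index(needle.encode("utf-8"))
    return byte_offset, char_offset


ascii_line = "/* Privet */ var someVar1;"
cyrillic_line = "/* Привет */ var someVar2;"

print(offsets(ascii_line, "someVar1"))     # (17, 17)
print(offsets(cyrillic_line, "someVar2"))  # (23, 17)
```

Each of the six Cyrillic letters takes two bytes in UTF-8, which accounts for the shift from 17 to 23 in the reported position.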

Sometimes this can be corrected by the IDE: read the file again, split it into lines, pick the specific line, and convert the byte offset into a character offset by encoding and decoding it, as is done for example in haxe-languageserver. That is ugly and inefficient enough already.
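
A minimal sketch of that kind of workaround, assuming the source is UTF-8 and the reported columns are 0-based byte offsets. This is not the actual haxe-languageserver code; `byte_col_to_char_col` and `fix_range` are hypothetical names used only here.

```python
def byte_col_to_char_col(line: str, byte_col: int) -> int:
    """Convert a 0-based UTF-8 byte column into a 0-based character column."""
    prefix = line.encode("utf-8")[:byte_col]
    # errors="ignore" drops a trailing partial character if byte_col
    # happens to fall in the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


def fix_range(source: str, line_no: int, start: int, end: int) -> tuple[int, int]:
    """Translate a reported byte range on a 1-based line into character columns."""
    line = source.splitlines()[line_no - 1]
    return byte_col_to_char_col(line, start), byte_col_to_char_col(line, end)


source = (
    "class Main {\n"
    "static function main() {\n"
    "/* Privet */ var someVar1;\n"
    "/* Привет */ var someVar2;\n"
    "}\n"
    "}\n"
)
print(fix_range(source, 4, 23, 31))  # (17, 25)
```

Note that this counts Unicode code points. An editor that measures columns in UTF-16 code units would need a further conversion for characters outside the Basic Multilingual Plane.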

However, editors often don't even provide a way to execute that kind of code. For example, both Sublime Text and VS Code provide a way to recognize and extract compilation errors from the output, using regex groups that capture the line and character number. They then use that info to go to or underline the problem place. And if we have multi-byte strings/comments on that line, we end up with a completely off position.
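
Roughly what such a regex-based matcher does, sketched in Python rather than in either editor's actual configuration format. The pattern and the `parse_position` helper are assumptions written against the message format shown above, and the pattern would not cope with Windows paths that contain a drive colon.

```python
import re

# Captures file, line, and the start/end columns from a compiler message.
POSITION_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+): characters (?P<start>\d+)-(?P<end>\d+)"
)


def parse_position(message: str):
    """Return (file, line, start, end), or None if the message does not match."""
    m = POSITION_RE.match(message)
    if m is None:
        return None
    return m["file"], int(m["line"]), int(m["start"]), int(m["end"])


file, line_no, start, end = parse_position("main.hx:4: characters 23-31")

# The editor applies the captured numbers as character columns:
line = "/* Привет */ var someVar2;"
print(line[start:end])  # prints r2; instead of someVar2
```

The matcher only ever sees the numbers, so it has no opportunity to correct them before they are applied to the editor's character-based columns.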

I think, in 2016 it's safe and fair to require source files to be in UTF-8 encoding and report correct positions for that.


PS: While we're at it, we should also think about making reported character offsets 1-based instead of 0-based, to be consistent with our line numbers. This is also an issue with both ST and VS Code: they currently underline positions shifted to the left by 1 char.

@Simn Simn added this to the 4.0 milestone Apr 25, 2016
@nadako
Member Author

nadako commented Apr 25, 2016

We should look at this: https://github.com/whitequark/ulex, maybe it's not hard at all to just plug it in instead of standard lexing.

@Simn
Member

Simn commented Apr 25, 2016

I was kind of hoping to slowly move away from camlp4o, not towards it...

@nadako
Member Author

nadako commented Apr 25, 2016

Meh, what's the alternative?

@Simn
Member

Simn commented Apr 25, 2016

PPX: http://whitequark.org/blog/2014/04/16/a-guide-to-extension-points-in-ocaml/

As it happens the lexer you mentioned has been ported to that anyway: https://github.com/alainfrisch/sedlex

@nadako
Member Author

nadako commented Apr 25, 2016

Oh, nice. And we can all switch to OCaml 4 for that \o/ :-)

@Simn
Member

Simn commented Apr 25, 2016

Yes we'll have to do that eventually anyway. I just hope this is more or less easily doable for dirty Windows peasants.

@nadako
Member Author

nadako commented Apr 25, 2016

Well, I was using OCaml 4 with this http://protz.github.io/ocaml-installer/ from day one without issues.

@nadako
Member Author

nadako commented Aug 18, 2016

Alright, this was fairly easy: https://github.com/nadako/haxe/tree/sedlex

I'd love this to be merged someday, but we need to figure out our build flow. Right now I just hacked ocamlfind into a Makefile, which may or may not be a good solution, but it was surprisingly easy. Anyway, that's a question for #5174.

As for this issue, I'd love someone to check it out. For me it reports proper expression positions with Cyrillic letters in UTF-8, and I'm very happy about it!
