Lexer positions vs multi-byte file encodings #5163

nadako · 2016-04-25T12:53:40Z

We discussed this before with @Simn and decided that it's probably Haxe 4 material. Still, I'm creating this issue so we don't forget about it.

Because our lexer isn't encoding-aware, it reports byte positions within lines instead of character positions, which creates all kinds of issues for IDEs when code contains non-ASCII strings and comments.

class Main {
static function main() {
/* Privet */ var someVar1; // main.hx:3: characters 17-25
/* Привет */ var someVar2; // main.hx:4: characters 23-31
}
}
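The shift between the two reported ranges can be reproduced outside the compiler. Below is a minimal Python sketch, assuming the file is UTF-8 and the lines carry no leading indentation (which is what the reported columns imply). `offsets` is a hypothetical helper written for this illustration, not part of Haxe or any tool mentioned here.

```python
def offsets(line, token):
    """Return (character offset, UTF-8 byte offset) of token in line, both 0-based."""
    char_pos = line.index(token)
    # The byte offset is the encoded length of everything before the token.
    byte_pos = len(line[:char_pos].encode("utf-8"))
    return char_pos, byte_pos

latin = "/* Privet */ var someVar1;"
cyrillic = "/* Привет */ var someVar2;"

print(offsets(latin, "someVar1"))     # (17, 17)
print(offsets(cyrillic, "someVar2"))  # (17, 23)
```

Each of the six Cyrillic letters takes two bytes in UTF-8, which accounts for the six extra columns in the second report.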

Sometimes this can be corrected by the IDE: read the file again, split it into lines, get the specific line and encode/decode it, as is done for example in haxe-languageserver. That is ugly and inefficient enough already.
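The IDE-side correction described above could look roughly like the following Python sketch. It is not the haxe-languageserver code; `byte_col_to_char_col` and `fix_range` are hypothetical names, and the sketch assumes UTF-8 input, `\n` line endings, 1-based line numbers and 0-based byte columns.

```python
def byte_col_to_char_col(line_bytes, byte_col):
    """Convert a 0-based byte offset within a UTF-8 line to a character offset."""
    # Decode only the prefix; errors="replace" keeps this from raising
    # if the offset happens to land in the middle of a multi-byte character.
    return len(line_bytes[:byte_col].decode("utf-8", errors="replace"))

def fix_range(source, line_no, start, end):
    """Map a reported (start, end) byte range on line_no to character offsets."""
    line = source.split(b"\n")[line_no - 1]
    return byte_col_to_char_col(line, start), byte_col_to_char_col(line, end)

source = (
    "class Main {\n"
    "static function main() {\n"
    "/* Privet */ var someVar1;\n"
    "/* Привет */ var someVar2;\n"
    "}\n"
    "}\n"
).encode("utf-8")

print(fix_range(source, 4, 23, 31))  # (17, 25)
```

This counts Unicode code points. An editor that measures columns in UTF-16 code units would need one more conversion step for characters outside the Basic Multilingual Plane.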

However, editors often don't even provide a way to execute that kind of code. For example, both Sublime Text and VS Code provide a way to recognize and extract compilation errors from the output, using regex groups that capture the line and character numbers. The editor then uses that info to jump to or underline the problem place. And if we have multi-byte strings or comments on that line, we'll end up with a completely wrong position.
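To make the limitation concrete, here is a Python sketch of the kind of pattern such an editor matcher applies. The message shape is taken from the comments in the sample above; the trailing error text, the `ERROR_RE` pattern and `parse_position` are assumptions made for illustration, not the actual Sublime Text or VS Code configuration.

```python
import re

# Captures file, line, and the start/end columns exactly as printed.
ERROR_RE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+): characters (?P<start>\d+)-(?P<end>\d+)"
)

def parse_position(message):
    """Return (file, line, start, end) from a compiler message, or None."""
    m = ERROR_RE.match(message)
    if m is None:
        return None
    return (m.group("file"), int(m.group("line")),
            int(m.group("start")), int(m.group("end")))

print(parse_position("main.hx:4: characters 23-31 : some error"))
# ('main.hx', 4, 23, 31)
```

A matcher like this can only copy the numbers through. It has no hook for reopening the file and converting bytes to characters, so the columns have to be correct in the compiler output itself.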

I think that in 2016 it's safe and fair to require source files to be in UTF-8 encoding and report correct positions for that.

PS While we're at it, we also should think about making reported character offsets 1-based instead of 0-based, to be consistent with our line numbers. This is also an issue with both ST and VSCode - they currently underline positions shifted to the left by 1 char.

nadako · 2016-04-25T13:23:09Z

We should look at this: https://github.com/whitequark/ulex, maybe it's not hard at all to just plug it in instead of standard lexing.

Simn · 2016-04-25T13:28:48Z

I was kind of hoping to slowly move away from camlp4o, not towards it...

nadako · 2016-04-25T13:33:20Z

Meh, what's the alternative?

Simn · 2016-04-25T13:36:39Z

PPX: http://whitequark.org/blog/2014/04/16/a-guide-to-extension-points-in-ocaml/

As it happens the lexer you mentioned has been ported to that anyway: https://github.com/alainfrisch/sedlex

nadako · 2016-04-25T13:38:00Z

Oh, nice. And we can all switch to ocaml 4 for that \o/ :-)

Simn · 2016-04-25T13:40:18Z

Yes we'll have to do that eventually anyway. I just hope this is more or less easily doable for dirty Windows peasants.

nadako · 2016-04-25T13:41:35Z

Well, I was using OCaml 4 with this http://protz.github.io/ocaml-installer/ from day one without issues.

nadako · 2016-08-18T18:07:13Z

Alright, this was fairly easy: https://github.com/nadako/haxe/tree/sedlex

I'd love this to be merged someday, but we need to figure out our build flow. Right now, I just hacked ocamlfind into a Makefile, which may or may not be a good solution, but it was surprisingly easy. Anyway, that's a question for #5174.

As for this issue, I'd love someone to check it out. For me it reports proper expression positions with Cyrillic letters in UTF-8 and I'm very happy about it!

Simn added this to the 4.0 milestone Apr 25, 2016

nadako mentioned this issue Apr 26, 2016

Moving to OCaml 4 #5174

Closed

skial mentioned this issue Aug 19, 2016

Haxe Roundup 368 skial/haxe.io#325

Closed

Gama11 mentioned this issue Mar 2, 2017

Invalid positions reported by diagnostics when file is not ascii-only vshaxe/vshaxe#98

Closed

nadako added a commit to nadako/haxe that referenced this issue Apr 11, 2017

use sedlex for utf-8 aware lexing (closes HaxeFoundation#5163)

b587ff0

nadako added a commit to nadako/haxe that referenced this issue Apr 11, 2017

use sedlex for utf-8 aware lexing (closes HaxeFoundation#5163)

4941ec4

nadako mentioned this issue Apr 11, 2017

use sedlex for utf-8 aware lexing (closes #5163) #6172

Merged

nadako added a commit to nadako/haxe that referenced this issue Apr 18, 2017

use sedlex for utf-8 aware lexing (closes HaxeFoundation#5163)

fd704cb

nadako added a commit to nadako/haxe that referenced this issue Apr 28, 2017

use sedlex for utf-8 aware lexing (closes HaxeFoundation#5163)

3f30682

Simn closed this as completed in 2f53fe4 May 1, 2017

skial mentioned this issue May 2, 2017

Haxe Roundups 382 skial/haxe.io#386

Closed
