Change character units from UTF-16 code unit to Unicode codepoint #376

Open
MaskRay opened this Issue Jan 13, 2018 · 19 comments


MaskRay commented Jan 13, 2018

Text document offsets are based on a UTF-16 string representation. This is odd, given that the text contents themselves are transmitted in UTF-8.

Text Documents
… The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
	range?: Range;
	rangeLength?: number;
	text: string;
}

Would it be more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character could be measured in units of Unicode codepoints instead of UTF-16 code units.
A line cannot be too long, so the extra computation to find the Nth Unicode codepoint would not place much of a burden on editors and language servers.
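To make the difference concrete, here is a minimal TypeScript sketch (not from the spec) showing how the two units diverge as soon as a line contains a character outside the Basic Multilingual Plane:

```typescript
// A line containing one astral-plane character: "𐐀" (U+10400).
// In UTF-16 it is a surrogate pair (2 code units); as a codepoint it is 1 unit.
const line = "𐐀abc";

// Column of "b" measured in UTF-16 code units (what LSP currently mandates):
const utf16Col = line.indexOf("b");          // 3: the surrogate pair counts as 2

// Column of "b" measured in Unicode codepoints (what this issue proposes):
const codepointCol = [...line].indexOf("b"); // 2: "𐐀" counts as 1

console.log(utf16Col, codepointCol);
```

Any client or server whose native unit is not UTF-16 code units has to perform this kind of conversion on every position it sends or receives.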

cquery-project/cquery#57

@MaskRay MaskRay changed the title Change character offset calculation from UTF-16 to UTF-8 Change character units from UTF-16 code unit to Unicode codepoint Jan 13, 2018

szatanjl commented Jan 18, 2018

I would suggest going even one step further. Why should editors and servers have to know which bytes form a Unicode codepoint? Right now the specification states that it supports only the UTF-8 encoding, but given the Content-Type header I guess there is an idea of supporting other encodings in the future too. It would then be even better to use the number of bytes instead of UTF-16 code units or Unicode codepoints.

dbaeumer (Member) commented Jan 18, 2018

@MaskRay we need to distinguish this from the encoding used to transfer the JSON-RPC message. We currently use UTF-8 there, but as the header indicates this can be changed to any encoding, assuming that the encoding is supported in all libraries (for example, Node by default supports only a limited set of encodings).

The column offset in a document assumes that, after the JSON-RPC message has been decoded during parsing, the document content is stored in a UTF-16 representation. We chose UTF-16 here since most languages store strings in memory in UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

If we want to support UTF-8 for the internal text document representation and line offsets, this would either be a breaking change or need to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UIs; for example, a reference match in a file could not be rendered in a list using byte offsets. The client would need to read the content of the file and convert the offset into line/column. We decided to let the server do this, since the server has very likely read the file before anyway.
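For illustration, a minimal sketch (in TypeScript, names invented here) of the offset-to-position conversion that an offset-based protocol would push onto clients:

```typescript
// Hypothetical sketch: convert a byte offset in a UTF-8 buffer into a
// { line, character } position by counting newlines up to the offset.
function byteOffsetToPosition(
  utf8Bytes: Uint8Array,
  offset: number
): { line: number; character: number } {
  let line = 0;
  let lineStart = 0;
  for (let i = 0; i < offset; i++) {
    if (utf8Bytes[i] === 0x0a) { // '\n'
      line++;
      lineStart = i + 1;
    }
  }
  // "character" here is a byte column within the line; a real client would
  // still need to transcode the line to whatever unit its UI uses.
  return { line, character: offset - lineStart };
}
```

This requires scanning the file contents, which is exactly the work the current design keeps on the server side.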

Marwes (Contributor) commented Jan 18, 2018

We choose UTF-16 encoding here since most language store strings in memory in UTF-16 not UTF-8.

Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type in those languages as well, which makes it dubious to optimize for that case. Source code itself is usually UTF-8 (or just ASCII), and as has been said this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.

puremourning commented Jan 18, 2018

We choose UTF-16 encoding here since most language store strings in memory in UTF-16 not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead which is easy to support.

Citation needed? ;)

Of the 7 downstream language completers we support in ycmd:

  • 1 uses byte offsets (libclang)
  • 6 use Unicode codepoints (gocode, tern, tsserver*, jedi, racer, omnisharp*)
  • 0 use UTF-16 code units

* full disclosure, I think these use code points, else we have a bug!

The last is a bit of a fib, because we're integrating Language Server API for java.

However, as we receive byte offsets from the client and internally use Unicode codepoints, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units, then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units.
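The code-unit counting step described above can be sketched as follows (a hypothetical helper, not ycmd's actual code):

```typescript
// Given a column counted in Unicode codepoints, produce the UTF-16
// code-unit column that LSP expects for the same position.
function codepointColToUtf16Col(line: string, codepointCol: number): number {
  let utf16Col = 0;
  let seen = 0;
  for (const ch of line) {   // for...of iterates by codepoint
    if (seen === codepointCol) break;
    utf16Col += ch.length;   // 2 for astral codepoints, 1 otherwise
    seen++;
  }
  return utf16Col;
}
```

For a purely ASCII line the two columns coincide, so implementations that skip this conversion appear to work until a non-BMP character shows up.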

Of the client implementations of ycmd (there are about 8, I think), all of them are able to provide line-byte offsets. I don't know about all of them for certain, but certainly the main one (Vim) is not able to provide UTF-16 code units; they would have to be calculated.

Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is such, and changing it would be breaking. Just my 2p

puremourning commented Jan 19, 2018

Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion

You have 5 codepoints. One of those codepoints is outside the Basic Multilingual Plane, which means the UTF-16 encoding has to use two code units for that character.

In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS-2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

MaskRay (Author) commented Jan 19, 2018

Emacs uses some extended UTF-8 and its functions return numbers in units of Unicode codepoints.

https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657

@vibhavp for Emacs lsp-mode internal representation

szatanjl commented Jan 19, 2018

I am sorry in advance if I am saying something stupid right now. I have a question for you guys.

My thought process is: if there is a file in an encoding other than any UTF, and we use an encoding other than UTF in JSON-RPC (which could happen in the future), then why would the client and server need to know anything about Unicode at all?

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets.

That's it. It is easy to provide a line-byte offset. So why would it be better to use Unicode codepoints instead of bytes?

Let's say, for example, we have a file encoded in ISO-8859-1 and we use the same encoding for the JSON-RPC communication. There is a character ä (0xE4) that can be represented in at least two ways in Unicode: U+00E4 (ä) or U+0061 (a) followed by U+0308 (combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally good and correct. If the client uses one and the server the other, we have a problem. Simply using a line-byte offset here, we would avoid these problems.
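The ä ambiguity is easy to demonstrate in TypeScript (a minimal illustration of the point, using the standard `normalize` method):

```typescript
// One precomposed codepoint vs. a base letter plus combining diaeresis.
// Both render identically as "ä", but their codepoint counts differ.
const precomposed = "\u00E4";  // ä as a single codepoint
const decomposed = "a\u0308";  // a + combining diaeresis: two codepoints

console.log([...precomposed].length);  // 1
console.log([...decomposed].length);   // 2

// After NFC normalization the two strings compare equal.
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC"));  // true
```

So unless both sides normalize identically, codepoint-based columns for the same visible text can disagree.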

@dbaeumer I think we misunderstood each other, or at least I did. I didn't mean to use a byte offset from the beginning of the file, which would require the client to convert it, but to still use a {line, column} pair, counting the column in bytes instead of UTF-16 code units or Unicode codepoints.

eonil commented Jan 22, 2018

We choose UTF-16 encoding here since most language store strings in memory in UTF-16 not UTF-8.

If we want to support UTF-8 for internal text document representation and line offsets this would be a breaking change or needs to be a capability the client announces.

Are you serious? UTF-16 is one of the worst choices of the old days, made for lack of alternatives. Now we have UTF-8, and to choose UTF-16 you need a really good reason rather than a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software.

This assumption holds on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable since the leadership of this project is from Microsoft, but this is too much. It reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision-making forever.

kdvolder commented Jan 22, 2018

Just to be clear, I don't work for Microsoft and generally haven't been a big fan of theirs (being a Linux user myself). But I feel compelled to defend the LSP/vscode team here. I really don't think there's a big conspiracy here. From where I stand, it looks to me like the vscode and LSP teams are doing their very best to be inclusive and open.

The UTF-8 vs. UTF-16 choice may seem like a big and important point to some, but to others, including myself, it probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec whatever conforms to your current prototype implementation, and I think this is perfectly reasonable.

Some may think that is a mistake. As this is an open spec, subject to change, revision, and discussion, everyone is free to voice their opinion and argue about which choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, those insinuations are really unwarranted here, in my opinion).

eonil commented Jan 23, 2018

I apologize for bringing my political views into my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. Now I see that this spec is in progress and subject to change.

I didn't mention technical reasons because they are mainly repetitions of other people's opinions, or well known. Anyway, I list my technical reasons here.

  • IMO, UTF-8 is the present and future, and UTF-16 is a legacy to avoid. The reason is here.
  • By requiring a dependency on UTF-16, LSP effectively forces implementations to involve that legacy.
  • Simplicity is better than extra complexity and dependencies. One encoding everywhere is better.
  • More complexity and dependencies greatly increase the amount of implementation work.
  • AFAIK, converting indices between different Unicode encodings is expensive.
  • LSP is a new protocol. There is no reason to involve a bad legacy. The only benefit here is a potential benefit to specific platforms with native UTF-16 strings.
  • For now, the only reason to require UTF-16 is to give such a benefit to specific implementations.
  • Other platforms wouldn't be very happy, due to increased complexity and a potential performance penalty in implementation.
  • Such an unfair benefit is likely to split the community.

... or needs to be a capability the client announces

I think this is fine: an optional field that designates the encoding of the indices, alongside the index numbers themselves. If the encoding mode is set to utf-8, interpret the numbers as UTF-8 code units; if it is utf-16, interpret them as UTF-16 code units. If the field is missing, fall back to UTF-16 for legacy compatibility.
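A rough TypeScript sketch of what such a capability could look like (the field and value names here are invented for illustration, not part of the LSP spec):

```typescript
// Hypothetical client capability: encodings the client can produce and
// consume for Position.character, in order of preference.
interface PositionEncodingCapability {
  positionEncodings?: ("utf-8" | "utf-16" | "utf-32")[];
}

// The server picks the client's preferred encoding; an absent field
// falls back to UTF-16 for legacy compatibility.
function pickEncoding(client?: PositionEncodingCapability): string {
  return client?.positionEncodings?.[0] ?? "utf-16";
}
```

The fallback keeps every existing client working unchanged, while new clients can opt in to a different unit.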

sam-mccall commented Apr 13, 2018

This is causing us some implementation difficulty in clangd, which needs to interop with external indexes.
Using UTF-16 on network protocols is rare, so requiring indexes to provide UTF-16 column numbers is a major yak-shave and breaks abstractions.

udoprog commented Apr 13, 2018

Yup, same problem here working on reproto/reproto#34.

This would be straightforward if "Line/character can be measured in units of Unicode codepoints", as stated in the original description.

dbaeumer (Member) commented Apr 27, 2018

As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:

  • client announces the encodings it supports for position encodings.
  • server picks an encoding to use.

If no common encoding can be found, the server will not function with that client. So in the end such a change will force clients to support the union of the commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we only have a limited number of clients compared to a large number of servers, so it might not be too difficult for the clients to do the adoption.

I would appreciate a PR for this that for example does the following:

jclc commented May 13, 2018

What about using byte indices directly? Using codepoints still requires going through every single character.

udoprog commented May 13, 2018

@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice:

Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them.
In order to be backwards compatible, the exchange has to be bidirectional. All servers have to support UTF-16 and fall back to it when the client indicates that this is its only capability, at least until a new major revision of the LSP has been widely adopted and the old one deprecated.

Using codepoints still requires to go through every single character.

This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up to that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encodings of both server and client match.

Edit: The reason I prefer codepoints over bytes is that they are inherently unambiguous. Every language dealing with Unicode must have ways of traversing strings and relating codepoint counts to indexes, regardless of which specific encodings are well supported.
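The per-row conversion described above is cheap in either direction; for example, a sketch (hypothetical helper names) of mapping a UTF-8 byte column to a codepoint column once the row's bytes are in hand:

```typescript
// Relate a byte column to a codepoint column within one line of UTF-8.
// Only lead bytes start a codepoint; continuation bytes match 0b10xxxxxx.
function utf8ByteColToCodepointCol(lineBytes: Uint8Array, byteCol: number): number {
  let codepoints = 0;
  for (let i = 0; i < byteCol; i++) {
    if ((lineBytes[i] & 0xc0) !== 0x80) codepoints++;
  }
  return codepoints;
}
```

Note this is a single linear pass over one line, not over the whole document, which is why row-indexed storage keeps the cost low.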

eonil commented May 14, 2018

I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is never actually well defined on its own.

Anyway, the LSP3 spec defines "character offset" in terms of UTF-16 code units, which means it implicitly defines the term "character" as a UTF-16 code unit as well. This is (1) nonsense, as a UTF-16 code unit is not intended to be a character, and (2) inconsistent with the other parts of the protocol, which are UTF-8 based.

In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition of "character" increases ambiguity and the potential for bugs.


As far as I know, Unicode defines three concepts worth considering for addressing text:

  • Code Unit
  • Code Point
  • Grapheme Cluster

And the closest concept to a human's perceived "character" is the grapheme cluster, as it counts glyphs rather than code.

As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal unit: grapheme cluster counting. This is better than code points and less ambiguous, in my opinion.

Furthermore, the grapheme cluster count is very likely already tracked by code editors in order to provide precise line/column (or character offset) information to end users, so tracking it wouldn't be a problem for them.

There would be two distinct position/offset counting modes: (1) legacy-compatible and (2) grapheme-cluster-counting.

  • legacy-compatible mode is the same as the current behavior: it defines "character" as a UTF-16 code unit.
  • grapheme-cluster-counting mode defines "character" as a grapheme cluster and uses the count of grapheme clusters as the position offset.

In LSP3, servers should support both the legacy-compatible (deprecated, but default) and grapheme-cluster-counting modes.
In LSP4, grapheme-cluster-counting would be the only counting method.


If grapheme cluster counting is unacceptable, UTF-8 code unit (i.e. encoded byte count) counting can be considered instead. The character offset becomes an irregular index, but it would be consistent with the rest of the spec.

udoprog commented May 14, 2018

@eonil Regarding grapheme clusters:

The exact composition of clusters is permitted to vary across (human) languages and locales ("tailored grapheme clusters"). They also naturally vary from one revision of the Unicode spec to another as new clusters are added. Finally, in my experience, iterating over grapheme clusters is not commonly supported in standard libraries.

eonil commented May 14, 2018

@udoprog I see. If grapheme clusters are unstable across implementations, I think they should not be used.

oblitum commented Dec 21, 2018

Sorry to say but this is some kind of WCHAR legacy at full force.
