TEDIT:  characters, bytes, and NS byte encoding

I've started my foray into the innards of Tedit, with the objectives of finding and fixing the bugs we have observed, rationalizing its interaction with different character encodings, and providing a consistent programmatic interface for applications (like LFG) that use the basic character primitives to interpret Tedit streams like streams on any other file.

Tedit was constructed when the NS character encoding was used as the standard not only for associating glyphs with 16-bit codes but also for representing 16-bit codes in byte sequences on files.  The code we have now reflects that evolution, in that there isn't a clean separation between implementation levels that deal in bytes and byte sequences and levels that deal with characters.  The code is also complex because it embodies an enormous number of careful optimizations that may have little value in our current configuration.

This confusion is mostly invisible (modulo bugs) at the editing user-interface level, but it becomes apparent when plain-text files are to be read or written.  It also becomes visible in the programmatic interface:  it is not entirely clear how the position arguments and values of functions like GETFILEPTR, SETFILEPTR, and GETEOFPTR are interpreted, whether they count bytes in an underlying plaintext file in some cases or characters in Tedit files in others.  And sometimes BIN will return an NS character-shift byte (255), and READCCODE will also sometimes do this, if the stream is positioned at the wrong place.

I think this needs to be cleaned up and made consistent, hopefully with the result that the code becomes simpler and more modular.  My initial proposal (comments please) is that the "bytes" of Tedit streams are defined to be 16-bit character codes (with an occasional imageobj).  Essentially, BIN and \INCCODE would always return the same things.  And all of the positional functions would count out in characters:  (SETFILEPTR xxx 25) means that the next BIN or \INCCODE will return the 25th 16-bit code in the stream's enumeration sequence.  This might correspond to a byte at some arbitrarily later position in an underlying file stream, but that is a purely internal fact that will not be exposed to a caller.
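To make the proposal concrete, here is a rough Python sketch (not Tedit code) of a stream whose positions count characters rather than bytes.  It assumes a deliberately simplified run-coded format in which byte 255 followed by a byte C selects character set C for the bytes that follow; the real NS file encoding may have further cases that are not modeled here.  The names decode_runs and CharStream, and the zero-based positions, are assumptions of the sketch.

```python
SHIFT = 255  # character-set shift byte in this simplified model


def decode_runs(data: bytes) -> list[int]:
    """Decode a simplified run-coded byte sequence into 16-bit codes.

    Assumed model: SHIFT followed by a byte C selects character set C;
    every later byte B then stands for the code (C << 8) | B.
    """
    codes = []
    charset = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b == SHIFT:
            if i + 1 >= len(data):
                raise ValueError("shift byte at end of data")
            charset = data[i + 1]
            i += 2
        else:
            codes.append((charset << 8) | b)
            i += 1
    return codes


class CharStream:
    """Hypothetical stream whose file pointer counts characters, not bytes."""

    def __init__(self, data: bytes):
        self._codes = decode_runs(data)  # eager decode, for simplicity
        self._pos = 0

    def set_file_ptr(self, n: int) -> None:
        if not 0 <= n <= len(self._codes):
            raise ValueError("position out of range")
        self._pos = n

    def get_file_ptr(self) -> int:
        return self._pos

    def get_eof_ptr(self) -> int:
        # Number of characters, not the number of bytes in the backing data.
        return len(self._codes)

    def bin(self) -> int:
        # Always a full 16-bit code, never a raw shift byte.
        code = self._codes[self._pos]
        self._pos += 1
        return code


# 9 bytes that encode 5 characters; the byte at offset 2 is a shift byte.
data = bytes([0x41, 0x42, SHIFT, 0x26, 0x61, 0x62, SHIFT, 0x00, 0x43])
stream = CharStream(data)
stream.set_file_ptr(2)
third_code = stream.bin()
```

In this sketch a byte-counting pointer set to 2 would land on the shift byte, while the character-counting pointer lands on the code 0x2661, and the caller never sees the shift bytes at all.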

Similarly, GETEOFPTR of a Tedit stream will return the number of characters in the stream, not the number of bytes that those characters might occupy in any particular byte-sequence representation.

This may introduce a performance delay when Tedit is opened on a plain text file.  In principle, the whole file has to be scanned to figure out how many characters there are and where they are located in the byte sequence, perhaps copying the whole file into an in-core cache.  Opening a very large file might take some time and use a bit of space.  But that expansion can also be done incrementally:  only cache the characters that are visible initially or through scrolling, or that must be decoded to get to some explicit character position (SETFILEPTR to a high value) or to find out the total length (GETEOFPTR).  Those characters would be recoded into byte sequences according to the original file's external format when the file is saved.
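The incremental approach could look roughly like the following Python sketch, which is only an illustration and not Tedit code.  It assumes UTF-8 as the external format (any variable-length format would do) and keeps a table of byte offsets that grows only as far as a requested character position requires.  The class LazyCharIndex and its method names are hypothetical.

```python
class LazyCharIndex:
    """Maps character positions to byte offsets, decoding only as far as needed."""

    def __init__(self, data: bytes):
        self._data = data
        self._offsets = []  # _offsets[k] = byte offset where character k starts
        self._next = 0      # byte offset of the first character not yet indexed

    @staticmethod
    def _width(lead: int) -> int:
        # Length of a UTF-8 sequence, determined from its lead byte.
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        if lead >> 3 == 0b11110:
            return 4
        raise ValueError("invalid UTF-8 lead byte")

    def _extend_to(self, n: int) -> None:
        # Index characters until n are known or the data is exhausted.
        while len(self._offsets) < n and self._next < len(self._data):
            self._offsets.append(self._next)
            self._next += self._width(self._data[self._next])

    def indexed(self) -> int:
        # How many characters have been located so far.
        return len(self._offsets)

    def code_at(self, pos: int) -> int:
        # Character code at a zero-based character position.
        if pos < 0:
            raise IndexError("negative character position")
        self._extend_to(pos + 1)
        if pos >= len(self._offsets):
            raise IndexError("character position past end of data")
        start = self._offsets[pos]
        end = start + self._width(self._data[start])
        return ord(self._data[start:end].decode("utf-8"))

    def eof_ptr(self) -> int:
        # The total length is only known after scanning everything:
        # there can be at most one character per byte.
        self._extend_to(len(self._data))
        return len(self._offsets)


index = LazyCharIndex("a\u00e9\u20acb".encode("utf-8"))  # 4 characters, 7 bytes
```

Asking for the character at position 1 indexes only the first two characters; asking for the end-of-file pointer forces the full scan, which is the cost described above for GETEOFPTR.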

In sum:  BIN and \INCCODE would return the same 16-bit values for Text streams, and position functions would count out in characters, not bytes.  There will be no direct way of figuring out from a character position what its corresponding byte position might be in a backing file, other than by counting out from the beginning.

Tedit streams will use the domestic mapping of codes to glyphs (XCCS, until we get better Unicode fonts), and Tedit binary files will store those character codes as it does now.  Plaintext files will map those domestic codes according to their external formats.

TEDIT: characters, bytes, and NS byte encoding #861
