Alexander Shtuchkin edited this page May 17, 2020 · 5 revisions

Advice for working with encodings

  • Read Joel Spolsky's Guide to Unicode and Character Sets.
  • Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences and are naturally represented in a Node.js program as Buffers or streams of Buffers, not strings.
  • When you read from an external source and would like to convert data to strings:
    • Be sure to know the character encoding of the data. In general, it cannot be deduced automatically.
    • Provide the original Buffers as input to the decode() function, along with the correct encoding name.
    • If you already have strings somewhere in your program, then decoding has already happened, most likely using the 'utf-8' encoding. You cannot convert them to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass them to iconv-lite. See more details.
    • It is tricky to convert encodings when you get data as a Node stream. In these cases, use the Streaming API (e.g. iconv.decodeStream()) to make sure that boundary cases, such as a multi-byte character split across two chunks, are handled.
  • When you write to an external resource:
    • Decide which encoding you would like to use. The most popular and safest choice is utf-8, which is also the default in Node.
    • Use the Streaming API if you work with streams.
    • If you don't encode strings yourself, Node.js will do it for you using the default encoding.
  • FYI, JavaScript strings are stored in memory as UTF-16.
    • If you work with Chinese ideographs or rare characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
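
The pitfalls in the list above can be reproduced with nothing but Node's built-in Buffer. This is only a sketch: it uses Buffer's own 'utf8' and 'latin1' decoding as a stand-in for iconv-lite's decode(), and the byte values are made up for illustration.

```javascript
// Suppose a stream delivers the UTF-8 bytes of "€" (E2 82 AC) in two chunks.
const chunks = [Buffer.from([0xe2, 0x82]), Buffer.from([0xac])];

// Decoding each chunk on its own splits the character and loses it.
const wrong = chunks.map((chunk) => chunk.toString('utf8')).join('');

// Concatenating the original Buffers first and decoding once works.
const right = Buffer.concat(chunks).toString('utf8');

// Once bytes were decoded with the wrong encoding, the string cannot be fixed.
const latin1Bytes = Buffer.from([0xe9]);             // "é" in latin1
const misdecoded = latin1Bytes.toString('utf8');     // replacement character U+FFFD
const recovered = Buffer.from(misdecoded, 'utf8').toString('latin1'); // not "é"

// A character outside the Basic Multilingual Plane takes two UTF-16 code units.
const rare = '\u{20BB7}';
const codeUnits = rare.length;        // 2 (a surrogate pair)
const codePoints = [...rare].length;  // 1 (iteration goes by code point)
```

The only reliable fix for the `misdecoded` case is to go back to `latin1Bytes`, which is why the original Buffers need to be kept until the encoding is known.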

How to / Internals

Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed; 2) the result is used as a key in the iconv.encodings object to retrieve the codec.
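
Step 1 can be sketched as a one-line normalization. The helper name `canonicalizeName` is hypothetical and the library's actual implementation may handle extra cases. The sketch only shows why differently spelled names end up on the same key.

```javascript
// Hypothetical helper: lowercase the name, then drop everything
// that is not an ASCII letter or digit.
function canonicalizeName(name) {
  return String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
}

const a = canonicalizeName('UTF-8');        // 'utf8'
const b = canonicalizeName('Windows_1251'); // 'windows1251'
const c = canonicalizeName('ISO 8859-1');   // 'iso88591'
```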

Q: How do I add aliases to encodings?
A: In your project, set iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.
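
The reason the alias key must already be normalized can be seen in a toy model. The `encodings` object below merely stands in for iconv.encodings, and its contents and the `resolve` helper are hypothetical. The model assumes that a string value means "this is an alias for that key".

```javascript
// Toy registry standing in for iconv.encodings (contents are hypothetical).
const encodings = {
  utf8: { type: 'toy-codec' },
};

// Hypothetical lookup: normalize the requested name, then follow aliases.
function resolve(name) {
  let key = String(name).toLowerCase().replace(/[^0-9a-z]/g, '');
  const seen = new Set();
  while (typeof encodings[key] === 'string') {
    if (seen.has(key)) throw new Error('Alias loop at: ' + key);
    seen.add(key);
    key = encodings[key];
  }
  if (!Object.prototype.hasOwnProperty.call(encodings, key)) {
    throw new Error('Encoding not recognized: ' + name);
  }
  return key;
}

encodings['myalias'] = 'utf8';   // normalized key: reachable
encodings['my-alias2'] = 'utf8'; // contains '-': lookups never produce this key

const found = resolve('My-Alias'); // 'utf8'
```

Here `resolve('my-alias2')` throws, because the requested name is normalized to 'myalias2' before the lookup and that key was never registered.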

Q: How do I add a new single-byte encoding?
A: See the 'maccenteuro' encoding in encodings/sbcs-data.js for an example.
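
Conceptually, a single-byte encoding is just a lookup table from byte values to characters. The sketch below assumes a definition given as a 128-character string for bytes 0x80 to 0xFF, with the lower half being plain ASCII. Check encodings/sbcs-data.js for the layout the library actually expects. The mapping used here is invented and does not correspond to any real encoding.

```javascript
// Invented upper half: bytes 0x80..0xFF map to U+0400..U+047F.
const upperHalf = Array.from({ length: 128 }, (_, i) =>
  String.fromCharCode(0x0400 + i)
).join('');

// Toy decoder: ASCII bytes map to themselves, the rest go through the table.
function decodeSingleByte(buf, chars) {
  let out = '';
  for (const byte of buf) {
    out += byte < 0x80 ? String.fromCharCode(byte) : chars[byte - 0x80];
  }
  return out;
}

const text = decodeSingleByte(Buffer.from([0x48, 0x69, 0x80, 0x90]), upperHalf);
// text is 'Hi' followed by U+0400 and U+0410
```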

Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it's done, and add the sources for your encoding there. The current multi-byte codec is very versatile and should be enough for most encodings.

Q: What is the format of tables (encodings/tables/*)?
A: It is a JSON array of chunks. Each chunk represents a continuous mapping from multibyte codes to Unicode. The first element of a chunk is a hexadecimal 'address': the multibyte code that corresponds to the start of the chunk. The remaining elements are a mix of strings and integers. A string lists the Unicode characters that correspond to sequential multibyte codes. An integer gives the length of a run of incrementing Unicode characters that continues from the last character of the previous string, à la RLE encoding.
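
A minimal sketch of how one chunk could be expanded into a code-to-character map. The function name `expandChunk` and the sample chunk are hypothetical. The sketch assumes that an integer N means "N more characters, each one code point after the previous", and it ignores any special cases the real tables may contain.

```javascript
// Expand one chunk, e.g. ['a1a1', 'AB', 3, 'x'], into a Map of
// multibyte code -> Unicode character.
function expandChunk(chunk) {
  let addr = parseInt(chunk[0], 16); // hexadecimal start address
  let last = -1;                     // code point of the last mapped character
  const table = new Map();
  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // Each character maps to the next sequential multibyte code.
      for (const ch of part) {
        last = ch.codePointAt(0);
        table.set(addr++, ch);
      }
    } else if (typeof part === 'number') {
      // Run of incrementing characters continuing after the last one.
      for (let i = 0; i < part; i++) {
        last += 1;
        table.set(addr++, String.fromCodePoint(last));
      }
    }
  }
  return table;
}

const table = expandChunk(['a1a1', 'AB', 3, 'x']);
// 0xa1a1 -> 'A', 0xa1a2 -> 'B', 0xa1a3..0xa1a5 -> 'C', 'D', 'E', 0xa1a6 -> 'x'
```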

Q: Why was this format chosen?
A: It's visual. You can easily check that the table is correct. Also, it's quite compact and easy to work with, as it's just JSON.

Q: How do I add a completely new encoding that is not reducible to multi-byte (a stateful one, for example)?
A: You'll need to write a codec for it. Please look at examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.
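
To illustrate what "stateful" means in practice, here is a toy decoder that carries a leftover byte between calls. The class name and the write()/end() shape are assumptions made for this sketch, so treat the codec files listed above as the authority on the real interface. UTF-16BE is used only because it makes the state easy to see, and surrogate validation is left out.

```javascript
// Toy stateful decoder for UTF-16BE: two bytes form one code unit, so an
// odd trailing byte has to be remembered until the next chunk arrives.
class ToyUtf16BeDecoder {
  constructor() {
    this.pending = null; // leftover high byte from the previous chunk
  }

  write(buf) {
    let out = '';
    let i = 0;
    if (this.pending !== null && buf.length > 0) {
      out += String.fromCharCode((this.pending << 8) | buf[0]);
      this.pending = null;
      i = 1;
    }
    for (; i + 1 < buf.length; i += 2) {
      out += String.fromCharCode((buf[i] << 8) | buf[i + 1]);
    }
    if (i < buf.length) this.pending = buf[i]; // keep the odd byte
    return out;
  }

  end() {
    // A dangling byte at end of input cannot form a character.
    if (this.pending === null) return '';
    this.pending = null;
    return '\uFFFD';
  }
}

// "hi€" in UTF-16BE is 00 68 00 69 20 AC, split here at awkward places.
const decoder = new ToyUtf16BeDecoder();
const decoded =
  decoder.write(Buffer.from([0x00, 0x68, 0x00])) +
  decoder.write(Buffer.from([0x69, 0x20])) +
  decoder.write(Buffer.from([0xac])) +
  decoder.end();
```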

Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.