Home

Advice for working with encodings

Read Joel Spolsky's Guide to Unicode and Character Sets.
Keep in mind that all external resources (files, HTTP pages, etc.) are byte sequences. In a Node.js program they are naturally represented as Buffers or streams of Buffers, not strings.
When you read from an external source and would like to convert data to strings:
- Be sure to know the character encoding of the data. In general, it cannot be deduced automatically.
- Pass the original Buffers to the decode() function, along with the correct encoding name.
- If you already have strings somewhere in your program, decoding has already happened, most likely using the 'utf-8' encoding. You cannot convert such a string to another encoding at this stage. You need to get the original Buffers, concat() them if needed, and pass them to iconv-lite. See more details.
- Converting encodings is tricky when the data arrives as a Node stream, because a multi-byte character can be split across two chunks. In these cases, use the Streaming API (e.g. iconv.decodeStream()) to make sure the boundary cases are handled.
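The decoding advice above can be illustrated with a small sketch. It uses only Node's built-in Buffer and string_decoder modules so that it runs without dependencies. With iconv-lite, iconv.decode(buffer, encoding) and iconv.decodeStream(encoding) play the corresponding roles for encodings Node does not support natively. The byte values and variable names are chosen purely for illustration.

```javascript
const { StringDecoder } = require('string_decoder');

// 1) Decoding with the wrong encoding is lossy.
// These three bytes spell "При" in windows-1251; they are not valid UTF-8.
const cp1251Bytes = Buffer.from([0xcf, 0xf0, 0xe8]);
const wronglyDecoded = cp1251Bytes.toString('utf8');
// The invalid sequences were replaced with U+FFFD, so re-encoding the
// string cannot give the original bytes back.
const roundTripped = Buffer.from(wronglyDecoded, 'utf8');
console.log(roundTripped.equals(cp1251Bytes)); // false

// 2) Decoding chunk by chunk breaks characters that straddle a boundary.
const euro = Buffer.from('\u20ac', 'utf8'); // the euro sign, 3 bytes: e2 82 ac
const chunks = [euro.subarray(0, 1), euro.subarray(1)];

// Naive: each chunk is decoded on its own, so the split character is mangled.
const naive = chunks.map((c) => c.toString('utf8')).join('');

// Safe option A: concatenate the Buffers first, then decode once.
const concatenated = Buffer.concat(chunks).toString('utf8');

// Safe option B: a stateful decoder that holds back incomplete characters.
const decoder = new StringDecoder('utf8');
const streamed = chunks.map((c) => decoder.write(c)).join('') + decoder.end();

console.log(naive === '\u20ac');        // false
console.log(concatenated === '\u20ac'); // true
console.log(streamed === '\u20ac');     // true
```

Option A is the simplest when the whole input fits in memory. Option B is the pattern a streaming decoder follows when it does not.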
When you write to an external resource:
- Decide which encoding you would like to use. The most popular and safest choice is utf-8, which is also the default in Node.
- Use the Streaming API if you work with streams.
- If you don't encode strings yourself, Node.js will do it for you using the default encoding.
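As a rough illustration of the default-encoding point, the sketch below compares the bytes Node produces when no encoding is named with the bytes for an explicit choice. It relies only on the built-in Buffer. For encodings Node does not ship, iconv-lite's iconv.encode(string, encoding) would take the place of Buffer.from. The sample string is an arbitrary assumption.

```javascript
const text = 'caf\u00e9'; // "café"

// No encoding given: Node falls back to its default, UTF-8.
const implicit = Buffer.from(text);
const explicitUtf8 = Buffer.from(text, 'utf8');

// The same string in a single-byte encoding yields different bytes.
const latin1 = Buffer.from(text, 'latin1');

console.log(implicit.equals(explicitUtf8));      // true
console.log(explicitUtf8.length, latin1.length); // 5 4
```

The byte count differs because "é" takes two bytes in UTF-8 and one in latin1, which is why the reader of the output must be told which encoding was used.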
FYI, JavaScript strings are stored in memory as UTF-16.
- If you work with Chinese ideographs or rare characters outside the Basic Multilingual Plane, be sure to familiarize yourself with surrogate pairs. They can be a pain to work with.
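A short sketch of what surrogate pairs look like in practice, using U+20BB7 (a CJK ideograph outside the Basic Multilingual Plane) as an arbitrary example character:

```javascript
const ideograph = '\u{20BB7}'; // a CJK character outside the BMP

// .length counts UTF-16 code units, not characters.
console.log(ideograph.length);      // 2
// Spreading iterates by code point instead.
console.log([...ideograph].length); // 1

// The two code units are the high and low surrogate.
console.log(ideograph.charCodeAt(0).toString(16));  // d842
console.log(ideograph.charCodeAt(1).toString(16));  // dfb7
console.log(ideograph.codePointAt(0).toString(16)); // 20bb7

// Slicing by code unit can split the pair and leave a lone surrogate,
// which is no longer the original character.
const broken = ideograph.slice(0, 1);
console.log(broken === ideograph); // false
```

Iterating with for...of or the spread operator, and using codePointAt() rather than charCodeAt(), avoids most of these pitfalls.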

How to / Internals

Q: How are encoding names matched?
A: 1) They are lowercased and all non-alphanumeric characters are removed; 2) the result is used as a key in the iconv.encodings object to retrieve the codec.

Q: How do I add aliases to encodings?
A: In your project, set iconv.encodings['newalias'] = 'encoding'. The alias must be lowercase and have all non-alphanumeric characters removed.
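The name matching and alias rules from the two answers above can be sketched roughly as follows. This is not iconv-lite's actual code: the encodings table, canonicalize() and lookup() are hypothetical stand-ins written for illustration, and following alias chains in a loop is an assumption about how string-valued entries are resolved.

```javascript
// Hypothetical stand-in for iconv.encodings: string values are aliases,
// object values stand for codec definitions.
const encodings = Object.assign(Object.create(null), {
  utf8: { type: 'stand-in codec' },
  unicode11utf8: 'utf8',
});

// Lowercase the name and strip everything that is not a letter or digit.
function canonicalize(name) {
  return String(name).toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Look the canonical name up, following aliases until a codec is reached.
function lookup(name) {
  let entry = encodings[canonicalize(name)];
  while (typeof entry === 'string') entry = encodings[entry];
  if (entry === undefined) throw new Error(`Encoding not recognized: ${name}`);
  return entry;
}

// Adding an alias: the key must already be in canonical form.
encodings['myalias'] = 'utf8';

console.log(canonicalize('ISO-8859-1'));                 // iso88591
console.log(lookup('UTF-8') === lookup('My_Alias'));     // true
```

Because lookups always go through canonicalization, an alias stored as 'My_Alias' would never be found; only the lowercase, stripped key 'myalias' matches.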

Q: How do I add a new single-byte encoding?
A: See encodings/sbcs-data.js for an example of 'maccenteuro' encoding.

Q: How do I add a new multi-byte encoding?
A: See generation/gen-dbcs.js and encodings/dbcs-data.js for how it's done. Just add the sources for your encoding there. The current multi-byte codec is very versatile and should be enough for most encodings.

Q: What is the format of tables (encodings/tables/*)?
A: It is a JSON array of chunks. Each chunk represents a contiguous mapping from multi-byte codes to Unicode. The first element of a chunk is a hexadecimal 'address': the multi-byte code that corresponds to the start of the chunk. After that comes a mix of strings and integers. A string holds the Unicode characters that correspond to sequential multi-byte codes. An integer is the length of a run of incrementing Unicode characters that continues from the last character of the previous string, à la run-length encoding (RLE).
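To make the chunk format concrete, here is a minimal sketch of how one chunk could be expanded into a code-to-character mapping. It follows only the description above and is not the library's decoder: expandChunk() and the sample chunk are hypothetical, and any special handling the real tables may need is ignored.

```javascript
// Expand one table chunk into a Map from multi-byte code to character.
function expandChunk(chunk) {
  let code = parseInt(chunk[0], 16); // hexadecimal start address
  let lastCodePoint = null;          // last Unicode code point emitted
  const mapping = new Map();

  for (const part of chunk.slice(1)) {
    if (typeof part === 'string') {
      // Each character maps to the next sequential multi-byte code.
      for (const ch of part) {
        lastCodePoint = ch.codePointAt(0);
        mapping.set(code++, ch);
      }
    } else if (typeof part === 'number') {
      // A run: `part` more characters, each one code point after the last.
      if (lastCodePoint === null) {
        throw new Error('A run must follow at least one character');
      }
      for (let i = 0; i < part; i++) {
        lastCodePoint += 1;
        mapping.set(code++, String.fromCodePoint(lastCodePoint));
      }
    }
  }
  return mapping;
}

// Made-up chunk: "AB", then a run of 3 continuing after "B", then "x".
const mapping = expandChunk(['8140', 'AB', 3, 'x']);
console.log(mapping.get(0x8140)); // A
console.log(mapping.get(0x8142)); // C
console.log(mapping.get(0x8144)); // E
console.log(mapping.get(0x8145)); // x
```

The run of 3 stands in for the string "CDE", which is where the compactness of the format comes from when long stretches of the table are sequential.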

Q: Why was this format chosen?
A: It's visual, so you can easily check that the table is correct. It's also quite compact and easy to work with, as it's just JSON.

Q: How do I add a completely new encoding that is not reducible to multi-byte (a stateful one, for example)?
A: You'll need to write a codec for it. Please look at the examples in encodings/internal.js, encodings/sbcs-codec.js and encodings/dbcs-codec.js. Don't forget to write tests.

Q: What directories are necessary for this module to work?
A: Please look at .npmignore for directories that can be ignored. All others are necessary.
