Encoding conversion #5

Closed
miloyip opened this issue Jun 6, 2014 · 1 comment
Comments

@miloyip
Collaborator

miloyip commented Jun 6, 2014

From milo...@gmail.com on November 27, 2011 00:33:27

Currently, the input and output of Reader use the same encoding.

It is often necessary to read a stream in one encoding (e.g. UTF-8) and output strings in another (e.g. UTF-16), or conversely to stringify a DOM held in one encoding (e.g. UTF-16) to an output stream in another (e.g. UTF-8).

The simplest solution is to convert the whole stream into a memory buffer in the other encoding. This requires extra memory and extra memory accesses.

Another solution is to convert the input stream into the other encoding before passing it to the parser. However, only the characters inside JSON strings actually need to be converted; converting the other characters just wastes time.

The third solution is to let the parser distinguish between the input and output encodings and use an encoding converter only for the characters of JSON strings. However, since the converted output may be longer than the original input, in situ parsing cannot be permitted in that case.
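To make the third option concrete, here is a minimal sketch of transcoding a string from UTF-8 to UTF-16 one code point at a time. The names DecodeUtf8, EncodeUtf16 and TranscodeUtf8ToUtf16 are made up for illustration and are not RapidJSON's API. The sketch assumes well-formed UTF-8 input and omits validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes one code point from well-formed UTF-8
// starting at s[i] and advances i past it. No validation is done.
inline std::uint32_t DecodeUtf8(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                         // 1-byte sequence (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    std::uint32_t cp = c & (0x3F >> extra);         // payload bits of the lead byte
    while (extra-- > 0)                             // 6 payload bits per trail byte
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: appends one code point as UTF-16 code units,
// using a surrogate pair above the Basic Multilingual Plane.
inline void EncodeUtf16(std::uint32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Decode with the source encoding, encode with the target encoding.
inline std::u16string TranscodeUtf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();)
        EncodeUtf16(DecodeUtf8(in, i), out);
    return out;
}
```

An ASCII string such as "abc" occupies 3 bytes in UTF-8 but 3 code units (6 bytes) in UTF-16, which shows why the converted string cannot always be written back over the input buffer.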

Try to design a mechanism that generalizes encoding conversion. It should support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. It can also support automatic encoding detection via the BOM, at the cost of some dynamic-dispatch overhead.
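One way to picture the dynamic-dispatch overhead is a table of encoder functions indexed by an encoding chosen at runtime. The sketch below is only an illustration under that assumption: the Encoding enum, the Put* functions and the kEncoders table are hypothetical names, not the mechanism RapidJSON adopted.

```cpp
#include <cstdint>
#include <string>

// Hypothetical runtime tag for the five encodings named above.
enum Encoding { kUtf8 = 0, kUtf16LE, kUtf16BE, kUtf32LE, kUtf32BE };

using EncodeFn = void (*)(std::uint32_t cp, std::string& out);

// Appends one code point as 1 to 4 UTF-8 bytes.
void PutUtf8(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Appends one 16-bit code unit in the requested byte order.
template <bool BigEndian>
void PutUnit16(std::uint16_t u, std::string& out) {
    char hi = static_cast<char>(u >> 8);
    char lo = static_cast<char>(u & 0xFF);
    if (BigEndian) { out += hi; out += lo; } else { out += lo; out += hi; }
}

// Appends one code point as UTF-16, with a surrogate pair if needed.
template <bool BigEndian>
void PutUtf16(std::uint32_t cp, std::string& out) {
    if (cp < 0x10000) {
        PutUnit16<BigEndian>(static_cast<std::uint16_t>(cp), out);
        return;
    }
    cp -= 0x10000;
    PutUnit16<BigEndian>(static_cast<std::uint16_t>(0xD800 + (cp >> 10)), out);
    PutUnit16<BigEndian>(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)), out);
}

// Appends one code point as four UTF-32 bytes in the requested byte order.
template <bool BigEndian>
void PutUtf32(std::uint32_t cp, std::string& out) {
    for (int i = 0; i < 4; ++i) {
        int shift = BigEndian ? 24 - 8 * i : 8 * i;
        out += static_cast<char>((cp >> shift) & 0xFF);
    }
}

// Table order must match the Encoding enum.
const EncodeFn kEncoders[] = {
    PutUtf8, PutUtf16<false>, PutUtf16<true>, PutUtf32<false>, PutUtf32<true>
};

// Every call pays one indirect call; this is the dispatch overhead.
void Put(Encoding e, std::uint32_t cp, std::string& out) {
    kEncoders[e](cp, out);
}
```

When the encoding is known at compile time, a caller could invoke PutUtf8 or PutUtf16<false> directly and skip the indirect call, so only the auto-detecting path would pay for the table lookup.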

Original issue: http://code.google.com/p/rapidjson/issues/detail?id=4

@miloyip
Collaborator Author

miloyip commented Jun 6, 2014

From milo...@gmail.com on December 02, 2011 20:43:44

Reader/Writer can now perform transcoding with Transcoder.
New EncodedInputStream decodes characters from a byte input stream.
New EncodedOutputStream encodes characters to a byte output stream.
New AutoUTFInputStream can be given a UTF encoding at runtime, or can detect the UTF encoding from the beginning of the stream (BOM and RFC 4627). It then dynamically delegates operations to the actual UTF encoding.
New AutoUTFOutputStream can be given a UTF encoding at runtime and optionally writes a BOM.
New AutoUTF performs operations according to the UTF encoding type of the input/output stream.
All AutoXXX classes can handle UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
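The detection step mentioned for AutoUTFInputStream can be sketched roughly as follows. This is not the class's actual code: UtfType, Detection and DetectUtf are hypothetical names. The sketch checks for a BOM first and otherwise falls back to the RFC 4627 section 3 heuristic, which relies on the first two characters of a JSON text being ASCII and inspects the pattern of zero bytes in the first four octets.

```cpp
#include <cstddef>

// Hypothetical types for illustration only.
enum class UtfType { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct Detection {
    UtfType type;           // detected encoding
    std::size_t bomLength;  // number of BOM bytes to skip (0 if no BOM)
};

inline Detection DetectUtf(const unsigned char* p, std::size_t n) {
    // 1. Byte order marks. UTF-32LE (FF FE 00 00) must be tested before
    //    UTF-16LE (FF FE) because it starts with the same two bytes.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {UtfType::UTF32LE, 4};
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {UtfType::UTF32BE, 4};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {UtfType::UTF8, 3};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {UtfType::UTF16LE, 2};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {UtfType::UTF16BE, 2};

    // 2. No BOM: RFC 4627 zero-byte patterns in the first four octets.
    //    00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
    //    00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
    if (n >= 4) {
        bool z0 = p[0] == 0, z1 = p[1] == 0, z2 = p[2] == 0, z3 = p[3] == 0;
        if (z0 && z1 && z2 && !z3)   return {UtfType::UTF32BE, 0};
        if (z0 && !z1 && z2 && !z3)  return {UtfType::UTF16BE, 0};
        if (!z0 && z1 && z2 && z3)   return {UtfType::UTF32LE, 0};
        if (!z0 && z1 && !z2 && z3)  return {UtfType::UTF16LE, 0};
    }

    // 3. Otherwise assume UTF-8.
    return {UtfType::UTF8, 0};
}
```

A caller would skip bomLength bytes and then route all further reads through the decoder for the detected type, which is where the dynamic delegation described above comes in.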

Status: Fixed

@miloyip miloyip closed this as completed Jun 6, 2014
@xinthose xinthose mentioned this issue Aug 19, 2015