I would like to propose the addition of a new binary string type BString to stdlib. Like String, it would be a subtype of AbstractString, but unlike String, it can only hold sequences of Unicode characters in the range U+0000 to U+00FF. In memory, BString encodes each character as one byte (like ISO 8859-1). BString would have the same in-memory layout as String such that encoding a String value into UTF-8 and returning it as a BString byte sequence would be a no-op.
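To make the idea concrete, here is a minimal sketch of what such a type could look like in current Julia. It is not the proposed stdlib implementation: the wrapper struct and its field name `data` are assumptions made for illustration, and a real `BString` would share the `String` layout directly rather than wrap it. Only the four core `AbstractString` interface methods are defined.

```julia
# Hypothetical sketch: BString reuses the bytes of a String unchanged.
struct BString <: AbstractString
    data::String   # illustrative field name; holds the original bytes
end

# One code unit is one byte, and one byte is one character (U+0000..U+00FF).
Base.ncodeunits(s::BString) = ncodeunits(s.data)
Base.codeunit(::BString) = UInt8
Base.codeunit(s::BString, i::Int) = codeunit(s.data, i)
Base.isvalid(s::BString, i::Int) = 1 <= i <= ncodeunits(s)

# Iteration maps each byte to the Char with the same code point,
# so no UTF-8 decoding takes place.
function Base.iterate(s::BString, i::Int=1)
    i > ncodeunits(s) && return nothing
    return (Char(codeunit(s.data, i)), i + 1)
end

s = "naïve"            # 5 characters, 6 bytes in UTF-8
b = BString(s)         # the bytes themselves are not copied or re-encoded
length(s), length(b)   # (5, 6)
```

With only these methods defined, generic `AbstractString` fallbacks such as `length` and `collect` already treat every byte of `b` as its own character.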
The name BString could be ambiguously interpreted as "binary string", "byte string" or "basic Latin string", because this type has multiple functions and advantages over String:
- It would be well suited to process arbitrary binary data (as character = byte) using all the convenient string-processing and I/O functions available for AbstractString, but without a UTF-8 decoder always running in the background.
- It would also be well suited for processing text data where only ASCII characters are of interest (even if the text data is UTF-8 encoded!).
A particular advantage of BString over Vector{UInt8} is that BString has the exact same in-memory representation as String and therefore conversion between String and BString would be a no-op in all situations where the algorithm does not care about non-ASCII characters.
Imagine, for example, that you write a parser, such as for a CSV file, which only cares about ASCII metacharacters (in the case of a CSV file: commas, quotes and linefeeds). Such an algorithm does not care in the least about the UTF-8 character encoding. Everything other than the metacharacters is just byte sequences, whether in ISO 8859, UTF-8 or EUC-JP, that get passed on as such. However, with String there is essentially a UTF-8 decoder running in the background all the time, often completely unnecessarily, if only ASCII characters are of interest. By reinterpreting a String variable as a BString variable, a programmer can essentially tell Julia: I'm not interested here in UTF-8 decoding any Unicode characters at all, either because I only look for ASCII characters, or because this is really arbitrary binary data (e.g., a JPEG header), and I merely want to use the string parsing functionality that comes with AbstractString. BString would do exactly that.
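For comparison, this is roughly how such an encoding-agnostic scan has to be written today, by dropping down to the byte view returned by `codeunits`. The function name `split_csv_bytes` is made up for this illustration, and the quote handling is deliberately simplistic (it only toggles on each quote byte); it is a sketch of the byte-level approach, not a complete CSV parser.

```julia
# Split one CSV record on unquoted commas, looking only at ASCII bytes.
# Field contents are passed through untouched, whatever their encoding.
function split_csv_bytes(line::AbstractString)
    bytes = codeunits(line)            # byte view, no UTF-8 decoding
    fields = String[]
    start = 1                          # byte index where the current field begins
    inquote = false
    for (i, b) in enumerate(bytes)
        if b == UInt8('"')
            inquote = !inquote         # toggle on every quote byte
        elseif b == UInt8(',') && !inquote
            push!(fields, String(bytes[start:i-1]))
            start = i + 1
        end
    end
    push!(fields, String(bytes[start:end]))   # last field
    return fields
end

split_csv_bytes("a,\"b,c\",é")   # ["a", "\"b,c\"", "é"]
```

Note that the loop works on byte indices and builds each field from a byte slice rather than from `line[start:i-1]`, because string indexing with a byte index that falls in the middle of a multi-byte character throws an error. A BString would let the same loop be written directly with string functions.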
Offering the programmer both a UTF-8 (String) and a binary (BString) variant of the AbstractString library would essentially be doing exactly what Perl does (where each string has a built-in UTF-8 flag that says whether each element of the string sequence is a byte or a Unicode character). In fact, the dynamic Perl string type with UTF-8 flag would then be identical to Union{String,BString} in Julia. This has worked extremely well since Perl 5.8, and binary string processing (having the full String API available for binary data) is something that I very much miss in Julia.
The main reason why BString should go into stdlib, and not into a package, is very simple: to keep the methods available for String and BString exactly aligned. I would therefore like to implement each BString method one line below the corresponding implementation of the String method (i.e., in the same file!), such that when future extensions to the String API are made, BString is updated as well. Also, I feel that providing a binary string type is an extremely basic and elementary function that should be part of the standard library.
Vector{UInt8} has a completely different memory layout and function API from String (motivated by the mutable MATLAB-like matrix type, with dimensions, etc.) and is therefore no replacement for BString. BString would naturally offer regular expressions, number formatting, substring searching, and lots of other string-processing and I/O functionality that Vector{UInt8} does not.
Julia has already decided to make String a quite different data type from a Vector of characters, and therefore we need a binary, non-UTF-8 version of String as well.