Strs.jl represents an attempt to give Julia better string handling than is possible with Base.
I am now trying to make sure that all of the functionality in `String` and `Char` is implemented for `Str` and `Chr`, and to start optimizing the functions (although they are already substantially faster).
Strs.jl is now a container for a number of different packages from JuliaString.org.
|Package||Description|
|ModuleInterfaceTools||Tools to create a common API for all of these packages|
|StrAPI||Common API for string/character functionality|
|CharSetEncodings||Basic types/support for Character Sets, Encodings, and Character Set Encodings|
|MurmurHash3||Pure Julia implementation of MurmurHash3|
|Format||Python/C style formatting (based on Formatting)|
|StrLiterals||Extensible string literal support|
|StrFormat||Formatting extensions for literals|
|StrTables||Low-level support for entity tables|
|HTML_Entities||HTML character sequences|
|Emoji_Entities||Emoji names (including composite ones)|
|LaTeX_Entities||Julia LaTeX character names|
|Unicode_Entities||Unicode standard character names|
|StrEntities||Entity extensions for literals|
|InternedStrings||Save space by interning strings (by @oxinabox!)|
The new package ModuleInterfaceTools is used to set up a consistent and easy-to-use API for most of the cooperating packages, without having to worry too much about imports, exports, `using`, and which functions are part of the public API versus the internal development API that other packages extend.
I would very much appreciate any constructive criticism, help implementing some of the ideas, ideas on how to make it perform better, bikeshedding on names and API, etc. Also, I'd love contributions of benchmark code and/or samples for different use cases of strings, or pointers to such (such as a way to get lots of tweets, to test mixed text and emojis, for example).
Architecture and Operations
The general philosophy of the architecture is as follows: have a single easy-to-use type that can replace `String` and that conforms to the recommendations of the Unicode Organization (it internally uses 4 types, is currently implemented as a Julia Union, and has O(1) indexing to characters, not just code units). In addition, there are types to represent binary strings and raw unvalidated strings (made up of 1, 2, or 4 byte codepoints), as well as types for validated ASCII, Latin1, UCS2 (16-bit, BMP [Basic Multilingual Plane]), UTF-8, UTF-16, and UTF-32 encoded strings.
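To make the distinction between code unit indexing and character indexing concrete, here is a minimal sketch that uses only Base Julia (no types from this package). The `Vector{Char}` merely stands in for a fixed-width representation; it is an illustration, not how `Str` is stored.

```julia
s = "aéb"                        # 'é' occupies two UTF-8 code units

# Base.String is indexed by code unit offset, so not every integer is a valid index.
valid = collect(eachindex(s))    # indices at which a character starts
middle_ok = isvalid(s, 3)        # index 3 falls inside 'é'

# A fixed-width layout (one element per character) allows O(1) character indexing.
fixed = collect(s)               # Vector{Char}, one entry per character
third = fixed[3]                 # the third character, found without scanning
```

With a variable-width encoding, finding the n-th character requires a scan from the start; a fixed-width layout trades memory for direct access.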
Optimizations for multi code unit encodings such as UTF-8 & UTF-16 will be moved to `StrUTF16` packages (before splitting them out, I'll make sure that the functionality still works, only with slower generic methods, so that you only take up the extra space if you need the faster speed).
Extensions such as converting to and from non-Unicode encodings, such as Windows CP-1252 or China's official character set, GB18030, will be done in another package.
Since things like `CodeUnits`, which I originally implemented for this package, have now been added to `Base` (in master), I have moved support for those to a file to provide them for v0.6.2.
There is a `codepoints` iterator, which iterates over the unsigned integer codepoints in a string (for strings with the `CodeUnitSingle` trait, it is basically the same as iterating over the code units).
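The relationship between codepoints and code units can be sketched with Base Julia alone. The helper `codepoints_sketch` below is a hypothetical stand-in written for this illustration; it is not the package's `codepoints` implementation.

```julia
# Hypothetical helper: lazily yields each character's codepoint as a UInt32.
codepoints_sketch(s::AbstractString) = (UInt32(c) for c in s)

# For pure ASCII text every character is a single code unit,
# so codepoints and code units coincide.
plain = "abc"
plain_cps = collect(codepoints_sketch(plain))
same_for_ascii = plain_cps == codeunits(plain)

# For UTF-8 text with multi-byte characters they differ.
mixed = "aé"
mixed_cps = collect(codepoints_sketch(mixed))   # one entry per character
mixed_units = collect(codeunits(mixed))         # one entry per byte
```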
Also in the works is using the new ability to add properties, in order to allow for different "views" of a string, no matter how it is stored internally, for example a `mystring.utf8` view, or a `mystring.utf16` view (that can use the internal cached copy if available, as an optimization).
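As a rough sketch of how such property-based views could look, the following overloads `Base.getproperty` on a hypothetical wrapper type `ViewDemo`. The type, its field, and the absence of caching are all assumptions made for illustration; the package's eventual design may differ.

```julia
# Hypothetical wrapper type; not part of Strs.jl.
struct ViewDemo
    data::String
end

function Base.getproperty(s::ViewDemo, name::Symbol)
    # Use getfield for the real field to avoid recursing into getproperty.
    raw = getfield(s, :data)
    if name === :utf8
        return codeunits(raw)             # UTF-8 code units, no copy
    elseif name === :utf16
        return transcode(UInt16, raw)     # freshly converted UTF-16 code units
    else
        return getfield(s, name)
    end
end

mystring = ViewDemo("hé")
u8  = mystring.utf8
u16 = mystring.utf16
```

A real implementation would presumably cache the converted copy rather than transcoding on every access.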
Currently, there are the following types:
- `Str`, which is the general, parameterized string type, and `Chr`, the general, parameterized character type.
- `BinaryStr` for storing non-textual data as a sequence of bytes.
- `ASCIIStr` an ASCII string.
- `LatinStr` a string using the Latin1 subset of Unicode.
- `UCS2Str` a string composed of characters (`UCS2Chr`s) only in the Unicode BMP, stored as 2 byte code units (that each store a single codepoint).
- `UTF32Str` a string with only valid Unicode characters, 0-0xd7ff, 0xe000-0x10ffff, stored as 4 byte code units.
- `UTF8Str` a string with only valid Unicode characters, the same as `UTF32Str`, however encoded using UTF-8, conforming to the Unicode Organization's standard, i.e. no long encodings, surrogates, or invalid bytes.
- `UTF16Str` a string similar to `UTF8Str`, encoded via UTF-16, also conforming to the Unicode standard, i.e. no out of order or isolated surrogates.
- `Text1Str` a text string that may contain any sequence of bytes.
- `Text2Str` a text string that may contain any sequence of 16-bit words.
- `Text4Str` a text string that may contain any sequence of 32-bit words.
- `UniStr` a Union type, that can be one of the following 4 types, `ASCIIStr`, and 3 internal types:
  - `_LatinStr` a byte string that must contain at least one character > 0x7f
  - `_UCS2Str` a word string that must contain at least one character > 0xff
  - `_UTF32Str` a 32-bit word string that must contain at least one character > 0xffff
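The thresholds listed for the four members of `UniStr` amount to picking the narrowest code unit width that can hold the largest codepoint in the string. The function below is a hypothetical sketch of that selection rule in Base Julia; it returns the type name as a `Symbol` for illustration only and is not the package's actual constructor logic.

```julia
# Hypothetical helper: name the narrowest UniStr member that could hold `s`.
function narrowest_kind(s::AbstractString)
    maxcp = UInt32(0)
    for c in s
        maxcp = max(maxcp, UInt32(c))   # track the largest codepoint seen
    end
    if maxcp <= 0x7f
        return :ASCIIStr      # every character fits in 7 bits
    elseif maxcp <= 0xff
        return :_LatinStr     # at least one character > 0x7f
    elseif maxcp <= 0xffff
        return :_UCS2Str      # at least one character > 0xff
    else
        return :_UTF32Str     # at least one character > 0xffff
    end
end

kinds = [narrowest_kind(x) for x in ("abc", "café", "Ωmega", "😀")]
```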
The only real difference in handling `_LatinStr` is that uppercasing the characters 'µ' (Unicode U+00b5, category Ll: Letter, lowercase) and 'ÿ' (Unicode U+00ff, category Ll: Letter, lowercase) produces the BMP characters 'Μ' (Unicode U+039c, category Lu: Letter, uppercase) and 'Ÿ' (Unicode U+0178, category Lu: Letter, uppercase) respectively in `_LatinStr`, because it is just an optimization for storing the full Unicode character set, not the ANSI/ISO 8859-1 character set that is used for the first 256 code points of Unicode.
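The two case mappings can be checked with Base Julia's `uppercase` on `Char` (this sketch does not use the package's `Chr` types). The helper `fits_in_byte` is a hypothetical name introduced only for this example.

```julia
micro  = '\u00b5'    # 'µ', micro sign
y_diae = '\u00ff'    # 'ÿ', y with diaeresis

micro_upper  = uppercase(micro)
y_diae_upper = uppercase(y_diae)

# Hypothetical helper: can the character be stored in a single Latin-1 byte?
fits_in_byte(c::Char) = UInt32(c) <= 0xff

# Both lowercase characters fit in one byte, but their uppercase forms do not,
# so an uppercased string may need a wider representation.
widened = fits_in_byte(micro) && !fits_in_byte(micro_upper)
```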
Those three internal types should never be used directly, as indicated by the leading `_` in the name.
For all of the built-in `Str` types, there is a corresponding built-in character set encoding.
There are also a number of similar built-in character sets defined (`*CharSet`), and character types. The `cse` function returns the character set encoding for a string type or a string, `charset` returns the character set, and `encoding` returns the encoding.
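One way such type-level queries can be written in Julia is with type parameters and dispatch on `Type{...}`. Everything named `Demo*` below is hypothetical and defined only for this sketch; the function names `cse`, `charset`, and `encoding` are taken from the text, but their bodies here are assumptions rather than the package's code.

```julia
# Hypothetical character sets and encodings.
struct DemoUnicodeCharSet end
struct DemoUTF8Encoding end

# A character set encoding pairs a character set with an encoding.
struct DemoCSE{CS, E} end

# A string type parameterized by its character set encoding.
struct DemoStr{C}
    data::Vector{UInt8}
end

# Queries on the type, resolved by dispatch without touching the data.
cse(::Type{DemoStr{C}}) where {C} = C
charset(::Type{DemoCSE{CS, E}}) where {CS, E} = CS
encoding(::Type{DemoCSE{CS, E}}) where {CS, E} = E

# Queries on an instance forward to the type-level methods.
cse(s::DemoStr) = cse(typeof(s))
charset(s::DemoStr) = charset(cse(s))
encoding(s::DemoStr) = encoding(cse(s))

demo = DemoStr{DemoCSE{DemoUnicodeCharSet, DemoUTF8Encoding}}(collect(codeunits("héllo")))
```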
There is a new API that I am working on for indexing and searching (however, there is a problem on v0.7 because the deprecation for `find` is overbroad and causes a form of type piracy, deprecating methods for types not in Base).
Also, there are more readable function names that always separate words with `_`, and avoid hard to understand abbreviations.
In addition, I've added
Nobody is an island, and to achieve great things, one must stand on the shoulders of giants.
I would like to thank some of those giants in particular:
Tom Breloff, for showing how an ecosystem could be created in Julia, i.e. "Build it, and they will come", for providing some nice code in this PR (which I shamelessly pirated in order to create Format), and for good advice at JuliaCon.
Chris Rackauckas, simply a star in Julia now: great guy, great advice, and great blogs about stuff that's usually way over my head. Julia is incredibly lucky to have him.
Jacob Quinn, for collaborating and discussions early on in Strings on ideas for better string support in Julia, as well as a lot of hard work on things dear to me, such as databases and importing/exporting data (SQLite, ODBC, CSV, WeakRefStrings, DataStreams, Feather, JSON2).
Tim Holy, for the famous "Holy" Trait Trick, which I use extensively in the Str* packages, and for the work along with Matt Bauman on making Julia arrays general and extensible while still performing well, and hence very useful in my work.
Tony Kelman, for very thorough reviews of my PRs. I learned a great deal from his (and other Julians') comments, including that I didn't have to code in C anymore to get the performance I desired.
Bogumił Kamiński, who has been doing a great job testing and reviewing `Strs` (as well as doing the same for the string/character support in Julia Base), as well as giving input into the design. (Also very glad to have co-opted him to become a member of the org.)
Last but not least, Julia mathematical artist (and blogger!) extraordinaire, Cormullion, creator of our wonderful logo!
Kudos to all of the other contributors to Julia and the ever expanding Julia ecosystem!