RFC: Roadmap for improving string support in Julia #11558

ScottPJones · 2015-06-03T10:51:10Z

I would like comments on what I'd like to see happen for string support in Julia:
constructive criticism, advice on how to achieve the goals, issues that I haven't noticed yet, the relative priority of the goals, and so on.

Make all string conversion methods be consistent. Current problems are:

Some methods validate input, some don't
Some handle replacements for bad input, others only throw errors
Correct handling of the Unicode standard (with PR #11551, "Fix #10959 bugs with UTF-16 conversions", the issues I've seen for UTF-16 and UTF-32 will be fixed, but UTF-8 would still need to be handled)
Better error reporting (again, PR #11551 would handle this for UTF-16/32, but it would need to be done for UTF-8/ASCII as well)
All methods should produce only 100% valid Unicode encodings (at least by default; other options could be added as needed, but hopefully that would not be necessary), instead of passing on bad input, which can also be a security hole. (PR #11551 partially solves this for UTF-16/32, but UTF-8/ASCII still need to be handled, as do some methods that currently do no checking at all.)
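To make the "consistent conversion" goal concrete, here is a minimal sketch in Python (not Julia, and not any existing Julia API). The name `decode_utf8` and its `on_invalid` parameter are hypothetical; the point is only that every conversion validates its input and that the caller picks one of a small, uniform set of behaviors for bad input.

```python
def decode_utf8(data: bytes, on_invalid: str = "error") -> str:
    """Decode UTF-8 with an explicit, uniform policy for invalid input.

    `on_invalid` is a hypothetical parameter name used for illustration:
      "error"   -> raise UnicodeDecodeError (its .start attribute gives the
                   byte offset of the first bad sequence)
      "replace" -> substitute U+FFFD for each invalid sequence
    """
    if on_invalid == "error":
        return data.decode("utf-8", errors="strict")
    if on_invalid == "replace":
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"unknown on_invalid policy: {on_invalid!r}")


# Valid input decodes the same way under either policy.
print(decode_utf8(b"caf\xc3\xa9"))

# Invalid input is either reported with its position or replaced,
# but never passed through unchanged.
try:
    decode_utf8(b"a\xffb")
except UnicodeDecodeError as exc:
    print("invalid byte at offset", exc.start)
print(decode_utf8(b"a\xffb", on_invalid="replace"))
```

Either way the result is always a validly encoded string, which is the property the list above asks for.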

Fix issues I have found related to mapping, upper/lower case and type stability.
See: "uppercase/lowercase on a UTF16String returns a UTF8String" (#11460); "Bugs in Unicode handling with UTF8String" (#11463); "map on an AbstractString, if no more specific map method found, returns UTF8String always" (#11464); "convert function for UTF-16 from an AbstractArray{UInt8} problems" (#11501); "Problems with convert from AbstractArray{UInt8} to UTF32String" (#11502).
Make Julia no longer dependent on utf8proc. See "RFC: rewrite functions used from utf8proc.c in Julia" (#11315), "1.3-dev1 fails to build on 64-bit" (JuliaStrings/utf8proc#42), and the comments in "uppercase/lowercase functions are not portable?" (#11471) to see what a mess this dependence causes, even for one of the top contributors to Julia.
This could have a number of benefits beyond improved performance, including being able to rebuild and distribute updated Unicode data files without having to get a new version of Julia.
Improve string conversion performance (this just needs PR #11551, "Fix #10959 bugs with UTF-16 conversions", merged in).
Test (as a package at first) having always validated String and Char types, to see how that could affect performance (and reliability of code).

... I feel there's a lot more ... feel free to add to the list!


elextr · 2015-06-03T12:24:23Z

For what it's worth, some thoughts from a similar set of questions about UTF-8 handling from another project.

Always check encodings at the boundary, i.e. reads and external functions (C code can't be trusted). Internal functions are intended to produce only valid strings, so checking internally is a compile-time debug option.
What to do when a bad encoding is found is an issue:

a) Error returns give the code a chance to handle the problem, but checking them can be forgotten.

b) Exceptions require every operation that checks encodings to be wrapped in try, or they become a DoS vector, and that can be forgotten just like checking error returns.

c) Substitutions may give a valid encoding, but the resulting string can then go on to cause problems elsewhere because of the substitutions.

Net result: there is no "one size fits all" answer, so the library has to let the programmer choose how to handle it. That makes the library more complex, which is even more of an argument for checking only in the functions that operate at the boundaries and not in every internal operation.
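The three options (a), (b) and (c) can be sketched as a single boundary function whose caller chooses the behavior. This is an illustration in Python, not code from the project mentioned above; `OnInvalid` and `read_text` are hypothetical names.

```python
from enum import Enum
from typing import Optional, Tuple


class OnInvalid(Enum):
    RETURN_ERROR = "return_error"  # (a) caller must inspect the result
    RAISE = "raise"                # (b) caller must catch the exception
    SUBSTITUTE = "substitute"      # (c) bad sequences become U+FFFD


def read_text(data: bytes, policy: OnInvalid) -> Tuple[Optional[str], Optional[int]]:
    """Validate bytes arriving at a boundary (a read, a C call, ...).

    Returns (text, error_offset). error_offset is None when the input was
    valid UTF-8; otherwise it is the byte offset of the first bad sequence.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        if policy is OnInvalid.RAISE:
            raise
        if policy is OnInvalid.RETURN_ERROR:
            return None, exc.start
        # SUBSTITUTE: always yields a valid string, but still reports where
        # the first substitution happened so the caller can log it.
        return data.decode("utf-8", errors="replace"), exc.start


text, err = read_text(b"ok\xff", OnInvalid.RETURN_ERROR)
if err is not None:
    print("bad encoding at byte", err)
```

Code that only ever receives strings from `read_text` needs no further checks, which is the "check only at the boundaries" argument in practice.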

peter1000 · 2015-06-03T12:38:03Z

Faster Julia string search: using strstr is about 10 times faster, depending on the input, on my Arch Linux. https://github.com/peter1000/faster_julia_str_search

issue

ScottPJones · 2015-06-03T13:19:21Z

@tkelman What does the "julep" label mean?

@elextr To me, the "convert" and utfxx() functions are boundaries.
The UTF8String, UTF16String, UTF32String constructors don't do checking.
A read would use the convert function, possibly with (as both of us have said) options to support giving either an error, or use a default or supplied replacement, and options to handle different variants of Unicode encoding (Modified UTF-8 that Java uses, or the CESU-8 that Oracle and MySQL use...).
I don't think that we should ever produce an invalidly encoded string, no matter what we accept on conversion. Then "isvalid" for those string types would simply always return true.
People should be able to also use Vector{UInt8}, Vector{UInt16}, and Vector{UInt32} (not Vector{Char}!) to store and manipulate strings that aren't guaranteed to be valid.
At least, IMO, that is what we should eventually get to.
This would also mean, for example, if we are loading from a database where we know that the string data was already validated, we could save a lot of time by skipping the validation step... (as @mbauman's nice little test showed, that can be a substantial gain!)
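As an illustration of accepting an encoding variant on input while only ever producing valid output, here is a small Python sketch (not Julia, and not a proposed API). The function name `cesu8_to_str` and its `modified_nul` flag are hypothetical. It accepts CESU-8, where a supplementary character is stored as two 3-byte surrogate encodings, and optionally the Modified UTF-8 convention of writing NUL as `C0 80`.

```python
def cesu8_to_str(data: bytes, modified_nul: bool = False) -> str:
    """Decode CESU-8 (optionally with Modified UTF-8's C0 80 for NUL).

    Raises UnicodeDecodeError if the input contains an unpaired surrogate
    or is otherwise malformed, so the result is always valid Unicode.
    """
    if modified_nul:
        # 0xC0 never occurs in standard UTF-8 or CESU-8, so this replacement
        # cannot touch part of another sequence.
        data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the 3-byte surrogate encodings through as lone
    # surrogate code points.
    text = data.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 so each high/low surrogate pair is merged
    # into one code point; the strict decode rejects unpaired surrogates.
    return text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


# U+1F600 in CESU-8 is the pair D83D DE00, each encoded in three bytes.
s = cesu8_to_str(b"\xed\xa0\xbd\xed\xb8\x80")
print(s.encode("utf-8"))  # standard 4-byte UTF-8 on output
```

Ordinary UTF-8 input also passes through this function unchanged, so it is a permissive reader with a strict writer.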

@peter1000 I was just pointed at the whole mess of search/rsearch/searchindex/rsearchindex this weekend by @tkelman (I will have to get back at him somehow, maybe at JuliaCon 😀)
Do you have some nice string search algorithms you'd like to try out? I'd love to play around with trying out KMP, BM, BMH, BMHR, etc. algorithms in Julia... might be a good thing for teaching, to have well written examples of each of the best search algorithms implemented in Julia, because Julia performs so well, and you can play around with graphing the results easily, looking at the LLVM or native code generated... etc.
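For reference, here is a short sketch of one of the algorithms named above, Boyer-Moore-Horspool (BMH), written in Python rather than Julia purely as an illustration; `horspool_find` is a hypothetical name. It searches bytes, which is also safe for valid UTF-8 because the encoding is self-synchronizing: a valid needle can only match at a character boundary of a valid haystack.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Shift table: for each byte in needle[:-1], the distance from its
    # rightmost occurrence to the end of the needle. Bytes that do not
    # appear there allow a full shift of m.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the haystack byte aligned with the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


print(horspool_find(b"hello world", b"world"))
```

The skip table is what lets the search jump over several positions at once; comparing its timings against a naive scan and a platform `strstr` is the kind of experiment discussed in this thread.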

Thanks everybody for joining in the (hopefully productive) discussion!

tkelman · 2015-06-03T13:20:30Z

julep = "julia enhancement proposal"

pao · 2015-06-03T13:21:20Z

Roughly an extremely informal version of a Python PEP, or Rust RFC. Eventual level of formality TBD.

peter1000 · 2015-06-03T13:27:22Z

> Do you have some nice string search algorithms you'd like to try out? I'd love to play around with trying out KMP, BM, BMH, BMHR, etc. algorithms in Julia

I had a look at this stringsearch

But in all tests, the Arch Linux default strstr was the fastest. I did not investigate further, though. Also, Julia might have improved in the meantime?

ScottPJones · 2015-06-03T13:57:02Z

@peter1000 Thanks for the pointer to @gquere's repository... that's precisely the sort of thing I'd like to see, implemented all in Julia, well documented, to show the various tradeoffs of the different algorithms.

@pao & @tkelman Thanks for the info... I just kept thinking about having a mint julep... I know about PEPs, should have figured it out from that!

elextr · 2015-06-03T22:23:39Z

@ScottPJones Another place where conversions clearly can be unchecked is output, such as writing to the database, since the internal string is correct. But unless this is the only program writing to it, I wouldn't trust the database for input.

@peter1000 strstr() is likely to be hardware assisted, so it's usually hard to beat for simple search.

ScottPJones · 2015-06-04T00:20:13Z

@elextr Since I will be writing both input and output to the database, I can definitely be sure of the input... (the database access is totally controlled...)

elextr · 2015-06-04T00:40:57Z

On 4 June 2015 at 10:20, Scott P. Jones wrote:

> @elextr Since I will be writing both input and output to the database, I can definitely be sure of the input... (the database access is totally controlled...)

Sure, you can in a specific application, but a general Julia library can't
know that; it has to be up to you to tell it.


ScottPJones · 2015-06-07T06:15:50Z

@elextr Isn't that precisely where an unsafe_convert method would be appropriate, for skipping the validity checks? I like Julia's convention of naming things unsafe_; it makes people take extra care when bypassing checks, even when the checks are unnecessary.
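A minimal sketch of that idea, in Python rather than Julia and with hypothetical names (`ValidUTF8`, `unsafe_from_trusted`): the default constructor validates, and a second, loudly named constructor skips the check for data the caller already knows to be valid.

```python
class ValidUTF8:
    """Bytes that are intended to always hold valid UTF-8."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes):
        # Checked path: raises UnicodeDecodeError on invalid input.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def unsafe_from_trusted(cls, data: bytes) -> "ValidUTF8":
        # Unchecked path: the caller asserts the data is already valid
        # (e.g. it came from a store that only this program writes to).
        obj = cls.__new__(cls)
        obj._data = bytes(data)
        return obj

    def __bytes__(self) -> bytes:
        return self._data


checked = ValidUTF8(b"caf\xc3\xa9")                        # validated
trusted = ValidUTF8.unsafe_from_trusted(b"caf\xc3\xa9")    # validation skipped
```

The library cannot know whether the data is trustworthy, so the decision is made explicit at the call site; passing bad bytes to the unchecked constructor silently breaks the invariant, which is exactly what the "unsafe" name warns about.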

KristofferC · 2017-05-26T19:05:01Z

This feels subsumed by #16107

ScottPJones mentioned this issue Jun 3, 2015

Fix #10959 bugs with UTF-16 conversions #11551

Merged

tkelman added kind:julep Julia Enhancement Proposal domain:unicode Related to unicode characters and encodings labels Jun 3, 2015

ScottPJones mentioned this issue Jul 29, 2015

Add new options for convert to allow/disallow accepting different input and invalids with replacement strings #12358

Closed

KristofferC closed this as completed May 26, 2017
