RFC: Roadmap for improving string support in Julia #11558

Closed
ScottPJones opened this issue Jun 3, 2015 · 12 comments
Labels
domain:unicode Related to unicode characters and encodings kind:julep Julia Enhancement Proposal

Comments

@ScottPJones
Contributor

I would like comments on what I would like to see happen for string support in Julia:
constructive criticism, advice on how to achieve the goals, issues that I haven't noticed yet, the relative priority of the goals, and so on.

  1. Make all string conversion methods be consistent. Current problems are:
  • Some methods validate input, some don't
  • Some handle replacements for bad input, others only throw errors
  • Correct handling of the Unicode standard (with #11551, "Fix #10959 bugs with UTF-16 conversions", the issues I've seen for UTF-16 and UTF-32 will be fixed, but UTF-8 would still need to be handled)
  • Better error reporting (again, #11551 would handle this for UTF-16/32, but it would need to be done for UTF-8/ASCII as well)
  • All methods should produce only 100% valid Unicode encodings (at least by default; other options could be added as needed, but hopefully that would not happen), instead of passing on bad input, which can also be a security hole. This is partially solved by #11551 for UTF-16/32, but UTF-8/ASCII still need to be handled, as well as some methods that currently do no checking at all.
  2. Fix issues I have found related to mapping, upper/lower case and type stability:
  • #11460: uppercase/lowercase on a UTF16String returns a UTF8String
  • #11463: Bugs in Unicode handling with UTF8String
  • #11464: map on an AbstractString, if no more specific map method found, returns UTF8String always
  • #11501: convert function for UTF-16 from an AbstractArray{UInt8} problems
  • #11502: Problems with convert from AbstractArray{UInt8} to UTF32String

  3. Make Julia no longer dependent on utf8proc. See #11315 (RFC: rewrite functions used from utf8proc.c in Julia), JuliaStrings/utf8proc#42 (1.3-dev1 fails to build on 64-bit), and the comments in #11471 (uppercase/lowercase functions are not portable?) to see what a mess this dependence causes, even for one of the top contributors to Julia.
    This can have a number of benefits beyond improving performance, including being able to rebuild and distribute updated Unicode data files without having to get a new version of Julia.

  4. Improve string conversion performance (just needs #11551, "Fix #10959 bugs with UTF-16 conversions", merged in)

  5. Test (as a package at first) having always-validated String and Char types, to see how that could affect performance (and reliability of code).

... I feel there's a lot more ... feel free to add to the list!
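As a rough illustration of goal 1 (a conversion that validates its input, can either throw or substitute, and never emits an invalid encoding), here is a minimal sketch. It is written in Python rather than Julia so that it does not depend on the string types under discussion, and the function name `utf16le_to_utf8` and its `replace_invalid` flag are hypothetical, not part of any proposal in this thread.

```python
def utf16le_to_utf8(data: bytes, replace_invalid: bool = False) -> bytes:
    """Convert UTF-16LE bytes to UTF-8, validating on the way in.

    Hypothetical helper for illustration only.
    """
    # "strict" raises UnicodeDecodeError on unpaired surrogates or truncated
    # input; "replace" substitutes U+FFFD for the offending units instead.
    errors = "replace" if replace_invalid else "strict"
    text = data.decode("utf-16-le", errors=errors)
    # After a successful decode the text holds no lone surrogates, so the
    # strict UTF-8 encode below cannot produce an invalid byte sequence.
    return text.encode("utf-8")


good = "h\u00e9llo".encode("utf-16-le")
print(utf16le_to_utf8(good))

# A high surrogate (U+D800) followed by "A" instead of a low surrogate.
bad = b"\x00\xd8A\x00"
print(utf16le_to_utf8(bad, replace_invalid=True))
```

Either way the caller gets valid UTF-8 or an error, never the bad input passed through unchanged.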

@elextr

elextr commented Jun 3, 2015

For what it's worth, some thoughts from a similar set of questions about UTF-8 handling in another project.

  1. Always check encodings at the boundary, i.e. reads and external functions (can't trust C code). Internal functions are intended to produce only valid strings, so checking internally is a compile-time debug option.

  2. what to do when a bad encoding is located is an issue:

a) errors give the code a chance to handle the problem, but checking them can be forgotten.

b) exceptions require all operations checking encodings to be wrapped in try or they become a DoS vector, and that can be forgotten just like checking error returns

c) substitutions may give valid encoding, but the resulting string can then go on to cause problems elsewhere because of the substitutions

Net result: there is no "one size fits all" answer; the library has to let the programmer choose how to handle it. That makes the library more complex, and is even more of an argument for checking only in functions that operate at the boundaries and not on every internal operation.
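The three options (a), (b) and (c) can be sketched side by side. This is a minimal illustration in Python, not Julia, and the three function names are hypothetical; the point is only that the same bad input can be surfaced in three different ways and the caller picks one.

```python
from typing import Optional, Tuple


def decode_checked(data: bytes) -> Tuple[Optional[str], Optional[str]]:
    """(a) Error return: the caller must remember to inspect the second item."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, f"invalid UTF-8 at byte {exc.start}"


def decode_or_raise(data: bytes) -> str:
    """(b) Exception: an unhandled UnicodeDecodeError aborts the caller."""
    return data.decode("utf-8")


def decode_with_substitution(data: bytes) -> str:
    """(c) Substitution: always succeeds, but the text may now contain U+FFFD."""
    return data.decode("utf-8", errors="replace")


bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8
print(decode_checked(bad))
print(repr(decode_with_substitution(bad)))
```

Note that (c) returns a valid string whose content differs from what was sent, which is exactly the downstream risk described above.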

@peter1000

faster julia str search: using strstr is about 10 times faster, depending on the input, on my Arch Linux. https://github.com/peter1000/faster_julia_str_search

issue

@tkelman tkelman added kind:julep Julia Enhancement Proposal domain:unicode Related to unicode characters and encodings labels Jun 3, 2015
@ScottPJones
Contributor Author

@tkelman What does the "julep" label mean?

@elextr To me, the "convert" and utfxx() functions are boundaries.
The UTF8String, UTF16String, UTF32String constructors don't do checking.
A read would use the convert function, possibly with (as both of us have said) options to support giving either an error, or use a default or supplied replacement, and options to handle different variants of Unicode encoding (Modified UTF-8 that Java uses, or the CESU-8 that Oracle and MySQL use...).
I don't think that we should ever produce an invalidly encoded string, no matter what we accept on conversion. Then "isvalid" for those string types would simply always return true.
People should be able to also use Vector{UInt8}, Vector{UInt16}, and Vector{UInt32} (not Vector{Char}!) to store and manipulate strings that aren't guaranteed to be valid.
At least, IMO, that is what we should eventually get to.
This would also mean, for example, if we are loading from a database where we know that the string data was already validated, we could save a lot of time by skipping the validation step... (as @mbauman's nice little test showed, that can be a substantial gain!)
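To make the point about encoding variants concrete: a strict UTF-8 decoder rejects both the overlong NUL (C0 80) of Modified UTF-8 and the 3-byte surrogate halves of CESU-8, so a boundary function has to opt in to accepting them and then normalize. The sketch below is in Python rather than Julia, and `decode_java_style_utf8` is a hypothetical name; it accepts those variants but still returns only well-formed text.

```python
def decode_java_style_utf8(data: bytes) -> str:
    """Decode Modified UTF-8 / CESU-8 input into a valid string.

    Hypothetical helper for illustration only.
    """
    # Modified UTF-8 encodes NUL as the overlong pair C0 80. C0 is never a
    # valid byte in standard UTF-8, so this replacement cannot hit real text.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the 3-byte surrogate halves used by CESU-8 through.
    halves = data.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins each high/low pair into a single
    # code point; the strict decode rejects any surrogate left unpaired.
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


cesu8_emoji = b"\xed\xa0\xbd\xed\xb8\x80"  # U+1F600 as two 3-byte surrogates
print(decode_java_style_utf8(cesu8_emoji) == "\U0001F600")
print(repr(decode_java_style_utf8(b"a\xc0\x80b")))
```

An unpaired surrogate in the input still raises, so what the function accepts is wider than standard UTF-8 while what it produces is not.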

@peter1000 I was just pointed at the whole mess of search/rsearch/rsearch/searchindex/rsearchindex this weekend by @tkelman (I will have to get back at him somehow, maybe at JuliaCon 😀)
Do you have some nice string search algorithms you'd like to try out? I'd love to play around with trying out KMP, BM, BMH, BMHR, etc. algorithms in Julia... might be a good thing for teaching, to have well written examples of each of the best search algorithms implemented in Julia, because Julia performs so well, and you can play around with graphing the results easily, looking at the LLVM or native code generated... etc.
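For reference, one of the algorithms named above, Boyer-Moore-Horspool (BMH), fits in a few lines. This is a minimal byte-oriented sketch in Python rather than Julia, with a hypothetical function name; it is meant as a readable baseline, not a tuned implementation.

```python
def horspool_search(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1 if absent."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for each byte value, how far the window may slide
    # when that byte is aligned with the last position of the needle.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Slide by the table entry for the byte under the needle's last slot.
        pos += shift[haystack[pos + m - 1]]
    return -1


print(horspool_search(b"hello world", b"world"))
```

Because it works on bytes, it can be run directly over UTF-8 data: a well-formed UTF-8 needle can only match at character boundaries of a well-formed haystack.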

Thanks everybody for joining in the (hopefully productive) discussion!

@tkelman
Contributor

tkelman commented Jun 3, 2015

julep = "julia enhancement proposal"

@pao
Member

pao commented Jun 3, 2015

Roughly an extremely informal version of a Python PEP, or Rust RFC. Eventual level of formality TBD.

@peter1000

Do you have some nice string search algorithms you'd like to try out? I'd love to play around with trying out KMP, BM, BMH, BMHR, etc. algorithms in Julia

I had a look at this stringsearch

But in all tests the Arch Linux default strstr was the fastest, though I did not investigate further. Julia might also have improved in the meantime?

@ScottPJones
Contributor Author

@peter1000 Thanks for the pointer to @gquere's repository... that's precisely the sort of thing I'd like to see, implemented all in Julia, well documented, to show the various tradeoffs of the different algorithms.

@pao & @tkelman Thanks for the info... I just kept thinking about having a mint Julep... I know about PEPs, should have figured it out from that!

@elextr

elextr commented Jun 3, 2015

@ScottPJones Another place where conversions can clearly be unchecked is output, such as writing to the database, since the internal string is correct. But unless this is the only program writing to it, I wouldn't trust it for input.

@peter1000 strstr() is likely to be hardware assisted, so it's usually hard to beat for simple search.

@ScottPJones
Contributor Author

@elextr Since I will be writing both input and output to the database, I can definitely be sure of the input... (the database access is totally controlled...)

@elextr

elextr commented Jun 4, 2015

On 4 June 2015 at 10:20, Scott P. Jones notifications@github.com wrote:

@elextr Since I will be writing both input and output to the database, I can definitely be sure of the input... (the database access is totally controlled...)

Sure you can in a specific application, but a general Julia library can't know that; it has to be up to you to tell it.



@ScottPJones
Contributor Author

@elextr Isn't that precisely where an unsafe_convert method would be appropriate, for skipping the validity checks? I like Julia's convention of naming things unsafe_; it makes people take extra care when bypassing checks, even when the checks are unnecessary.
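A small sketch of that split between a checked entry point and an explicitly named unchecked one. It is in Python rather than Julia, and the `ValidUTF8` class and its `unsafe_from_bytes` constructor are hypothetical names chosen only to mirror the unsafe_ convention; this is not a Julia API.

```python
class ValidUTF8:
    """Wrapper whose normal constructor guarantees well-formed UTF-8.

    Hypothetical type for illustration only.
    """

    def __init__(self, data: bytes):
        # Checked path: raises UnicodeDecodeError on malformed input.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def unsafe_from_bytes(cls, data: bytes) -> "ValidUTF8":
        # Unchecked path: the caller asserts the bytes were validated
        # elsewhere (for example, by the code that wrote them to storage).
        obj = cls.__new__(cls)
        obj._data = bytes(data)
        return obj

    def to_bytes(self) -> bytes:
        return self._data


trusted = "caf\u00e9".encode("utf-8")
s = ValidUTF8.unsafe_from_bytes(trusted)  # skips the validation pass
print(s.to_bytes() == trusted)
```

The library stays safe by default, and the application that knows its data source is trusted has to say so by name at the call site.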

@KristofferC
Sponsor Member

This feels subsumed by #16107


6 participants