Stringapalooza #16107

StefanKarpinski · 2016-04-28T22:17:29Z

0.5 Major tasks

Round 1

Base: merge ASCIIString, UTF8String and ByteString into String (replace ASCIIString & UTF8String with String #16058)
Compat: add String (String: alias for new String or old UTF8String Compat.jl#192)

Round 2

Base: replace utf8, bytestring ~~and string~~ with String (replace bytestring with String #16453, deprecate utf8 for String #16469)
Base: replace s = ascii(s) with s = String(s); isascii(s) || error(...) (ascii: only support checking String for pure ASCIIness #16396)

Round 3

Base: cleanup conversion mess to and from string types (String constructor can segfault #16470, String constructor is bad #16713, restore non-copying String behavior, add unsafe_string*, ... #16731)
Base: provide a way to interact with Windows APIs via Cwstring (deprecate WString and wstring #16975, Base.transcode to replace utf8to16 and utf16to8 #16974)
make package with ASCII, Latin-1, UTF-8, UTF-16 and UTF-32 string types (LegacyStrings).
Base: remove UTF16String and UTF32String (remove UTF-16 and UTF-32 stuff #16590)

Cleanup tasks

0.6 Major tasks

Round 4

Base: change Char representation (allow lossless string processing of any data)
Base: remove RepString (moved to LegacyStrings)
Base: remove RevString (move to package?)
Base: merge SubString and String (add offset field to String)

Cleanup tasks

make prevind("ll", 5) and similar out-of-range calls throw errors (replace ASCIIString & UTF8String with String #16058 (comment))
replace ASCIIString & UTF8String with String #16058 (comment)
improve isspace implementation (replace ASCIIString & UTF8String with String #16058 (comment))
figure out why removing seemingly redundant convert method breaks bootstrap (replace ASCIIString & UTF8String with String #16058 (comment))
simplify takebuf API (replace ASCIIString & UTF8String with String #16058 (comment), Simplify takebuf() API #19088)
move docstrings inline
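
For reference, a small sketch of the behaviour the prevind item asks for. It assumes a Julia version (1.0 or later) in which out-of-range indices raise a BoundsError; the helper name `safe_prevind` is hypothetical and only serves to make the error visible as a value.

```julia
# Hypothetical helper: returns the previous index, or `nothing` when `i` is out of range.
function safe_prevind(s::AbstractString, i::Integer)
    try
        return prevind(s, i)
    catch err
        err isa BoundsError || rethrow()
        return nothing
    end
end

in_range  = safe_prevind("ll", 3)  # one past the end is still accepted
out_range = safe_prevind("ll", 5)  # beyond that, prevind throws
```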

JeffBezanson · 2016-04-28T22:20:52Z

Great list.

What about windows APIs that use utf-16?

StefanKarpinski · 2016-04-28T22:24:33Z

> What about windows APIs that use utf-16?

Already taken care of: #15033.

Two more potential rounds:

removing RepString and RevString
merging SubString and String

Might not get to those until the next release though.
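
To make the SubString item concrete, here is a minimal sketch of how the two types relate today, assuming current Base behaviour: a SubString is a view that refers to its parent's data, while String(sub) produces an independent copy. Merging the two types, as proposed above, would remove this distinction.

```julia
parent_str = "hello world"
sub = SubString(parent_str, 7, 11)  # a view into parent_str; no bytes are copied
copy_str = String(sub)              # materialises an independent String
```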

quinnj · 2016-04-28T22:47:05Z

What about:

String iteration: making this faster, based on your experiments here in the past
Decide on String indexing: Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

I definitely think there's a lot to play with around merging SubString and String, but it certainly feels like 0.6 material.
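
For context on the indexing question, a short sketch of what byte-based indexing looks like in practice. It assumes a UTF-8 encoded String, where not every integer is a valid character index.

```julia
s = "αβ"                          # each character takes two bytes in UTF-8
starts = collect(eachindex(s))    # byte indices at which characters begin
second = s[nextind(s, 1)]         # step from one character to the next
mid_ok = isvalid(s, 2)            # index 2 falls inside the first character
```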

tkelman · 2016-04-29T01:26:14Z

readstring(io) was only just added, but now that it is just String(read(io)), it may not need a separate name in the long run.
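
A minimal sketch of the equivalence mentioned here, using an in-memory IOBuffer as a stand-in for a file or socket. The read(io, String) form assumes Julia 1.0 or later.

```julia
io = IOBuffer("line one\nline two\n")
contents = String(read(io))   # read all remaining bytes, then wrap them as a String

seekstart(io)
same = read(io, String)       # the spelling later Julia versions provide for this
```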

lobingera · 2016-04-29T06:37:11Z

Just out of curiosity: why should this be part of 0.5? The title of the release had something to do with Arrays, not Strings.

nalimilan · 2016-04-29T10:09:17Z

+1 for including as much of this plan as possible in 0.5. Breakage had better happen soon.

As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see discussion on an implementation for any encoding in StringEncodings.jl here.

As regards changing Char's underlying representation (round 4, bullet 1), another step would be to introduce AbstractChar and use it in method signatures, to allow e.g. the ASCIIString replacement to implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.

EDIT: Finally, I'd like to see a discussion of whether String should be guaranteed to hold only valid UTF-8 data, and of how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe it's better to open a separate issue for that?
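
A rough sketch of what such a type could look like, assuming a Julia version that has AbstractChar (0.7 or later). `ASCIIChar` is hypothetical and not part of Base; the sketch only covers the two pieces the AbstractChar interface relies on, a constructor from UInt32 and a `codepoint` method.

```julia
# Hypothetical type, not part of Base: a one-byte character restricted to ASCII.
struct ASCIIChar <: AbstractChar
    byte::UInt8
    function ASCIIChar(x::UInt32)
        x < 0x80 || throw(ArgumentError("not an ASCII code point: $x"))
        return new(x % UInt8)
    end
end

# Generic AbstractChar fallbacks (conversion to Char, comparison) build on this.
Base.codepoint(c::ASCIIChar) = UInt32(c.byte)

a = ASCIIChar(UInt32('a'))
```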

StefanKarpinski · 2016-04-29T12:29:26Z

> Just out of curiosity: why should this be part of 0.5? The title of the release had something to do with Arrays, not Strings.

Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.

lobingera · 2016-04-29T12:47:40Z

I somewhat agree that it will not hold up the release of the language. But more syntax changes in 0.5 add to the work of porting packages to 0.5; maybe I'm just wrong that this causes effort ...

nalimilan · 2016-04-29T14:04:58Z

> I somewhat agree that it will not hold up the release of the language. But more syntax changes in 0.5 add to the work of porting packages to 0.5; maybe I'm just wrong that this causes effort ...

And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...

stevengj · 2016-04-29T15:47:35Z

#15033 didn't fully provide a path for external packages to call Windows APIs without calling internal functions.

StefanKarpinski · 2016-04-29T16:45:01Z

True – added to the roadmap.

StefanKarpinski · 2016-05-06T15:14:37Z

Actually, question here: do people think we should keep ascii and have it convert strings to the standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.
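
The semantics in question can be sketched in a few lines. The helper name `to_ascii_string` is hypothetical; it simply spells out "convert to String, then error on non-ASCII content".

```julia
# Hypothetical helper mirroring the proposed semantics of `ascii`.
function to_ascii_string(s::AbstractString)
    str = String(s)
    isascii(str) || throw(ArgumentError("string contains non-ASCII characters"))
    return str
end

ok = to_ascii_string("abc")
```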

StefanKarpinski · 2016-05-06T16:31:29Z

Another question about behavior. String and string don't always behave the same way:

julia> String(UInt8[97,98,99])
"abc"

julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"

Any thoughts on resolving this? Currently utf8 and ascii behave like String not like string.
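
The same contrast as a script rather than a REPL session. The exact text produced by string for an array depends on the Julia version, so this sketch only checks its general shape.

```julia
printed = string(UInt8[97, 98, 99])  # textual representation of the array, as print would show it
wrapped = String(UInt8[97, 98, 99])  # interprets the bytes as UTF-8 data
```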

stevengj · 2016-05-06T16:37:04Z

Maybe String! if it takes ownership of the array, i.e. is "in-place"?

There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.

StefanKarpinski · 2016-05-06T16:38:11Z

I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.

stevengj · 2016-05-06T16:39:54Z

String¡, then.

JeffBezanson · 2016-05-06T16:41:57Z

This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.

stevengj · 2016-05-06T16:42:50Z

The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).
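
For readers on later Julia versions, a sketch of how these cases ended up being spelled, assuming Julia 1.0 or later: unsafe_string copies from a pointer, String(copy(v)) copies from a vector, and String(v) is the non-copying form that may take ownership of the vector's memory.

```julia
bytes = UInt8[0x61, 0x62, 0x63]

# Copying from a pointer with a known length (the role bytestring(::Ptr{UInt8}, len) played).
from_ptr = GC.@preserve bytes unsafe_string(pointer(bytes), length(bytes))

# Copying from a vector: the original vector stays usable.
from_copy = String(copy(bytes))

# Non-copying: String(v) may take ownership of v's memory, so v should not be used afterwards.
owned = String(bytes)
```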

It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?

JeffBezanson · 2016-05-06T16:45:43Z

Can we just keep bytestring?

stevengj · 2016-05-06T16:46:34Z

@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?

JeffBezanson · 2016-05-06T16:47:32Z

I'm not prepared to make this the first ever case where ! means something other than mutation.

stevengj · 2016-05-06T16:48:07Z

Then what do we call the non-copying version(s) of bytestring?

JeffBezanson · 2016-05-06T16:51:09Z

I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?

stevengj · 2016-05-06T16:54:14Z

@JeffBezanson, the problem is that then it can't replace string.

JeffBezanson · 2016-05-06T16:57:56Z

That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.

StefanKarpinski · 2016-06-02T16:54:58Z

TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).
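
A small sketch of the NUL-free half of this requirement as it behaves in later Julia versions. It assumes a platform whose C runtime exposes `strlen` to ccall; `c_strlen` is a hypothetical wrapper name.

```julia
# Pass a Julia string to C's strlen through the Cstring argument type.
c_strlen(s::String) = Int(ccall(:strlen, Csize_t, (Cstring,), s))

n = c_strlen("abc")   # C sees a NUL-terminated buffer

# Strings with an embedded NUL are rejected by the Cstring conversion.
rejected = try
    c_strlen("a\0b")
    false
catch err
    err isa ArgumentError
end
```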

StefanKarpinski · 2016-06-02T16:58:32Z

Ref #16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?
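
One way this can be expressed is with transcode, which the roadmap lists as the replacement for utf8to16 and utf16to8. A minimal sketch, assuming a Julia version where transcode(UInt16, s) and transcode(String, v) are available:

```julia
s = "héllo"
units = transcode(UInt16, s)          # Vector{UInt16} with no trailing NUL
roundtrip = transcode(String, units)  # back to UTF-8
```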

stevengj · 2016-06-02T19:13:58Z

@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.

JeffBezanson · 2016-06-28T05:26:12Z

Which items are slated for 0.5? Through round 3, IIRC?

StefanKarpinski · 2016-06-28T05:29:38Z

Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.

JeffBezanson · 2017-01-04T21:46:34Z

PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?

StefanKarpinski · 2017-01-06T22:12:18Z

While you're doing string stuff, it's probably not too hard to just actually do utf-8 reversal on strings instead of using the RevString type; that would advance the highlander agenda just a bit further. (If you feel like it and have some spare time while waiting for type revamp tests to run or something.)
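
A naive sketch of what "actual" reversal means here: reverse by characters, not by bytes, and return a plain String instead of a lazy wrapper. `reverse_chars` is a hypothetical name, and this is not how Base implements reverse.

```julia
# Collect the characters, reverse their order, and join them back into a String.
reverse_chars(s::AbstractString) = join(reverse(collect(s)))

r = reverse_chars("añb")
```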

StefanKarpinski · 2017-07-20T18:39:10Z

Other than removing RevString, everything that's likely to be done is already done.

StefanKarpinski self-assigned this Apr 28, 2016

StefanKarpinski added the unicode and strings labels Apr 28, 2016

StefanKarpinski added this to the 0.5.0 milestone Apr 28, 2016

nalimilan mentioned this issue Apr 29, 2016

replace ASCIIString & UTF8String with String #16058

Merged

This was referenced May 20, 2016

deprecate utf8 for String #16469

Merged

remove UTF-16 and UTF-32 stuff #16590

Merged

ararslan mentioned this issue Jun 2, 2016

String constructor is bad #16713

Closed

This was referenced Jun 29, 2016

Register LegacyStrings: v0.1.0 JuliaLang/METADATA.jl#5502

Merged

bug in printing invalid char #17271

Closed

StefanKarpinski added the needs docs label Jul 11, 2016

StefanKarpinski added a commit that referenced this issue Jul 11, 2016

dlmread: add deprecation warning for ignore_invalid_chars option

e95f5f2

Part of #16107

StefanKarpinski modified the milestones: 0.6.0, 0.5.0 Jul 11, 2016

mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016

dlmread: add deprecation warning for ignore_invalid_chars option

9a0726f

Part of JuliaLang#16107

StefanKarpinski mentioned this issue Sep 6, 2016

display something useful for text/plain output of invalid String #18296

Merged

damiendr mentioned this issue Sep 16, 2016

replaced version test with test for defined symbols JuliaStrings/LegacyStrings.jl#7

Closed

nalimilan mentioned this issue Oct 24, 2016

Simplify takebuf() API #19088

Merged

nalimilan mentioned this issue Jan 5, 2017

Julep: Support universal newlines and make it default for Text IO #19785

Open

JeffBezanson modified the milestones: 1.0, 0.6.0 Jan 5, 2017

KristofferC mentioned this issue May 26, 2017

RFC: Roadmap for improving string support in Julia #11558

Closed

This was referenced Jun 29, 2017

delete RevString #22611

Closed

String as more of a byte vector type? #22616

Closed

StefanKarpinski closed this as completed Jul 20, 2017

StefanKarpinski added the excision label Sep 21, 2017

zaneli mentioned this issue Dec 7, 2017

replace bytestring with String(copy(v)) JuliaIO/Parquet.jl#13

Merged
