Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove RevString; efficient generic reverseind #24708

Merged
merged 4 commits into from
Dec 4, 2017
Merged

Conversation

StefanKarpinski
Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski commented Nov 22, 2017

These seem unrelated, but they're actually linked:

  • If you reverse generic strings by wrapping them in RevString then this generic reverseind is incorrect.

  • In order to have a correct generic reverseind one needs to assume that reverse(s) returns a string of the same type and encoding as s with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs.

Defining reverseind generically to handle corner cases correctly also required introducing ncodeunits and adjusting the behavior of thisind somewhat. I could make separate PRs for those changes, but since they're required here I figured I'd leave them here. The commits are logically separate and can/should reviewed separately.

See also: #10593, #23612, #24103

Closes: #22611, #24613

@StefanKarpinski StefanKarpinski changed the title delete RevString; efficient generic reverseind remove RevString; efficient generic reverseind Nov 22, 2017
@bkamins
Copy link
Member

bkamins commented Nov 22, 2017

@StefanKarpinski Per discussion in #24414 will you rename thisind to curind in this PR?
Also given the proposed index arithmetic approach do you want to handle the issue I have raised in #24255 (comment) in this PR?

@Sacha0
Copy link
Member

Sacha0 commented Nov 22, 2017

(Spelling out currentind would be nice if so. Whether curind refers to something curried, something current, or a spice isn't immediately clear :). Best!)

@StefanKarpinski
Copy link
Sponsor Member Author

After considering it for a while, I think I still prefer the trio of prevind, thisind and nextind.

"""
ncodeunits(s::AbstractString)

The number of code units in a string. For eample, for UTF-8-like data such as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"example"


The number of code units in a string. For eample, for UTF-8-like data such as
the default `String` type, the number of code units is the number of bytes in
the string, aka `sizeof(s)`. For a UTF-16 encoded string type, however, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a.k.a."

reverse(s::AbstractString) -> AbstractString

Reverses a string.
reverse(s::AbstractString) -> String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep AbstractString (or remove any mention of the return type), since an implementation e.g. for UTF32String would rather return a string of the same type.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry, I changed this before I realized that reverse needs to return the same type.


Technically, this function reverses the codepoints in a string, and its
main utility is for reversed-order string processing, especially for reversed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mention of the "main utility" was useful IMHO. It helps understand that the result does not have any particular meaning except for searches. Else people can complain that "Julia doesn't know how to reverse a Unicode string correctly" due to things like reverse("noël") == "l̈eon".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ararslan ararslan added kind:deprecation This change introduces or involves a deprecation domain:strings "Strings!" labels Nov 22, 2017
reverse(s::RevString) = s.string

## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::AbstractString)::String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought in #24613 you wanted to remove the reverse(::AbstractString) fallback, and always return a string of the same encoding (so that reverseind works)?

So shouldn't this be s::Union{String,SubString{String}}, here?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I should definitely fix this. This is a holdover from the first pass through this when I thought that we could have reverse always produce a String.


## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::AbstractString)::String
sprint() do io
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is a String or substring thereof, can't we do better than sprint because we know exactly the size of the buffer that is required? i.e. we can just allocate v = StringVector(sizeof(s)), write into that, and return String(v)? Also, you can just write codeunits directly, rather than converting to/from Char via s[j]...

Oh, I guess you are keeping reverse(s::String) in string.jl .... it seems like that routine could be generalized to work for SubString{String} too, by calling codeunit rather than converting the input to Vector{UInt8}.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'm honestly still just trying to get all the tests to pass 😅

@StefanKarpinski
Copy link
Sponsor Member Author

So the test situation is a bit of a shitshow. But they all seem to be unrelated... so yay?

@StefanKarpinski
Copy link
Sponsor Member Author

Thanks for the careful review @nalimilan and @stevengj – I'll fix and update soon.


## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::S)::S where {S<:AbstractString}
# TODO: generic API for sprint to a particular encoding
Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it's a good idea to have this generic reverse method or not. It does ensure that the reversed string type is the same as the input string type, but it doesn't do so in a particularly efficient way. Calling sprint with a string encoding based on the input string type could make this both generic and efficient, but we don't currently have a way to do that in Base. Thoughts, @nalimilan, @stevengj?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this implementation ensures that the reversed string type is the same as the input string type. Doesn't sprint() always return String?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just implement this for String and SubString, and leave it for custom implementations to implement it. We can always add a generic method after 1.0 if we want.

Copy link
Sponsor Member Author

@StefanKarpinski StefanKarpinski Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has a return type annotation which implicitly calls convert for you. Too subtle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that; still not used to looking for return-type annotations. +1 for omitting this fallback entirely.

@StefanKarpinski StefanKarpinski force-pushed the sk/revstring branch 2 times, most recently from ecc0bf0 to f1ea3aa Compare November 29, 2017 02:18
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
@StefanKarpinski
Copy link
Sponsor Member Author

Well that's a delightful surprise! I guess the remaining question here is deprecation. Should we deprecate RevString? @ararslan, did you already do some work towards that? Should I merge this and let you resurrect the deprecation part of this from your PRs here and on LegacyStrings?

@ararslan
Copy link
Member

ararslan commented Dec 4, 2017

Should we deprecate RevString?

👍

@ararslan, did you already do some work towards that?

Yep, in #23612.

Should I merge this and let you resurrect the deprecation part of this from your PRs here and on LegacyStrings?

Sure.

@StefanKarpinski StefanKarpinski merged commit b8ee561 into master Dec 4, 2017
@StefanKarpinski StefanKarpinski deleted the sk/revstring branch December 4, 2017 21:04
@StefanKarpinski
Copy link
Sponsor Member Author

Ok, all yours, @ararslan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" kind:deprecation This change introduces or involves a deprecation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delete RevString
9 participants