Skip to content

Conversation

@jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Dec 26, 2021

Often, when processing text, you run into the need to split some string at a delimiter. Julia already provides the eachsplit iterator, as well as the split function. However, the split function returns an array and hence forces an allocation, which makes it unsuitable in performance sensitive code. Working directly with a eachsplit iterator is possible in v1.8, but it's bothersome to call iterate(itr, state) multiple times to split a line in say, 5 parts.

This issue is not unique to Julia. Python provides the method str.partition, which splits a string exactly once and returns the result as a tuple. Rust provides the similar str::split_once. Convenience methods like this are very useful when text processing. In Julia, however, we can generalize it to N splits, while still avoiding allocations.

This PR adds a new method to split: split(::AbstractString, dlm, ::Val{N}). I believe this is worth adding another export to Base because

  • Splitting is a common task with strings
  • The API provided by the new method is both convenient and can be implemented fast enough that users won't have to roll their own
  • It is not completely trivial to figure out how to correctly write this function.

Note that this PR is based on #51646, which needs to be merged first.

@dkarrasch dkarrasch added the strings "Strings!" label Dec 27, 2021
@stevengj
Copy link
Member

stevengj commented Dec 29, 2021

This implementation seems to be allocation-free:

splitn(s::AbstractString, ::Val{1}, dlm=isspace, start::Integer=firstindex(s)) = (SubString(s, start, lastindex(s)),)
function splitn(s::AbstractString, ::Val{N}, dlm=isspace, start::Integer=firstindex(s)) where {N}
    N > 0 || throw(ArgumentError("number of split parts $N must be positive"))
    d = findnext(dlm, s, start)
    isnothing(d) && throw(ArgumentError("delimiter not found"))
    return (SubString(s, start, prevind(s, first(d))),
            splitn(s, Val{N-1}(), dlm, nextind(s, last(d)))...)
end

@JeffBezanson
Copy link
Member

Good news: NTuple{N}(eachsplit(str, delim)) appears to work for this with no allocation 🎉

@jakobnissen
Copy link
Member Author

jakobnissen commented Jan 8, 2022

Actually, if that works, I see little reason to introduce a new function. That is succint and clear. A new function would just be more stuff people need to learn, and another exported symbol from Base. Better to mention this idiom in the documentation. I'll make another PR.

Closing this PR, and will instead make another one that mentions it in the docs.

@jakobnissen jakobnissen closed this Jan 8, 2022
jakobnissen added a commit to jakobnissen/julia that referenced this pull request Jan 8, 2022
… docstrings.

Because `split` allocates a new vector, it is unsuitable in high performance code.
`eachsplit`, while it currently does not avoid allocations, could potentially be used
for higher performance. However, manually extracting the first N elements from an
`eachsplit` iterator is not straightforward.
The `NTuple{N}(eachsplit(str, dlm))` pattern allows for doing N-1 splits without
allocating a container to hold the results.

See discussion in JuliaLang#43557.
@jakobnissen jakobnissen reopened this Jan 10, 2022
@jakobnissen
Copy link
Member Author

Looks like the NTuple{N}(eachsplit(s, dlm)) doesn't quite work. In the example NTuple{2}(eachsplit("a.b.c", '.')), it returns ("a", "b") where it should return ("a", "b.c").

@stevengj
Copy link
Member

stevengj commented Jan 10, 2022

You have to do NTuple{2}(eachsplit("a.b.c", '.', limit=2)) to get ("a", "b.c").

It might be worth adding something like:

split(str::AbstractString, dlm, ::Val{N}) where {N} = NTuple{N}(eachsplit(str, dlm, limit=N))

(Or alternatively an implementation like the one I gave above. Not sure if there is a performance advantage to one or the other?)

(The rsplit method requires #43740.) Would also be nice to have a similar rsplit method, but that requires an eachrsplit method that we don't have, or alternatively something like:

_rsplit(s::AbstractString, ::Val{1}, dlm, start::Integer=lastindex(s)) = (SubString(s, firstindex(s), start),)
function _rsplit(s::AbstractString, ::Val{N}, dlm, start::Integer=lastindex(s)) where {N}
    N > 0 || throw(ArgumentError("number of rsplit parts $N must be positive"))
    d = findprev(dlm, s, start)
    isnothing(d) && throw(ArgumentError("delimiter not found"))
    return (SubString(s, nextind(s, last(d)), start),
            _rsplit(s, Val{N-1}(), dlm, prevind(s, first(d)))...)
end
rsplit(str::AbstractString, dlm, v::Val) = reverse(_rsplit(str, v, dlm))

which gives e.g.

julia> rsplit("a.b.c", '.', Val(2))
("a.b", "c")

@stevengj
Copy link
Member

Given the fact that the rsplit case is impossible to accomplish with eachsplit, it seems worthwhile to support Val arguments for both split and rsplit.

@stevengj
Copy link
Member

stevengj commented Jan 10, 2022

Looks like my splitn implementation in #43557 (comment) is about 20% faster than NTuple{N}(eachsplit(str, dlm, limit=N)) for N=2 on a small string (which is probably the most common case for this functionality). Since it's only 8 lines, and an analogous implementation is required anyway for rsplit, it seems like it would be good to use it.

@jakobnissen
Copy link
Member Author

jakobnissen commented Jan 10, 2022

Right, perhaps adding an extra method to split is preferable to adding a new function. It's better for discovery, and also avoids exporting another name.

Wouldn't it be a better solution to figure out why your implementation is faster than just calling NTuple on eachsplit though? I would much prefer if this function simply delegated the hard work to eachsplit so we avoid having two implementations of the same functionality. There is also some edge cases that need to be considered, for example, splitting "a b" (note: two spaces) on space should yield ("a", "b"), not ("a", " b"). These edge cases are already taking care of by eachsplit.

@stevengj
Copy link
Member

stevengj commented Jan 10, 2022

There is also some edge cases that need to be considered

You're referring to the keepempty keyword argument, which only defaults to false if you omit the delimiter argument from split or eachsplit. If you pass a dlm argument explicitly, as in my rsplit method above, it defaults to true and the behavior is consistent with my implementation:

julia> split("a  b", ' ')
3-element Vector{SubString{String}}:
 "a"
 ""
 "b"

So, you only need to implement the keepempty behavior if you explicitly support this keyword or if you support not passing a dlm argument, and I would tend to suggest doing neither for the Val(N) methods.

(Basically, the Val(N) methods only make sense to me if you have a very rigid format you want to parse. You can always use r"\s+" as the dlm to treat multiple spaces as a single space, of course.)

In any case, we'll have to implement the rsplit case manually anyway, so we'll want to implement split and rsplit consistently.

@jakobnissen
Copy link
Member Author

I don't think there is any good reason to have two very similar, but not quite similar, implementations. It just adds mental overhead both for the user who needs to remember two different type signatures, and also a maintenance burden if eachsplit and split_n needs to behave differently and/or have different implementations.

W.r.t rsplit, I think the better solution is to first implement eachrsplit, then build rsplit on top of that.

@vtjnash
Copy link
Member

vtjnash commented Feb 3, 2022

I wonder if splitonce(haystack, needle; keep) = (left, right) would be sufficient here, and therefore better, and then instruct the user to apply folds themselves if they need to iteratively take more from the right (rather than implementing internally what are probably essentially folds with ntuples)?

@jakobnissen
Copy link
Member Author

Interesting idea. The problem I could see with this is that

  • Users need to figure out for themselves how to do multiple splits using folds, which may not be obvious. In fact, I can't figure out how to do it
  • Users might implement this in different ways, and it'll be hard to make sure they are all allocation-free, compared to giving a single implementation

@jakobnissen
Copy link
Member Author

Okay I've made the following changes:

  • Now just calls NTuple{N}(eachsplit(str, dlm, limit=N)) internally. eachsplit is not optimized, but it's better to optimize that than to have a separate, faster implementation here. That's just duplicate work.
  • Add methods so the interface matches split as closely as possible.

From the user's perspective, split_n should work as much like split as possible, while not allocating. If necessary, we can then later add specialized, faster methods for e.g. split_n(::Union{String, SubString{String}}, ::Val, ::Char).

Any more comments will be appeciated.

See also #45393 and #40120

@nickrobinson251
Copy link
Contributor

nickrobinson251 commented Oct 10, 2022

Can we bikeshed the name a little? split_n just looks a little odd to me. I don't think it fits any existing naming convention in Base.

  • Perhaps just because Base omits the underscore in most cases, so perhaps splitn?
  • Alternatively we have n as the prefix in ntuple, ndims -- which suggest nsplit or nsplits?
  • If we want something perhaps more informative than "n" how about "into" like split_into(str, 3) which i think reads okay (split_into(str, 3) reads as "str split into 3") or (split_into(str, 3, 'a') reads as "str split into 3 on 'a'") -- so split_into?

I don't really love any spelling here (maybe i'd go nsplit), but split_n still looks most odd.

Final bikeshed-y thought: Do we have any other functions exported from Base for which Val must be passed? (Maybe reshape?) I wonder if there should be a method that accepts an Int and wraps it in Val? Or is that too much of a performance foot-gun (given the motivation for this existing is perf)?

@stevengj
Copy link
Member

Why can't it just be a new method of split?

@jariji
Copy link
Contributor

jariji commented Oct 27, 2022

It could be helpful for the name of the function to indicate if the numeric argument is

  • the number of values in the output, such as split_into, split_parts
  • the number of split points, such as split_times

@jakobnissen jakobnissen changed the title Add split_n function Add method split(str, dlm, ::Val{N}) for allocation-free splitting Oct 10, 2023
@jakobnissen jakobnissen added the feature Indicates new feature / enhancement requests label Oct 10, 2023
@jakobnissen jakobnissen requested a review from stevengj October 10, 2023 11:56
@jakobnissen
Copy link
Member Author

@stevengj I addressed most of your comments in this thread:

  • It's now a method of split and rsplit
  • It now also covers rsplit

I haven't benchmarked this throughly, but casual benchmarking suggests it's reasonably fast. If you have time to give it a look, I'd appreciate it.

@jakobnissen
Copy link
Member Author

Hmm, I don't love the error mode. Usually when you want to do exactly N splits, you're parsing some known format, and so having it throw an ArgumentError out if your control with too few fields is annoying. I'll see if I can figure out how to make it return nothing.

This method splits `str` to an `NTuple`, generally without allocations.
This is useful when dealing with performance sensitive string processing, and
is a generalization of Python's `str.partition` and Rust's `str::split_once`.
@jakobnissen jakobnissen removed the request for review from stevengj October 11, 2023 15:56
@jakobnissen jakobnissen marked this pull request as draft October 11, 2023 15:56
@jakobnissen
Copy link
Member Author

Ok - I'm stuck here. Since this function is going to be used in parsing, I believe having it throw when trying to create too long a tuple (when there are too few delimiters) is unacceptable. However, I can't figure out how to make it allocation-free while also being able to return nothing in the error case.

@vtjnash
Copy link
Member

vtjnash commented Oct 12, 2023

it's bothersome to call iterate(itr, state) for each part

Isn't this just:
part1, part2, parts... = eachsplit(str)

@jakobnissen
Copy link
Member Author

jakobnissen commented Oct 12, 2023

That works when there is the correct number of fields in the string. However, it's difficult to handle malformed input (which must be expected when parsing text) using that approach - there is no mechanism for recovering from the error that will be thrown if you expect 6 fields but only get 5.

As an example, here is a pattern that I often use in Julia:

for line in lines
    fields = split(line, '\t')
    length(fields) == 6 || error( <some informative error> )
    (a, b, c, d, e, f) = fields
    [ do stuff ]

@vtjnash
Copy link
Member

vtjnash commented Oct 12, 2023

True, that can be more tedious to write out, and doesn't quite detect when there are not enough fields

julia> part1, part2, part3, part4, part6, eol = Iterators.flatten((eachsplit(str, limit=6, keepempty=true), Iterators.repeated("")))
@assert isempty(eol) "incorrect limit value"

@tecosaur
Copy link
Member

Having run into this sort of case myself a lot, @jakobnissen do you have any plans to return to this PR?

@jakobnissen
Copy link
Member Author

It got stuck because I found the function to be mostly useful when parsing. But there, you need error recovery - i.e. a trysplit function that can return nothing if there are too few elements.
I couldn't figure out how to do that, so now I instead preallocate a vector, and then do append!(v, eachsplit(str, ...)).
But if it could be done, it would be better API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Indicates new feature / enhancement requests strings "Strings!"

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants