Add method split(str, dlm, ::Val{N}) for allocation-free splitting
#43557
|
This implementation seems to be allocation-free:
splitn(s::AbstractString, ::Val{1}, dlm=isspace, start::Integer=firstindex(s)) = (SubString(s, start, lastindex(s)),)
function splitn(s::AbstractString, ::Val{N}, dlm=isspace, start::Integer=firstindex(s)) where {N}
    N > 0 || throw(ArgumentError("number of split parts $N must be positive"))
    d = findnext(dlm, s, start)
    isnothing(d) && throw(ArgumentError("delimiter not found"))
    return (SubString(s, start, prevind(s, first(d))),
            splitn(s, Val{N-1}(), dlm, nextind(s, last(d)))...)
end
|
Good news: |
|
Actually, if that works, I see little reason to introduce a new function. That is succinct and clear. A new function would just be more stuff people need to learn, and another exported symbol from Base. Better to mention this idiom in the documentation. I'm closing this PR and will open another one that mentions it in the docs. |
… docstrings.
Because `split` allocates a new vector, it is unsuitable in high performance code.
`eachsplit`, while it currently does not avoid allocations, could potentially be used
for higher performance. However, manually extracting the first N elements from an
`eachsplit` iterator is not straightforward.
The `NTuple{N}(eachsplit(str, dlm))` pattern allows for doing N-1 splits without
allocating a container to hold the results.
See discussion in JuliaLang#43557.
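For illustration, the idiom named in the commit message can be sketched as below. The input string and variable names are made up for this example, and it assumes Julia 1.8 or later, where `eachsplit` is available. Passing `limit=3` makes the last field keep the unsplit remainder.

```julia
# Hypothetical input line; any delimiter-separated string works the same way.
line = "alice,30,engineer,extra,fields"

# `limit=3` caps the iterator at three pieces, so the third piece holds
# everything after the second comma. `NTuple{3}` collects them into a tuple.
name, age, rest = NTuple{3}(eachsplit(line, ','; limit=3))

println(name)
println(age)
println(rest)
```

If the string contains fewer than three fields, the `NTuple{3}` constructor throws, which is the failure mode discussed later in this thread.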
|
Looks like the |
|
You have to do
It might be worth adding something like:
split(str::AbstractString, dlm, ::Val{N}) where {N} = NTuple{N}(eachsplit(str, dlm, limit=N))
(Or alternatively an implementation like the one I gave above. Not sure if there is a performance advantage to one or the other?)
_rsplit(s::AbstractString, ::Val{1}, dlm, start::Integer=lastindex(s)) = (SubString(s, firstindex(s), start),)
function _rsplit(s::AbstractString, ::Val{N}, dlm, start::Integer=lastindex(s)) where {N}
    N > 0 || throw(ArgumentError("number of rsplit parts $N must be positive"))
    d = findprev(dlm, s, start)
    isnothing(d) && throw(ArgumentError("delimiter not found"))
    return (SubString(s, nextind(s, last(d)), start),
            _rsplit(s, Val{N-1}(), dlm, prevind(s, first(d)))...)
end
rsplit(str::AbstractString, dlm, v::Val) = reverse(_rsplit(str, v, dlm))
which gives e.g.
julia> rsplit("a.b.c", '.', Val(2))
("a.b", "c") |
|
Given the fact that the |
|
Looks like my |
|
Right, perhaps adding an extra method to Wouldn't it be a better solution to figure out why your implementation is faster than just calling |
You're referring to the
julia> split("a  b", ' ')
3-element Vector{SubString{String}}:
 "a"
 ""
 "b"
So, you only need to implement the (Basically, the In any case, we'll have to implement the |
|
I don't think there is any good reason to have two very similar, but not quite similar, implementations. It just adds mental overhead both for the user who needs to remember two different type signatures, and also a maintenance burden if W.r.t |
|
I wonder if |
|
Interesting idea. The problem I could see with this is that
|
|
Okay I've made the following changes:
From the user's perspective, Any more comments will be appreciated. |
|
Can we bikeshed the name a little?
I don't really love any spelling here (maybe i'd go Final bikeshed-y thought: Do we have any other functions exported from Base for which |
|
Why can't it just be a new method of |
|
It could be helpful for the name of the function to indicate if the numeric argument is
|
@stevengj I addressed most of your comments in this thread:
I haven't benchmarked this thoroughly, but casual benchmarking suggests it's reasonably fast. If you have time to give it a look, I'd appreciate it. |
|
Hmm, I don't love the error mode. Usually when you want to do exactly N splits, you're parsing some known format, and so having it throw an ArgumentError that is out of your control when there are too few fields is annoying. I'll see if I can figure out how to make it return nothing. |
This method splits `str` to an `NTuple`, generally without allocations. This is useful when dealing with performance sensitive string processing, and is a generalization of Python's `str.partition` and Rust's `str::split_once`.
|
Ok - I'm stuck here. Since this function is going to be used in parsing, I believe having it throw when trying to create too long a tuple (when there are too few delimiters) is unacceptable. However, I can't figure out how to make it allocation-free while also being able to return |
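A minimal sketch of what a `nothing`-returning variant could look like, adapted from the recursive `splitn` posted earlier in this thread. The name `trysplitn` is hypothetical and not part of Base, and whether the compiler keeps this version allocation-free has not been checked here.

```julia
# Hypothetical helper (not in Base): like `splitn` above, but returns
# `nothing` instead of throwing when there are too few delimiters.
trysplitn(s::AbstractString, ::Val{1}, dlm, start::Integer=firstindex(s)) =
    (SubString(s, start, lastindex(s)),)

function trysplitn(s::AbstractString, ::Val{N}, dlm, start::Integer=firstindex(s)) where {N}
    N > 0 || throw(ArgumentError("number of split parts $N must be positive"))
    d = findnext(dlm, s, start)
    isnothing(d) && return nothing        # too few fields: signal failure
    rest = trysplitn(s, Val{N-1}(), dlm, nextind(s, last(d)))
    isnothing(rest) && return nothing     # propagate failure from the tail
    return (SubString(s, start, prevind(s, first(d))), rest...)
end

# Usage: the caller checks for `nothing` instead of catching an exception.
fields = trysplitn("a.b", Val(3), '.')
if fields === nothing
    println("malformed line")
else
    println(fields)
end
```

The caller can then branch on `nothing` to recover from malformed input, which is the error-recovery behavior asked for above.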
Isn't this just: |
|
That works when there is the correct number of fields in the string. However, it's difficult to handle malformed input (which must be expected when parsing text) using that approach - there is no mechanism for recovering from the error that will be thrown if you expect 6 fields but only get 5. As an example, here is a pattern that I often use in Julia: |
|
True, that can be more tedious to write out, and doesn't quite detect when there are not enough fields
julia> part1, part2, part3, part4, part5, eol = Iterators.flatten((eachsplit(str, limit=6, keepempty=true), Iterators.repeated("")))
@assert isempty(eol) "incorrect limit value" |
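To make the padding pattern above concrete, here is a small self-contained run. The input `str` is a made-up example with only three whitespace-separated fields, and the sketch assumes Julia 1.8 or later for `eachsplit`.

```julia
# Hypothetical input with fewer fields than the five we hope to read.
str = "a b c"

# Pad the split iterator with an endless supply of empty strings, so that
# destructuring six names never runs out of elements.
padded = Iterators.flatten((eachsplit(str; limit=6, keepempty=true),
                            Iterators.repeated("")))
part1, part2, part3, part4, part5, eol = padded

# `eol` is non-empty only when the input had more than five fields.
@assert isempty(eol) "incorrect limit value"

# Missing fields come back as empty strings rather than raising an error.
println((part1, part2, part3, part4, part5))
```

As the comment notes, a missing field and a genuinely empty field look the same here, so this pattern cannot tell "too few fields" apart from "empty field".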
|
Having run into this sort of case myself a lot, @jakobnissen do you have any plans to return to this PR? |
|
It got stuck because I found the function to be mostly useful when parsing. But there, you need error recovery - i.e. a |
Often, when processing text, you run into the need to split some string at a delimiter. Julia already provides the `eachsplit` iterator, as well as the `split` function. However, the `split` function returns an array and hence forces an allocation, which makes it unsuitable in performance sensitive code. Working directly with an `eachsplit` iterator is possible in v1.8, but it's bothersome to call `iterate(itr, state)` multiple times to split a line in say, 5 parts.
This issue is not unique to Julia. Python provides the method `str.partition`, which splits a string exactly once and returns the result as a tuple. Rust provides the similar `str::split_once`. Convenience methods like this are very useful when text processing. In Julia, however, we can generalize it to N splits, while still avoiding allocations.
This PR adds a new method to `split`: `split(::AbstractString, dlm, ::Val{N})`. I believe this is worth adding another export to Base because
Note that this PR is based on #51646, which needs to be merged first.
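For context, the manual `iterate` calls that the description calls bothersome look roughly like this. The input line and variable names are made up for illustration, and the sketch assumes Julia 1.8 or later.

```julia
# Hypothetical key/value line to split once at the first '='.
line = "key=value=more"
itr = eachsplit(line, '='; limit=2)

# First field: start the iteration protocol by hand.
y = iterate(itr)
y === nothing && error("no fields")
key, state = y

# Second field: with `limit=2` this is the unsplit remainder.
y = iterate(itr, state)
y === nothing && error("expected a second field")
val, state = y

println(key)
println(val)
```

Each extra field needs another three lines of this, which is the boilerplate a tuple-returning method would remove.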