Add a default implementation of `length` using `iterate`. #35947

andyferris · 2020-05-20T01:40:26Z

Here is an alternative to #35946.

For all iterables with SizeUnknown we default to using iterate to get the length when requested. Algorithms can still take care to check IteratorSize on unknown iterables in order to know if the function is O(1) or not.

The particular (new) behavior I am seeking is:

julia> (length ∘ skipmissing)([1, 2, missing, 4])
3

but I've had to write this definition elsewhere before, so I thought it made sense in Base. I also believe we should tie length to the iterate API, generically.
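For readers following along, here is a minimal sketch of what an iterate-based fallback could look like. The name `iterlength` is hypothetical and chosen so the sketch does not overload `Base.length`; it is not necessarily the definition in this PR.

```julia
# Hypothetical helper (not part of Base): count elements by walking the
# iterate protocol until it returns `nothing`.
function iterlength(itr)
    n = 0
    next = iterate(itr)
    while next !== nothing
        n += 1
        _, state = next
        next = iterate(itr, state)
    end
    return n
end

(iterlength ∘ skipmissing)([1, 2, missing, 4])  # 3
```

This is O(n) and, as discussed further down the thread, it consumes stateful iterators.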

tkf · 2020-05-20T04:08:36Z

I don't think Base.IteratorSize is enough to check if it is safe to use this implementation. It'd break (somewhat naive) generic code when the iterator is stateful. Maybe we need another iterator trait like IteratorState (with :> Union{PureState,MutableState}).

JeffBezanson · 2020-05-20T05:48:10Z

I agree with @tkf; I think this violates the spirit of length. It might be ok for length to be O(n) provided the iterator indeed has a fixed, known length (e.g. a linked list), but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms. A statefulness/non-repeatability trait makes sense, but it's not 100% the same issue. For instance, you can read a whole file to count the number of lines, and then read it again to look at the data, but you probably don't want to do that, and if you do you should be pretty explicit.
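A small sketch of the file-reading concern, using an in-memory `IOBuffer` as a stand-in for a file (an assumption made only to keep the example self-contained). Counting the lines consumes the stream, so a second pass sees nothing unless the caller explicitly rewinds.

```julia
io = IOBuffer("a\nb\nc\n")

# Counting lines reads the stream to the end.
first_pass = count(_ -> true, eachline(io))   # 3

# The stream is now exhausted, so a second pass finds no lines.
second_pass = count(_ -> true, eachline(io))  # 0

# Being explicit: rewind before reading again.
seekstart(io)
third_pass = count(_ -> true, eachline(io))   # 3
```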

andyferris · 2020-05-20T09:06:11Z

Yeah - we could do with a trait to figure out if iterate is pure, and there would definitely be gotchas with this otherwise.

I'm assuming I should focus efforts on #35946 until then?

tkf · 2020-05-20T20:19:25Z

I don't think there is a way to implement iterate-based length reliably even if we restrict it to skipmissing. It's very easy to come up with a case where this doesn't work:

julia> using Base.Broadcast: broadcasted, instantiate

julia> itr = (y ? missing : x for (x, y) in instantiate(broadcasted(tuple, 1:1000, broadcasted(rand, Bool))));

julia> Base.IteratorSize(itr)
Base.HasShape{1}()

julia> xs = skipmissing(itr);

julia> count(_ -> true, xs)
512

julia> count(_ -> true, xs)
470

andyferris · 2020-05-20T23:27:25Z

> reliably

@tkf Your example is that of a random iterator. You can't do anything reliable with it! 🤣 I would expect sum, count or any other reduction function to behave randomly - similarly for length.

> but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms

Sorry @JeffBezanson I didn't quite understand this. Don't we use IteratorSize for that, rather than catch an error from length?

> I think this violates the spirit of length.

> and if you do you should be pretty explicit.

I guess that's my point - I'm not sure I understand the spirit; I always assumed it was linked to the iterate API. When I type length ∘ skipmissing I already feel I am pretty explicitly saying I'm gonna do me some old-fashioned counting. I can call any mapreduce-like function on such an iterable (like, say, count). How do I generically opt into counting how long the iterable is without resorting to typing count(_->true, itr)? Do we need a distinct function to represent that operation - i.e. would we like to export separate functions for length (quicklength and slowlength, so to speak)? Should there be a single-argument count? (That doesn't make it amenable to currying, though.)

andyferris · 2020-05-20T23:36:39Z

Is the argument that sometimes for some stateful iterators you may wish to "peek" at the length without popping everything off?

andyferris · 2020-05-20T23:40:23Z

> but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms

Or is it this: if the length is above some threshold, we use an incremental algorithm (that might be slower than an in-memory one)? Wouldn't the check already be something along the lines of IteratorSize(itr) != SizeUnknown() && length(itr) < cutoff, to avoid getting an error from length as it stands without this PR?
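A sketch of the kind of trait check described here. The function name `known_shorter_than` is hypothetical. It tests for `HasLength`/`HasShape` rather than "not `SizeUnknown`", which is an assumption on my part so that `IsInfinite` iterators are also excluded before `length` is called.

```julia
# Hypothetical helper: true only when the length is known up front
# and is below `cutoff`. Never calls `length` on unsized iterators.
function known_shorter_than(itr, cutoff)
    Base.IteratorSize(itr) isa Union{Base.HasLength, Base.HasShape} || return false
    return length(itr) < cutoff
end

known_shorter_than(1:10, 100)                        # true
known_shorter_than(skipmissing([1, missing]), 100)   # false (SizeUnknown)
known_shorter_than(Iterators.repeated(1), 100)       # false (IsInfinite)
```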

tkf · 2020-05-21T00:05:00Z

This PR can introduce a segfault in the programs that previously safely threw an error. For example:

function maptoints(f, xs)
    ys = Vector{Int}(undef, length(xs))
    for (i, x) in enumerate(xs)
        @inbounds ys[i] = f(x)
    end
    return ys
end

The tension here is length-as-reduction vs length-as-a-static-property. I think the best approach is to introduce a new function for the former.
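For comparison, a sketch of a variant that does not depend on how length behaves for unsized iterators. The name `maptoints_safe` is hypothetical. It preallocates only when the size trait says the length is known, keeps bounds checks on, and otherwise grows the output with `push!`.

```julia
# Hypothetical defensive variant of `maptoints`.
function maptoints_safe(f, xs)
    if Base.IteratorSize(xs) isa Union{Base.HasLength, Base.HasShape}
        ys = Vector{Int}(undef, length(xs))
        for (i, x) in enumerate(xs)
            ys[i] = f(x)  # bounds-checked on purpose (no @inbounds)
        end
        return ys
    else
        # Length unknown: single pass, grow as we go.
        ys = Int[]
        for x in xs
            push!(ys, f(x))
        end
        return ys
    end
end

maptoints_safe(x -> 2x, 1:3)                           # [2, 4, 6]
maptoints_safe(x -> 2x, skipmissing([1, missing, 3]))  # [2, 6]
```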

JeffBezanson · 2020-05-22T16:20:40Z

Yes, the idea is that there are some iterators you should not call length on (stateful ones), and I think a method error is the best way to convey that. Most of the time you call length before iterating, to know how many items there will be, so if length itself consumes the iterator it defeats the purpose. Sure, we can document that you need to check a trait first, but that's just not as clear as a method error. Subjectively, I guess length "feels" like a property you look up, and not something that can kick off an arbitrary process.

tkf · 2020-05-22T19:43:31Z

Maybe it's reasonable to have

nitems(x) = haslength(x) ? length(x) : count(_ -> true, x)

in Base? (The function name is somewhat random.)

It may be useful to have a common interface you can overload when you have a better implementation than count(_ -> true, x). For example, it's reasonable to have Base.SizeUnknown for Iterators.flatten(::Vector{<:Vector}) but also it's possible to implement a better length. In that case, you may want to redirect nitems to length. I'm not sure if this situation is common enough, though.
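A runnable sketch of the nitems idea. Both `nitems` and the local `haslength` helper are hypothetical names defined here for illustration; `haslength` is written out as my assumption of what such a check would test.

```julia
# Assumed meaning of `haslength`: the size trait promises a known length.
haslength(x) = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape}

# Use `length` when it is known, otherwise count by iterating.
nitems(x) = haslength(x) ? length(x) : count(_ -> true, x)

nitems(1:10)                              # 10, via length
nitems(skipmissing([1, 2, missing, 4]))   # 3, via counting
```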

martinholters · 2020-05-25T13:40:36Z

Ref. #35530 for sneaky bugs that can come up if the length changes unexpectedly.

tkf · 2020-05-27T01:51:55Z

Thinking about this more, I think it makes sense to have nitems in Base (or maybe in DataAPI.jl). There are some nice optimizations you can do with it:

nitems(x::Reverse) = nitems(x.itr)  # no need to go backward
nitems(x::Generator) = nitems(x.iter)  # no need to evaluate `f`
nitems(x::Accumulate) = nitems(x.itr)  # ditto
nitems(x::Flatten) = sum(nitems, x.it)  # inner iterators may have nice `nitems`

and so on.
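A sketch showing a few of these specializations in action. `nitems` is hypothetical, and the field names (`itr`, `iter`, `it`) are internal details of the Base wrapper types as used in the snippet above, so they are assumptions that may not hold across Julia versions.

```julia
# Generic fallback: known length if available, else count by iterating.
nitems(x) = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ?
    length(x) : count(_ -> true, x)

# Specializations that delegate to the wrapped iterator.
nitems(x::Iterators.Reverse) = nitems(x.itr)
nitems(x::Base.Generator) = nitems(x.iter)  # never calls the mapped function
nitems(x::Iterators.Flatten) = mapreduce(nitems, +, x.it; init = 0)

# The mapped function is not evaluated while counting.
calls = Ref(0)
noisy(x) = (calls[] += 1; 2x)
g = (noisy(x) for x in skipmissing([1, missing, 3]))

n_gen = nitems(g)                                    # 2, and calls[] stays 0
n_flat = nitems(Iterators.flatten([[1, 2], [3]]))    # 3
```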

andyferris · 2020-05-27T02:13:40Z

Yes, something along the lines of nitems seems fine to me - my preference would be having this in Base for all to share though.

Can we just use count? Ideally we'd want to allow currying in the future, too. Unfortunately that means we'd need to infer what the 1-argument form might mean using Function or Callable, like so:

1. count(predicate, iter) - eager evaluation
2. count(predicate::Callable) - curried form
3. count(iter) - eager evaluation equivalent to count(_ -> true, iter)
4. count() - curried form equivalent to count(_ -> true)

EDIT: What I like is that the "count" verb gives the connotation that it might literally iterate through and count the items, as opposed to length which would remain safe & fast.
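To illustrate what the curried forms would buy, here is a sketch using a separately named function so that existing `count` methods are untouched. `countwith` is a hypothetical name, not something in Base.

```julia
# Hypothetical curried counter: returns a function of the iterable.
countwith(pred) = itr -> count(pred, itr)

# "Count everything" is then a plain value that composes with ∘.
countall = countwith(_ -> true)

(countall ∘ skipmissing)([1, 2, missing, 4])   # 3
(countwith(iseven) ∘ skipmissing)([1, 2, missing, 4])  # 2
```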

tkf · 2020-05-27T02:38:11Z

I can see that count(itr) would be a very appealing API for count(_ -> true, itr). But, for the same reason as we discussed in the mergewith PR #34296, I prefer not to make the dispatch on Callable to change the semantics.

andyferris · 2020-05-27T03:14:16Z

Thanks, I couldn't quite recall the reasoning. @tkf wrote

> But I think the list of *with functions is not large

Haha. Do we need to rename count to countwith? But seriously, the duplicating of generic functions is annoying; it would be nice to have a pattern for higher-order functions with "default" function inputs so we didn't have to create twice as many generic functions...

tkf · 2020-05-27T04:00:41Z

> 3. count(iter) - eager evaluation equivalent to count(_ -> true, iter)

Actually, my comment was misleading as the reason why we can't use count(itr) as count(_ -> true, itr) is more mundane. We already have a definition for count([true, false, true]) that returns 2.
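The clash is easy to see side by side. This short sketch only uses existing Base behavior.

```julia
flags = [true, false, true]

# Existing one-argument meaning: how many elements are `true`.
n_true = count(flags)             # 2

# The proposed meaning would be "how many elements are there".
n_all = count(_ -> true, flags)   # 3
```

Since the two answers differ for the same input, count(itr) cannot take on the second meaning without breaking existing code.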

> > But I think the list of *with functions is not large

> Haha. Do we need to rename count to countwith?

Oops :) Though that comment was in a different context. It was rather a byproduct that it was possible to get the curried form mergewith(f) in #34296. The main point was that we need to distinguish merge(f, dicts...) and merge(dicts...). So, given merge(a, b, c), you don't know the semantics of the operation without knowing the type of a. I think @JeffBezanson's comment #34296 (comment) is a nice summary. But, if we are going to add curried version for all reducers, I'll need to take back my comment because then we are going to need countwith, sumwith, uniquewith, and so on...

Anyway, I think currying is rather orthogonal to the current issue because we can't break how count(itr) already works.

andyferris · 2020-05-27T04:47:58Z

True.

Damn - naming things is hard.

It seems that a lot of languages use length as the fast version and count for arbitrary iterators that might take O(N) time (sometimes size also makes an appearance). C# has fast length for most collections and count for any IEnumerable, similar to Java (and some popular Java libraries). Rust has std::iter::Iterator::count as the iterator version. Python has... sum(1 for i in itr)... Clojure has count for iterables, etc. Chapel and C++ have count (and count_if) methods a bit like our two-argument count. I didn't come across any language that follows the one-argument semantics of counting trues.

A quick trip to the thesaurus only left me with tally, which isn't great... I can't think of anything else at the moment.

> We already have a definition for count([true, false, true]) that returns 2.

I note that so does sum([true, false, true]); in fact the one-argument case seems entirely redundant with sum - except more restrictive?

tkf · 2020-05-27T05:11:13Z

> Damn - naming things is hard.

Totally agree...

Thanks for the survey. Yeah, it's a bit unfortunate that what we have is incompatible with other languages. Looking at this, countif sounds like a better option.

> in fact the one-argument case seems entirely redundant with sum - except more restrictive?

Two-argument case as well, I think.
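A quick check of the overlap, using only existing Base functions. For Boolean values and Boolean-returning predicates, the two give the same number.

```julia
flags = [true, false, true]
xs = 1:10

# One-argument forms agree on Boolean collections.
one_arg_same = count(flags) == sum(flags)            # both are 2

# Two-argument forms agree when the predicate returns Bool.
two_arg_same = count(iseven, xs) == sum(iseven, xs)  # both are 5

# `sum` is the more permissive of the two: it also accepts non-Boolean values.
total = sum([1, 0, 1])                               # 2
```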

andyferris · 2020-05-27T07:09:32Z

> Two-argument case as well, I think.

Yes - good point!

tkf · 2020-06-02T01:03:02Z

Another iterate-based method I frequently want is to get the last item of an iterable:

lastitem(xs) = foldl(right, xs)
right(_, x) = x

We can't use last since it's documented as O(1).

Like nitems, it can also have some optimizations:

lastitem(x::AbstractArray) = last(x)
lastitem(x::Reverse) = firstitem(x.itr)
lastitem(x::Filter) = firstitem(filter(x.flt, reverse(x.itr)))
lastitem(x::Generator) = x.f(lastitem(x.iter))
lastitem(x::Flatten) = lastitem(lastitem(x.it))

and so on. It'd also be better to have firstitem so that lastitem(::Reverse) and lastitem(::Filter) can be implemented efficiently.
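A runnable sketch of the fallback plus two of the specializations. `lastitem`, `firstitem` and `right` are the hypothetical names from this comment; defining `firstitem` via `first` and relying on the internal `itr` field of `Iterators.Reverse` are assumptions made for illustration.

```julia
# Fold that keeps only the most recent element.
right(_, x) = x

# Generic fallback: walk the whole iterable (errors on empty input).
lastitem(xs) = foldl(right, xs)

# Assumed counterpart: the generic `first` already works via iterate.
firstitem(xs) = first(xs)

# Specializations that avoid a full pass.
lastitem(xs::AbstractArray) = last(xs)
lastitem(xs::Iterators.Reverse) = firstitem(xs.itr)

lastitem(skipmissing([1, missing, 3, missing]))            # 3
lastitem([1, 2, 3])                                        # 3
lastitem(Iterators.reverse(skipmissing([7, missing, 9])))  # 7
```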

I started to wonder if it makes sense to consistently use item in the name when it has iterate-based fallback. It's somewhat random but it's good to be predictable.
