
Add std.range.enumerate #1866

Merged 1 commit into dlang:master on Sep 20, 2014

Conversation

JakobOvrum
Member

http://d.puremagic.com/issues/show_bug.cgi?id=5550

This could also be implemented in the language (though potentially ambiguous with automatic foreach unpacking), but here's a library solution for review.

Ping @bearophile

@JakobOvrum
Member Author

Added the start parameter per bearophile's suggestion in the issue report.

@Kozzi11
Contributor

Kozzi11 commented Jan 17, 2014

Is it possible to add opApply for faster iteration?

@monarchdodra
Collaborator

Is it possible to add opApply for faster iteration?

I believe that unless you can really justify it, ranges with opApply are slower because of an inlining issue: the compiler creates a delegate out of the loop body and passes it to opApply, which in turn makes a run-time call to said delegate, throwing all inlining out the window.

...is what I remember from last time I touched opApply.
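A minimal sketch of the two iteration styles being compared (hypothetical Counter type, not from this PR); note that when both are defined, foreach prefers opApply:

```d
import std.stdio : writeln;

// Hypothetical type illustrating the two iteration styles discussed above.
struct Counter
{
    int n;

    // Range primitives: ordinary member functions that the compiler
    // can inline into the foreach loop.
    @property bool empty() const { return n <= 0; }
    @property int front() const { return n; }
    void popFront() { --n; }

    // opApply: foreach wraps the loop body in a delegate and opApply
    // calls it indirectly, which tends to defeat inlining.
    int opApply(scope int delegate(int) dg)
    {
        for (int i = n; i > 0; --i)
            if (auto result = dg(i))
                return result;
        return 0;
    }
}

void main()
{
    foreach (x; Counter(3)) // uses opApply, since it takes precedence
        writeln(x);
}
```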

@Kozzi11
Contributor

Kozzi11 commented Jan 17, 2014

Wow, I always thought that opApply would be faster. Thanks.

@Kozzi11
Contributor

Kozzi11 commented Jan 17, 2014

dlang.org/phobos/std_range.html#.opApply

@JakobOvrum
Member Author

dlang.org/phobos/std_range.html#.opApply

That's for the InputRange interface, i.e. its range primitives are virtual. Most range types, including the one returned from enumerate, are implemented using structs, which have no virtual functions.
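A small sketch of that distinction, using std.range's inputRangeObject wrapper (simplified, not from this PR):

```d
import std.range : iota, inputRangeObject, InputRange;

void main()
{
    // A struct range: empty/front/popFront are ordinary member functions.
    auto structRange = iota(3);

    // The class-based wrapper: the same primitives become virtual calls,
    // which is the case that InputRange's opApply documentation covers.
    InputRange!int classRange = inputRangeObject(iota(3));

    foreach (x; classRange)
        assert(x >= 0 && x < 3);
}
```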

@Kozzi11
Contributor

Kozzi11 commented Jan 18, 2014

OK. Thanks again for the explanation. By the way, I tried both and it seems there is no difference between them.

static struct Result
{
private Range _range;
private size_t index;
Member

Minor nitpick: probably better to explicitly write private size_t index = 0; to make the code clearer, even though .init is 0.

Member Author

It used to have that - but now it's explicitly initialized unconditionally at the bottom of enumerate. The default is never used.

Collaborator

Nitpick, but you have _range and index. I understand why you did it like that, but I think you should use a consistent naming scheme. Arguably, using range is perfectly fine, as the shadowing actually protects you from calling the wrong "range".

Member Author

Oh right, shadowing here is legal (makes sense). Thanks, I changed it.

@quickfur
Member

Other than those minor nitpicks, LGTM.

ElemType front() @property
{
assert(!_range.empty);
return typeof(return)(index, _range.front);
Contributor

This construction seems not so clear. Wouldn't it be better to just use:

auto front() @property
{
    assert(!_range.empty);
    return tuple(index, _range.front);
}

Member Author

Tuple!(size_t, ElementType!Range) is not implicitly convertible to ElemType. Tuple.opAssign can handle this kind of conversion, but it doesn't work when ElementType!Range is non-mutable.

edit:

Note how ElemType is used to introduce the named fields.
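A simplified sketch (with char standing in for ElementType!Range) of how constructing the named-field Tuple type differs from a plain tuple(...):

```d
import std.typecons : Tuple, tuple;

void main()
{
    alias ElemType = Tuple!(size_t, "index", char, "value");

    // tuple(0, 'a') would produce an unnamed Tuple!(int, char);
    // constructing ElemType directly is what introduces the
    // .index and .value field names.
    auto e = ElemType(0, 'a');
    assert(e.index == 0);
    assert(e.value == 'a');
}
```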

Contributor

And what about:

return ElemType(index, _range.front);

Member Author

DRY

edit:

It's a fairly common Phobos idiom:

$ grep -r "typeof(return)(" std
std/algorithm.d:    return typeof(return)(r);
std/algorithm.d:    return typeof(return)(needle);
std/algorithm.d:    return typeof(return)(range, sentinel, openRight);
std/algorithm.d:    return typeof(return)(range, openRight);
std/algorithm.d:    return typeof(return)(r);
std/algorithm.d:    return typeof(return)(index);
std/algorithm.d:    return typeof(return)(rs);
std/algorithm.d:    return typeof(return)(ranges);
std/algorithm.d:    return typeof(return)(r1, r2);
std/algorithm.d:    return typeof(return)(r1, r2);
std/algorithm.d:    return typeof(return)(ror);
std/base64.d:        return typeof(return)(range);
std/base64.d:        return typeof(return)(range);
std/complex.d:        return typeof(return)(ab * std.math.cos(ar), ab * std.math.sin(ar));
std/complex.d:    return typeof(return)(cs.im * csh.re, cs.re * csh.im);
std/complex.d:    return typeof(return)(cs.re * csh.re, - cs.im * csh.im);
std/complex.d:        c = typeof(return)(0, 0);
std/complex.d:            c = typeof(return)(w, z_im / (w + w));
std/complex.d:            c = typeof(return)(z_im / (w + w), w);
std/net/curl.d:            return typeof(return)(size_t.max, cast(Char[])null);
std/net/curl.d:    return typeof(return)(startLen-data.length, res);
std/numeric.d:    return typeof(return)(r1, r2, penalty);
std/range.d:    return typeof(return)(state);
std/range.d:    return typeof(return)(tuple(args));
std/range.d:    return typeof(return)(rr);
std/range.d:    return typeof(return)(rr, n);
std/range.d:    return typeof(return)(source, indices);
std/range.d:    return typeof(return)(source, chunkSize);
std/range.d:            return typeof(return)(index, _range.front);
std/range.d:                    return typeof(return)(index + _range.length - 1, _range.back);
std/range.d:                return typeof(return)(index + i, _range[i]);
std/stdio.d:        return typeof(return)(f, format);
std/stdio.d:            return typeof(return)(this, format);
std/typecons.d:    return typeof(return)(args);

Contributor

So then it is ok :-)

Member

Explicitly using ElemType is IMO better looking. If you're really worried about DRY you could use return type inference. I assume most of the uses in phobos do not have a simple alias for the type already available.

Member Author

I don't like relying on return type inference when the return type is trivial, as I think it helps readability to know the return type without looking at the body of the function. Also, while it may not make a big difference here (though it's not impossible that some time in the future we may decide that we wanted a public Enumerate type, i.e. the old style), I think it's good practice in general, as it improves documentation and DI generation too. Putting the trivial return type explicitly and then using typeof(return) in the body seems like the overall best style to me.

Member

I disagree. If you feel strongly about the return type being present (even though the user will have to look up what ElemType is anyway) I think using the alias in both places is best. typeof(return) does not give any advantage over just using the type.

Member Author

It has all the advantages of DRY.

@JakobOvrum
Member Author

Added assert(!_range.empty); to back and popBack.

@JakobOvrum
Member Author

The tests are now pure @safe nothrow.

auto enumerate(Range)(Range range, size_t start = 0)
if (isInputRange!Range)
{
alias ElemType = Tuple!(size_t, "index", ElementType!Range, "value");
Collaborator

Shouldn't this be inside the result struct itself?

Member Author

There are no correctness issues with it, but stylistically it could be argued either way. It's probably best inside the struct, I'll move it.

edit:

Done.

Member Author

Oops, forgot to make it private when I moved it in. Fixed (changed to use private: because of the 80 column limit).

@monarchdodra
Collaborator

I discovered and filed this:
https://d.puremagic.com/issues/show_bug.cgi?id=11955

In short, if you explicitly request size_t in the loop, it might fail. E.g.:

foreach (size_t lineNum, line; stdin.byLine().enumerate(1)) //Note "size_t"
    stdout.writefln("line #%s: %s", lineNum, line);

This fails on my win_32 (2.064.2). I haven't tested any other builds.

{
return Result(range[i .. j], index + i);
}
}
Collaborator

If Range is infinite, this will fail, as the type of range[i .. j] will not match Range's. Because the function is not a template, it will straight up create a compile error for something like foreach (i, v; repeat(1).enumerate()) (so I guess that means infinite ranges aren't tested...).

If you want to support them, then it goes something like this:

        static if (hasSlicing!Range)
        {
            static if (hasLength!Range) //Finite RA
            {
                Result opSlice(size_t i, size_t j)
                {
                    return Result(range[i .. j], index + i);
                }
            }
            else //Infinite with slicing
            {
                static struct DollarToken {}
                enum opDollar = DollarToken.init;

                //Slice to end with matching type.
                Result opSlice(size_t i, DollarToken)
                {
                    return Result(range[i .. $], index + i);
                }

                //Finite slice.
                //The type returned needs to be Take!Result to satisfy the hasSlicing primitive.
                auto opSlice(size_t i, size_t j)
                {
                    return this[i .. $].take(j - i);
                }
            }
        }

Do NOT simply have opSlice return auto. It is important for infinite ranges that myEnumerate[1 .. 2] return a Take!(Result!Range) type, and not a Result!(Take!Range) type (which is what you'd get with the "dumb" implementation).
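A quick sketch of that convention using std.range.repeat, which is infinite and already slices this way (assuming its documented slicing behaviour):

```d
import std.range : repeat;

void main()
{
    auto r = repeat(1); // infinite range

    // Slicing with a finite upper bound yields a take-style result
    // rather than a value of the infinite range's own type:
    auto s = r[0 .. 5];
    assert(s.length == 5);

    // Slicing to $ preserves the original (infinite) type:
    auto t = r[2 .. $];
    static assert(is(typeof(t) == typeof(r)));
}
```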

Member Author

Thanks, I'd never used infinite range slicing before so this is useful info. This has been fixed in the latest push. One nitpick: I chose to use takeExactly instead of take - of course, they're equivalent when used on infinite ranges, but take is conceptually more complicated than takeExactly from a readability standpoint.

@JakobOvrum
Member Author

Now supports infinite ranges properly and has a documented and enforced overflow policy.

The tests would have caught the infinite range slice problem if dummyRanges covered slicing and such... I'm trying to find the sweet spot where the tests are comprehensive but not unwieldy. dummyRanges helps with that, so maybe we should give it some attention to bring it up to date (I guess I just volunteered).

At any rate, please take another look (docs in particular); feedback on the overflow stuff would be much appreciated.

P.S.
Would you (open question) prefer that I make these incremental changes as separate commits instead of amending every time? Which would be easier to review? Those separate commits could of course be squashed or otherwise fixed up before pulling.

@JakobOvrum
Member Author

One approach that the overflow policy does not incorporate is the choice of using a BigInt index type, which I can imagine has serious uses. I think we can safely defer that to a later date when we have an isIntegerLike or whatever ends up happening.

@JakobOvrum
Member Author

Improved wording in docs.

@dnadlinger
Member

I wonder whether it would be nicer to actually merge the counter into the tuple for auto-expanding ranges, so that you could do foreach(i, a, b; zip(a, b).enumerate). Might not be worth the additional complications, though…

@JakobOvrum
Member Author

I wonder whether it would be nicer to actually merge the counter into the tuple for auto-expanding ranges, so that you could do foreach(i, a, b; zip(a, b).enumerate). Might not be worth the additional complications, though…

It's fairly easy to implement, and either way is pretty much equally composable, so I guess it's just a matter of choosing a default. The current way is more regular, which I guess can mean fewer special cases in generic code, and it also means one less thing the user has to know. Maybe we should have some kind of tuple flattening function instead? e.g. a.zip(b).enumerate().flatten() or a.zip(b).enumerate().map!flatten() etc.

@monarchdodra
Collaborator

One approach that the overflow policy does not incorporate is the choice of using a BigInt index type, which I can imagine has serious uses.

I'm not sure I can see any immediate uses for this actually.

size_t on 64 bits is already really freaking huge. Furthermore, given that BigInt is copy-on-write, operator++ is actually quite an expensive operation.

If anybody really needs such a huge number of counted iterations, I'd be willing to wager the algorithm is more complicated than a simple "foreach" over a range.

Worst case scenario, I see enumerate as a convenience. That doesn't mean users can't still use for loops...

@monarchdodra
Collaborator

Would you (open question) prefer that I make these incremental changes as separate commits instead of amending every time? Which would be easier to review? Those separate commits could of course be squashed or otherwise fixed up before pulling.

I find amending tends to be better, but it doesn't make much of a difference. What I've come to learn, though, is that I much prefer reviewing 10 trivial pulls to a single one that does a lot of things.

@JakobOvrum
Member Author

I'm not sure I can see any immediate uses for this actually.

size_t on 64 bits is already really freaking huge. Furthermore, given that BigInt is copy-on-write, operator++ is actually quite an expensive operation.

If anybody really needs such a huge amount of counted iterations, I'd be willing to wager the algorithm is more complicated than a simple "foreach" over a range.

enumerate is not just for foreach, but regardless, part of the beauty of ranges is that even complicated algorithms can be distilled into simple loops or no loops at all.

Regarding BigInt, I was considering cases like long-running programs/threads where the input range is blocking, waiting for some kind of input. For example, it could be a server that reads application-level packets from a socket and does some processing on them, attaching an ID to each with enumerate. In this particular case, though, it's common practice to just use wrap-around (so it's nice that enumerate now defines it).

@bearophile

@klickverbot:

foreach(i, a, b; zip(a, b).enumerate)

The auto expansion (unpacking) of tuples in foreach (which I think was introduced by Kenji) is a broken non-feature that should be killed as soon as possible (warned against, deprecated, and then removed) and replaced by a better, more general tuple unpacking feature. I keep saying similar things in some Bugzilla entries.

You can see a problem here:

import std.stdio, std.algorithm, std.range;
void main() {
    foreach (x, y; zip([10], [20]))
        writeln(x, " ", y);
    foreach (x, y; zip([10], [20]).array)
        writeln(x, " ", y);
}

Output:

10 20
0 Tuple!(int, int)(10, 20)

@dnadlinger
Member

@bearophile: I agree that we should have general tuple unpacking in the language, and your example indeed shows undesirable behavior.

@monarchdodra
Collaborator

I can see value in having enumerate so that a range may also "attach" an index to its members. However, I'm worried about two things: this pull seems mainly aimed at enabling indexed foreach loops. Why isn't this a language solution? Is it because it requires deprecating tuple unpacking? And if so, wouldn't users of this just have their code broken if/when we deprecate tuple unpacking in foreach loops?

@JakobOvrum
Member Author

I can see value in having enumerate so that a range may also "attach" an index to its members. However, I'm worried about two things: this pull seems mainly aimed at enabling indexed foreach loops. Why isn't this a language solution? Is it because it requires deprecating tuple unpacking? And if so, wouldn't users of this just have their code broken if/when we deprecate tuple unpacking in foreach loops?

Anything that can be achieved with foreach can be achieved functionally with copy and such, and enumerate facilitates this with the named fields. While I prefer that style in my own code, I couldn't think of a good example for the documentation.

As for a library vs. language solution for the range foreach index: I don't feel strongly about it either way, but I don't think tuple unpacking is necessarily broken. What I do think is that we should solve this problem fairly soon, as the absence of a solution keeps surprising people.

One thing to keep in mind is, of course, that enumerate works on ranges without length as well as infinite ranges, with defined behaviour detailed in the documentation. With the current forms of foreach, I don't think it's possible for the index to overflow. Also, the start parameter can help readability in some code, as shown in the example. The equivalent done manually with a + expression inside the loop body wouldn't benefit from the start + length overflow check, either.

@monarchdodra
Collaborator

Don't know if you noticed, but you are failing the release unittests. It's because of http://d.puremagic.com/issues/show_bug.cgi?id=10939

@JakobOvrum
Member Author

Don't know if you noticed, but you are failing the release unittests. It's because of http://d.puremagic.com/issues/show_bug.cgi?id=10939

I had no idea, thanks. I'm short on time in the next few days but I'll fix it after that; this PR was never in a hurry anyway.

@JakobOvrum
Member Author

When the range does not provide length, overflow is defined and expected (see docs). It's a useful feature for infinite ranges and such.

@JakobOvrum
Member Author

@WalterBright, can I somehow use core.checkedint to check whether start + length overflows, when start can be an integer of any size and signedness, and length is size_t? Not sure what to do with the possibly mixed signedness.

@bearophile

@JakobOvrum:

Not sure what to do with the possibly mixed signedness.

This is a general question. A core.checkedint function like muls() accepts two ints or two longs. What if you have a uint and an int? An example:

void main() {
    import std.stdio, core.checkedint;
    int x = -1;
    uint y = 3_000_000_000;
    writeln(x, " ", y);
    writeln(x * y);
    bool overflow = false;
    immutable r1 = muls(x, y, overflow);
    writeln(r1, " ", overflow);
    overflow = false;
    immutable r2 = mulu(x, y, overflow);
    writeln(r2, " ", overflow);
}

It outputs:

-1 3000000000
1294967296
1294967296 false
1294967296 true

Here the overflow boolean from muls() is false, but the result is wrong (the true result is -3_000_000_000, which is representable by neither int nor uint).

@DmitryOlshansky
Member

So is this blocked on the overflow-check decision?

@DmitryOlshansky
Member

In this particular case, if at least one operand is signed, then we need to convert (and test) the second one with to!(Signed!T)(x). Then there are 2 possible overflows: during the addition, and during the conversion of length to signed.
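A rough sketch of that approach as a hypothetical helper (not the PR's actual code), using core.checkedint's adds/addu:

```d
import core.checkedint : adds, addu;
import std.traits : isIntegral, isSigned;

// Hypothetical helper: would start + length overflow, where start may be
// signed or unsigned and length is size_t?
bool addWouldOverflow(T)(T start, size_t length)
    if (isIntegral!T)
{
    bool overflow = false;
    static if (isSigned!T)
    {
        // First overflow point: converting length to a signed type.
        if (length > long.max)
            return true;
        // Second overflow point: the addition itself.
        adds(cast(long) start, cast(long) length, overflow);
    }
    else
    {
        addu(cast(ulong) start, cast(ulong) length, overflow);
    }
    return overflow;
}

void main()
{
    assert(!addWouldOverflow(0, 5));
    assert(addWouldOverflow(long.max, 1));
    assert(addWouldOverflow(ulong.max, 1));
}
```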

@JakobOvrum
Member Author

Then there are 2 possible overflows: during the addition, and during the conversion of length to signed.

Thanks, I think I can work with that. I'm travelling at the moment and won't be settled for another few weeks, so I probably won't be able to work on D until then.

Fixes issue 5550
@JakobOvrum
Member Author

Alright, overflow checking now works. Thanks @DmitryOlshansky.

The two points of interest are the in-contract that does the overflow checking, and the last unittest, which tests overflow. Coverage shows all paths are tested (except the assert(false) paths).

If anyone wants to test the last unittest locally, as it's not tested by the auto tester (see the linked bug), remember to change version(none) to version(all) .

Please review.

@mihails-strasuns

Overall it looks solid, but I don't have any good knowledge of the checkedint stuff.

Is the versioned-out test failing because the auto-tester builds unittests in release mode?

@JakobOvrum
Member Author

Is the versioned-out test failing because the auto-tester builds unittests in release mode?

I think so; asserts are enabled but in-contracts are removed, so the asserts testing the functionality of the in-contract fail.

@DmitryOlshansky
Member

This seems to be pointlessly sitting there waiting for more reviewers.
Unless some objection comes, I'm going to merge in a day or two.

@mihails-strasuns

The only controversial issue is not directly related to this PR, so there is no reason not to merge it. I'll go ahead.

@mihails-strasuns

Auto-merge toggled on

mihails-strasuns pushed a commit that referenced this pull request Sep 20, 2014
@mihails-strasuns mihails-strasuns merged commit 2ad0698 into dlang:master Sep 20, 2014
@DmitryOlshansky
Member

Going to mark new things as changelog_2.067 for now, though it's apparent that in reality this will end up in 2.068.

@mihails-strasuns

Well, it is quite possible that 2.067 will be re-branched from master when Andrew is able to return to this. No release activity was happening for 2.067 anyway.
