Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 5968 #1186

Closed
wants to merge 5 commits into from
Closed

Fix issue 5968 #1186

wants to merge 5 commits into from

Conversation

andralex
Copy link
Member

@andralex andralex commented Mar 4, 2013

Added a real groupBy facility. Making it single-pass was interesting.

/**
Returns $(D true) if $(D obj) is $(D null).
*/
bool isNull(T)(auto ref RefCounted!T obj)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function in module scope? Shouldn't it be put in the RefCounted implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to not hide a method that the wrapped type might have with the same name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra DDoc explaining why this function is written the way it is, as well as a recommendation to NOT use it UFCS-style would help. I suppose using it UFCS style will call T's member IsNull with higher priority... correct?

@quickfur
Copy link
Member

This range appears to be non-transient safe, as it saves .front in _model and assumes that it will not change upon calling popFront.

To be transient-safe, a forward range is required, and .save should be used in lieu of assigning .front to a temporary. In the case of an input range, transient-safety is impossible, so we can ignore that case.

@quickfur
Copy link
Member

Actually, wait, this range requires a forward range. If R is not a forward range, the ctor of the outer range will fail (it calls data.save where data is of type R).

@quickfur
Copy link
Member

ping

@andralex
Copy link
Member Author

andralex commented Oct 9, 2013

oops sorry didn't see your comments, will look at this

@andralex
Copy link
Member Author

ping

@monarchdodra
Copy link
Collaborator

This seems to duplicate the functionality of #1453 's chunkBy (or the other way around). Implementation aside, the only difference I see is having a unary pred vs a binary pred. Right?

IMO, for something that groups an arbitrary amount of elements together, it makes more sense to have a unary evaluation, than pair-wise comparison.

I mean:

goupBy!"toLower(a)"("sdqfkhUPIUmKSJHmgfue");
//vs
goupBy!"toLower(a) == toLower(b)"("sdqfkhUPIUmKSJHmgfue");

Although grouping after sorting is a common pattern, the range doesn't
have to be sorted or structured in a particular way.
*/
struct GroupBy(alias pred, R) if (isForwardRange!R)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current stance in phobos is to make the struct type private, and to only provide the adapter function that returns an auto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@monarchdodra
Copy link
Collaborator

Doesn't this implementation imply that doing a popFront on the GroupBy imply that the previous individual groups get invalidated? And that's even if the range is a forward range. I think both this and ChunkBy have the same flaw.

What's wrong with the approach taken by Splitter!Pred ? Simply find the length of the chunk, and return a slice/take. It's a bit less lazy, but it works with no surprises... ?

@andralex
Copy link
Member Author

This seems to duplicate the functionality of #1453 's chunkBy (or the other way around). Implementation aside, the only difference I see is having a unary pred vs a binary pred. Right?

Yah, it would be fair to say #1453 is the duplicate seeing as this predates it by 5 months. We'll choose the better one but it's a bit of a bummer to see there's duplication of work. Fortunately it's not frequent, but I wonder what to do to prevent such.

IMO, for something that groups an arbitrary amount of elements together, it makes more sense to have a unary evaluation, than pair-wise comparison.

I mean:

groupBy!"toLower(a)"("sdqfkhUPIUmKSJHmgfue");
//vs
groupBy!"toLower(a) == toLower(b)"("sdqfkhUPIUmKSJHmgfue");

Yah, I think unary predicates are nice as well. But the entire Phobos was designed with binary predicates, e.g. we use sort!"a.length < b.length"(range), not sort!"a.length"(range). So a unary-based algorithm would stand like a sore thumb.

Also, binary predicates may be more powerful than unary predicates. For example, to find sorted runs in a range one may write:

groupBy!"a <= b"(range);

I don't think that's expressible with unary predicates.

@andralex
Copy link
Member Author

Doesn't this implementation imply that doing a popFront on the GroupBy imply that the previous individual groups get invalidated? And that's even if the range is a forward range. I think both this and ChunkBy have the same flaw.

I'll look into this.

What's wrong with the approach taken by Splitter!Pred ? Simply find the length of the chunk, and return a slice/take. It's a bit less lazy, but it works with no surprises... ?

Laziness is essential here seeing as a group may span the entire range, or an arbitrarily long subset of it. Making a O(n) pass just to popFront would make the facility useless for many applications.

@andralex
Copy link
Member Author

Doesn't this implementation imply that doing a popFront on the GroupBy imply that the previous individual groups get invalidated? And that's even if the range is a forward range. I think both this and ChunkBy have the same flaw.

I don't think that's the case, or I'm not sure I get what the failure scenario is. Here's a preexisting unittest (after I changed it to take in the equal!equal hint). Did you have something else in mind?

unittest
{
    // Test groups saved out of order
    auto a = [ 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6, ];
    auto g1 = groupBy(a);
    auto g2 = g1.save;
    auto firstGroup1 = g1.front;
    auto firstGroup2 = firstGroup1.save;
    g1.popFront();

    auto witness2 = [ [1, 1, 1], [2, 2], [3, 3, 3], [4], [5, 5], [6], ];
    assert(equal!equal(g2, witness2));

    auto witness1 = [ [2, 2], [3, 3, 3], [4], [5, 5], [6], ];
    assert(equal!equal(g1, witness1));

    assert(firstGroup1.equal([1, 1, 1]));
    assert(firstGroup2.equal([1, 1, 1]));
}

@andralex andralex mentioned this pull request Dec 25, 2013
@monarchdodra
Copy link
Collaborator

Did you have something else in mind?

It works because slices are value semantic, but will fail with a reference semantic range, or a simple input range.

auto myGroupBy = myRefRange.groupBy();
auto myFirstGroup = myGroupBy.front;
myGroupBy.popFront();
writeln(myFristGroup.front);

Here, the myGroupBy.popFront() will result in popping myRefRange until the first group is finished. However, myFirstGroup also references the same myRefRange, and will produce wrong results. Whenever GroupBy returns a Group, that group must carry a saved copy of the traversed range.

This puts us in the awkard position (common to all functions that split a range), to operate exclusively on forward ranges (and not accept pure input), or to accept input ranges, but have a transient behavior (like stdio.byChunk) :/

@quickfur
Copy link
Member

A fundamental issue we have to deal with when working with range-splitting functions, is that because the result is a range of ranges, it forces you to either do duplicate work (unless your range has slicing), or make subranges transient. This applies not just to this function (or chunkBy), but to any range-splitting function.

Consider this. If the given range R doesn't have slicing, then the only way to compute the subranges is to iterate over each element and evaluate some kind of predicate that tells you where the subrange ends. The problem is, if you don't operate on a .save'd copy of R while iterating over the subrange (e.g. if R is an input range), then when you call .popFront on the outer range it will instantly invalidate the subrange. But if you do operate on a .save'd copy of R in the subrange, then the outer range's popFront will have to do duplicate work: since it doesn't know where the current subrange is (it's operating on the original R, not the .save'd copy), it has to re-iterate over the entire subrange in order to get to the next subrange. So if the caller is processing every element of every subrange, this means that we're effectively traversing R twice, even though we technically shouldn't have to.

@monarchdodra
Copy link
Collaborator

Consider this. If the given range R doesn't have slicing, then the only way to compute the subranges is to iterate over each element and evaluate some kind of predicate that tells you where the subrange ends.

You can count the elements, and return a take(range.save, slice); This requires saving before starting the iteration, but there is no duplicate work.

@quickfur
Copy link
Member

@monarchdodra Good point! Though you still have to pop through all the elements of the original range twice, if the caller wants to process every element. So there's still some overhead to pay for as opposed to directly iterating over the original range and detecting subrange boundaries in-situ instead of using a range adaptor.

@andralex
Copy link
Member Author

Did you have something else in mind?

It works because slices are value semantic, but will fail with a reference semantic range, or a simple input range.
...
This puts us in the awkard position (common to all functions that split a range), to operate exclusively on forward ranges (and not accept pure input), or to accept input ranges, but have a transient behavior (like stdio.byChunk) :/

Yah, so that's why the present code requires a forward range. My question is, do you see any correctness issue with it? (I think it saves wherever it needs.)

You can count the elements, and return a take(range.save, slice); This requires saving before starting the iteration, but there is no duplicate work.

Again, counting the elements in the group as the first step in starting a new group is quite a drag. First off, it does nothing in the way of addressing the input range issue, so it has zero advantage over the present code. Second, a group may span essentially the entire range and so we can't masquerade an O(n) operation as a O(1) step, even though that would be amortized later.

So the questions here are:

  1. Is the present forward-range-requiring code correct?
  2. Do we care about supporting input ranges with groupBy?
  3. If so, what form of support do we need to add?

@andralex
Copy link
Member Author

A fundamental issue we have to deal with when working with range-splitting functions, is that because the result is a range of ranges, it forces you to either do duplicate work (unless your range has slicing), or make subranges transient. This applies not just to this function (or chunkBy), but to any range-splitting function.

Where is the duplicate work here?

@monarchdodra
Copy link
Collaborator

Where is the duplicate work here?

I need to review what your code does in more details. Actually, I think I had misunderstood what it does. I think we may want to study it real closely: If it works, it may be something we want to push out to our other RoR's.

In any case, I think you need to call save as you return front, or a usecase as trivial as assert(equal(myGroups.front, myGroups.front)); will fail... (for reference ranges) :/

Where is the duplicate work here?

Also, keep in mind that with your design, save now allocates... I think to remember you saying you wanted algorithm to not allocate as much as possible.

@andralex
Copy link
Member Author

Where is the duplicate work here?

I need to review what your code does in more details. Actually, I think I had misunderstood what it does. I think we may want to study it real closely: If it works, it may be something we want to push out to our other RoR's.

Possibly. It's not trivial but effective: ranges share information about a "serial number" and its evolution to eliminate duplicate O(n) advances.

In any case, I think you need to call save as you return front, or a usecase as trivial as assert(equal(myGroups.front, myGroups.front)); will fail... (for reference ranges) :/

That's is a problem on the caller side. The call equal(r, r) will fail with many reference ranges.

Where is the duplicate work here?

Also, keep in mind that with your design, save now allocates... I think to remember you saying you wanted algorithm to not allocate as much as possible.

The better goal is to not produce garbage I think.

Merge now worry about input ranges later? We can always expand capabilities without breaking compatibility.

@monarchdodra
Copy link
Collaborator

That's is a problem on the caller side. The call equal(r, r) will fail with many reference ranges.

Well... equal(r, r), yes.

I really meant equal(range.front, range.front), I'm not sure it should. Well... actually, I guess it makes sense that it does.

In that case, I think it looks good in the broad lines. I'll give it a fuller review when I have the time.

@andralex
Copy link
Member Author

In that case, I think it looks good in the broad lines. I'll give it a fuller review when I have the time.

Thanks! Quick, before it bitrots :o)

destroy(this);
_refCounted._store = null;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is issue 9636 (https://d.puremagic.com/issues/show_bug.cgi?id=9636), which is "controversial", as it changes the (already existing) semantic of what Nullable!T n; n = null;, if T happens to be a type for which = null; already means something. EG: A Nullable!(int*)...

If you still believe in this change, I'd recommend you at least give it its own pull.

@monarchdodra
Copy link
Collaborator

Thanks! Quick, before it bitrots :o)

As quick as I can. But this pull should have no conflict, and is now on my list of pull to review (and to pull), so there is little to no chances it will rot.

@andralex
Copy link
Member Author

@monarchdodra ok I undid all changes to RefCounted.

@property bool empty()
{
auto result = _thisGroup.empty
|| !binaryFun!pred(_thisGroup.front, _model);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You compare with _model here, but you never update _model: You just initialize it once. This would be fine for a pred with equivalence, but you sold byGroup as being able to return, say "Groups of sorted items with a <= b". However, with the current scheme (I think, not actually tested), you'll only return groups where every item is bigger than the first, not the previous.

Could you add a sorted unittest? Say:

[1, 3, 2, 4, 5, 1].byGroup!"a < b"().equal([[1, 3], [2, 4, 5], [1]]);

I think this fails with the current scheme...

@andralex
Copy link
Member Author

This works as intended, but raises an interesting question - please don't pull yet. Will ask in the forum.

@monarchdodra
Copy link
Collaborator

Hum... Your "GroupBy" holds both a "Group", and a "SharedAllGroups". This gives it an ambiguous mix of both value and reference semantics. EG:

void main()
{
    auto a = groupBy([1, 3, 2, 2, 3, 3, 4, 4]);
    auto b = a;
    b.popFront();
    b.popFront();
    writeln(a);
}
[[1, 1], [3, 3], [4, 4]]

That's just strange. That said, why not just store your private Group _front; inside the SharedInput function? Not only would it make GroupBy behave as a true reference semantic range, it would (should) even make everything simpler: Just implement the functions on SharedInput as value semantic.

Speaking of my above "class" comment, it might even be simplest to make all of GroupBy as a single final class. The class instance would "be" the "SharedData", and each Group would simply have a reference to this main GroupBy instance. You'd still have to keep the id member to know which Group is the "latest". All in all, it would seem like an interesting design. I think.

I also want to point out that this range would be one the first reference semantic range in all of phobos! I think this is a good initiative.

@andralex
Copy link
Member Author

I get [[1], [2, 2], [3, 3], [4, 4]] which is indeed odd. I'll investigate. Plopping front inside the refcounted thingie is an interesting idea.

@monarchdodra
Copy link
Collaborator

I get [[1], [2, 2], [3, 3], [4, 4]] which is indeed odd.

That is odd, since I duped your branch :/ well, no matter.

void popFront()
{
// Walk the current group through its end
for (;; _front.popFront)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what this is trying to do, but if R is a "value range" (99% of the time), then this is duplicate work, since _front will never actually be already empty.

@DmitryOlshansky
Copy link
Member

@andralex - status on this one?

@mihails-strasuns
Copy link

Will ask in the forum.

For other readers: relevant thread is http://forum.dlang.org/post/l9ooqk$1596$1@digitalmars.com

I think this should be closed in favor of #1453 and any missing functionality added there (as it has a more "modern" implementation). @andralex any objections?

@@ -3726,6 +3726,192 @@ unittest
}
}

/**
Given a forward range $(D r), returns a range of ranges that group adjacent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giten a forward range $(D r), this returns a range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants