Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i, := and Cartesian product (another allow.cartesian case) #800

Closed
matthieugomez opened this issue Sep 4, 2014 · 6 comments
Closed

i, := and Cartesian product (another allow.cartesian case) #800

matthieugomez opened this issue Sep 4, 2014 · 6 comments
Assignees
Labels
Milestone

Comments

@matthieugomez
Copy link
Contributor

DT=data.table(
  id = c(1, 1),
  date = c(1992, 1991),
  value = c(4.1, 4.5),
  key = "id"
)

DT[DT, a := 1]
"Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 4 rows; more than 2 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice."
DT[DT, a := 1, allow.cartesian = TRUE]

does not throw any error even though it does not do a cartesian product.

Update: Also, I'm not clear on what happens when multiple rows in i corresponds to a row in X, as in

DT1 <- unique(DT)
DT1[DT, a := i.value]

It seems to match to the last row in i in this example. Is it documented somewhere?

@eantonya
Copy link
Contributor

eantonya commented Sep 4, 2014

This is because DT[DT] has 4 rows, but this specific error is pretty annoying and I bump against it all the time. The allow.cartesian error should not be there when doing an assignment, as the result is never larger than DT. Just a warning should be enough.

Changing the label to feature request.

@matthieugomez
Copy link
Contributor Author

Great. Could you answer my second point too (you may not have seen it as I added it later on)

@eantonya
Copy link
Contributor

eantonya commented Sep 4, 2014

I have no idea what's going on there :)

My best guess is that since you try to assign to a single line the value c(1,2), it first assigns a=1, then loops and assigns a=2, so you're left with the second value at the end.

@arunsrinivasan
Copy link
Member

One way to comprehend what's going on is to realise that, even for x[i] where i is a data.table, under-the-hood, the matching row numbers of x is first obtained using binary search. And then a subset using those row numbers is returned.

So when you do:

DT1 = unique(DT)
DT1[DT, a := 1L]

internally, it first computes the matching rows in x for i, which can also be seen using:

DT1[DT, which=TRUE]
# [1] 1 1

That would transform your earlier code to:

DT1[c(1L,1L), a := 1L]

which is quite straightforward to understand as to what's happening, and is correct behaviour. Take for example:

x <- 1:5
x[c(1L,1L)] <- 10L

If you'd instead done:

DT1[DT, a := c(2,4)]
#    id date value a
# 1:  1 1992   4.1 4

and is correct once again.

However, on your first point, even though i has duplicate values (and therefore the error message is right), the fact that the end result will never be greater than nrow(x) warrants for that error message to be a bug.

I'll add this to the other allow.cartesian issues, set for fixing in the next release.

@arunsrinivasan arunsrinivasan added this to the 2.0.2 milestone Sep 6, 2014
@arunsrinivasan arunsrinivasan changed the title i, := and Cartesian product i, := and Cartesian product (another allow.cartesian case) Sep 6, 2014
@matthieugomez
Copy link
Contributor Author

Thanks. Basically you're saying that it is due to the fact that, in base R,

 l=seq(1,10)
 l(c(1,1))=c(3,4)

modifies the first element by 4, not 3, right? I did not know about that. Thanks!

@rsaporta
Copy link
Contributor

The error is coming from the call to vecseq at line 520 in data.table.r
Namely:

    if (is.data.table(i)) {
        .
        .
        if (mult=="all") {
            if (!byjoin) {
                irows = if (allLen1) f__ else vecseq(f__,len__,if(allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),nrow(i))))
                .
                .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants