Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement push_back and push_front for DataFrame. #1099

Merged
merged 7 commits into from
Aug 4, 2020
Merged

Implement push_back and push_front for DataFrame. #1099

merged 7 commits into from
Aug 4, 2020

Conversation

waltersom
Copy link
Contributor

This explicitly defines push_back and push_back methods for DataFrame
which copies row.names and class attributes from the original object.

This fixes #1094 where these operations converted the object to a list.

Tests for this behavior are also implemented.

Checklist

  • Code compiles correctly
  • R CMD check still passes all tests
  • Prefereably, new tests were added which fail without the change
  • Document the changes by file in ChangeLog

@waltersom
Copy link
Contributor Author

This does not enforce that the vectors are the same length (existing package doesn't either).

@waltersom
Copy link
Contributor Author

The new version uses set__() in DataFrame.h, which has the huge benefit of padding/replicating vectors of different lengths, making names, and so forth.

@waltersom
Copy link
Contributor Author

I still feel this is a bit inelegant as it only solves the problem for data frames, but I guess it is also class-specific how attributes should change when they are combined, so it might be better that any class that has a problem implements the appropriate fix like this. I'm not sure if any other classes suffer from this problem.

@eddelbuettel
Copy link
Member

How does anybody else feel about changing the names of these verbs? push_back et al just invoked appending to a single vector. Maybe append_column and prepend_column (both clumsy) or add_col_at_back and add_col_at_front? Maybe with column instead of col ?

@waltersom
Copy link
Contributor Author

push_* does sound a bit clumsy, but I am a bit concerned that it is possible to combine two data frames this way (or add a list to a data frame, I guess), and not just add a column. Would it work with just append and prepend?

@kevinushey
Copy link
Contributor

My thoughts:

  1. We should keep push_back() / pop_back(), to conform with the existing Vector interface.
  2. If we feel it's worth it, we could add append_column() / prepend_column() as aliases for those functions. I don't have strong feelings here though.

I still feel this is a bit inelegant as it only solves the problem for data frames, but I guess it is also class-specific how attributes should change when they are combined, so it might be better that any class that has a problem implements the appropriate fix like this. I'm not sure if any other classes suffer from this problem.

I agree it'd be nice to fix this in general, but it seems hard to find the right solution that would work "in general" so it's okay to keep things scoped more narrowly here for now.

@eddelbuettel
Copy link
Member

Happy to be majority-vote overruled but I still feel quite strongly that push_back() is not a good idea. It (to me) has clear connotations of extending a single vector ("verticallly" if you wish) whereas the PR uses to expand a list holding such vectors by enlarging a `data.frame ("horizontally"). Nothing good can come from mixing terminology that way.

@kevinushey
Copy link
Contributor

My main concern is that DataFrame.push_back() already does exist (by virtue of being available in the base class) and so I don't think we should change the pre-existing semantics there.

If this were a clean-room rebuild of the class, I would definitely be on board though.

@eddelbuettel
Copy link
Member

I agreee with that point @kevinushey. That is indeed more important. Having the aliases is then a good compromise.

@eddelbuettel
Copy link
Member

Thanks for your patience. Rcpp 1.0.5 is now on CRAN, I am currently looking into #1098 and will turn to this afterwards. Likely in a day or two.

This explicitly defines push_back and push_back methods for DataFrame
which copies row.names and class attributes from the original object.

This fixes #1094 where these operations converted the object to a list.

Tests for this behavior are also implemented.
Make use of the fact set__() already does a conversion, and just use
that. This has the benefit that it works with push_back(DataFrame), as
well as push_back(Vector), and also handles adding vectors of different
lengths properly.
@eddelbuettel
Copy link
Member

I rebased this relative to the updated main branch and pushed the update. I have a local (not-yet committed) change adding ChangeLog and setting this as 1.0.5.2 and am now running reverse depends. We will know more by tomorrow this time.

@eddelbuettel
Copy link
Member

Hi @waltersom and apologies for taking a moment---but I first wanted to get 1.0.5 out of the door, and then test the PR by @kevinushey which came in just before yours.

I have now done a full run, and in this case the full test paid us back. Package lidR and metacart fail under your PR. Both show a message of 'Could not convert using R function: as.data.frame.'. Could you take another look at the PR?

Many thanks!

@waltersom
Copy link
Contributor Author

No worries about the time, it's good to get it right, and there's no great rush.

I'll have a look at why they might get that error.

@waltersom
Copy link
Contributor Author

Ok, I have some minimum examples which exhibit these problems; f is representative of the problem in lidR, and g from metacart.

library(Rcpp)

f <- cppFunction('DataFrame d(){
DataFrame d = DataFrame::create();

NumericVector v1(5);
NumericVector v2(5);
DataFrame d1 = DataFrame::create(_["a"]=v1, _["b"] = v2);
d.push_back(v1);
SETLENGTH(wrap(v1), 2);
d.push_back(v2);
return d;
}')
f()
#> Error in f(): Could not convert using R function: as.data.frame.


g <- cppFunction('DataFrame d(){
DataFrame d = DataFrame::create();

NumericVector v1(5);
NumericVector v2(6);
d.push_back(v1);
d.push_back(v2);
return d;
}')
g()
#> Error in g(): Could not convert using R function: as.data.frame.

Created on 2020-07-09 by the reprex package (v0.3.0)

Using the pre-PR code, I get

library(Rcpp)

f <- cppFunction('DataFrame d(){
DataFrame d = DataFrame::create();

NumericVector v1(5);
NumericVector v2(5);
DataFrame d1 = DataFrame::create(_["a"]=v1, _["b"] = v2);
d.push_back(v1);
SETLENGTH(wrap(v1), 2);
d.push_back(v2);
return d;
}')
f()
#> [[1]]
#> [1] 0 0
#> 
#> [[2]]
#> [1] 0 0 0 0 0
as.data.frame(f())
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 5

g <- cppFunction('DataFrame d(){
DataFrame d = DataFrame::create();

NumericVector v1(5);
NumericVector v2(6);
d.push_back(v1);
d.push_back(v2);
return d;
}')
g()
#> [[1]]
#> [1] 0 0 0 0 0
#> 
#> [[2]]
#> [1] 0 0 0 0 0 0
as.data.frame(g())
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 5, 6

Created on 2020-07-09 by the reprex package (v0.3.0)

My feeling is that both packages are constructing invalid dataframes, and probably getting them fixed would be good.

The fix for both is easy, and just involves changing a DataFrame to a List (which is essentially what it was before, after a push_back).

@eddelbuettel
Copy link
Member

My feeling is that both packages are constructing invalid dataframes

That is possible but we would probably have to find their code [and as I type this you post it, will study.]

@waltersom
Copy link
Contributor Author

@eddelbuettel I have a fix for lidR, will get one for metacart as well.

@eddelbuettel
Copy link
Member

Do you have a handle on why their length differ between columns? That seems unlikely to be on purpose.

[ And I don't recall from the top of my head whether we "help" by recycling, my instinct is to say we do not. ]

waltersom added a commit to waltersom/metacart that referenced this pull request Jul 9, 2020
The number of unique elements in different columns in x1 may be
different, which leads to attempting to add vectors of different length
into res. Previously, `push_back` on a DataFrame changed the DataFrame
to a list, which is being addressed in RcppCore/Rcpp#1099. This change,
combined with attempting to add vectors of different lengths, leads to
an error.
@waltersom
Copy link
Contributor Author

My fixes are at https://github.com/waltersom/lidR/tree/replace-dataframe-with-list and https://github.com/waltersom/metacart/tree/replace-dataframe-list.

For lidR, they don't even return the DataFrame/list, it's just a convenience, see
https://github.com/Jean-Romain/lidR/blob/9f30c6f8065b6fe6571c2d53457cfbab4a1fc3ca/src/LAS.cpp#L1214-L1215
I have to admit I don't really understand it, and it works most of the time, but there are occasions (which cause the test to fail) where they shorten a vector after adding it to the DataFrame, and it's already in another DataFrame. If we stick to a list, it works.

For metacart, I understand it even less. But the examples from ?FEmrt and ?REmrt run with this PR, and my fix to metacart.

Our push_* (as from as.data.frame) doesn't help by recycling, unless the length of a vector is 1.

@eddelbuettel
Copy link
Member

Nice work. I can't look at it end -- end of day / evening here. There usual process is to propose a change to the affected package to let it build with the existing (i.e. 1.0.5) and proposed Rcpp (where it currently only works with existing). That is usually done by email to them, politely asking for a fix in two weeks or so. Once done we can change Rcpp itself.

You've done so much already so let me be outrageous and ask directly: Would you be able to take care of preparing the patches and emails, i.e. drive all that from your end?

@waltersom
Copy link
Contributor Author

I was already preparing a PR for lidR (r-lidar/lidR#358), and am about to look at metacart next.

In both cases, the changes should work both with or without the change to Rcpp. It is only that once this (#1099) is merged, they break without the changes.

@eddelbuettel
Copy link
Member

You put that well to lidR in its 357 and 358. The fact is that Rcpp::DataFrame is not one of our strongest classes and I personally have always grown lists---precisely what you propose here---because DataFrame is half baked. Your PR will make it better so this kicks a few cans down the right road!

waltersom added a commit to waltersom/metacart that referenced this pull request Jul 9, 2020
The number of unique elements in different columns in x1 may be
different, which leads to attempting to add vectors of different length
into res. Previously, `push_back` on a DataFrame changed the DataFrame
to a list, which is being addressed in RcppCore/Rcpp#1099. This change,
combined with attempting to add vectors of different lengths, leads to
an error.
waltersom added a commit to waltersom/metacart that referenced this pull request Jul 9, 2020
The number of unique elements in different columns in x1 may be
different, which leads to attempting to add vectors of different length
into res. Previously, `push_back` on a DataFrame changed the DataFrame
to a list, which is being addressed in RcppCore/Rcpp#1099. This change,
combined with attempting to add vectors of different lengths, leads to
an error.
waltersom added a commit to waltersom/metacart that referenced this pull request Jul 9, 2020
The number of unique elements in different columns in x1 may be
different, which leads to attempting to add vectors of different length
into res. Previously, `push_back` on a DataFrame changed the DataFrame
to a list, which is being addressed in RcppCore/Rcpp#1099. This change,
combined with attempting to add vectors of different lengths, leads to
an error.
@waltersom
Copy link
Contributor Author

waltersom commented Jul 9, 2020 via email

@waltersom
Copy link
Contributor Author

One thing these packages failing highlights is that the error message when attempting to construct an invalid DataFrame is not very helpful.
It would be possible to check if vector of different (non-1) length are being added, and either print a more helpful error, or not call set__() (and thus revert to previous behaviour). There would be some (small) overhead from doing this, and it's not very elegant, but if others thought it would be helpful, I could add that.

@eddelbuettel
Copy link
Member

A warning() or error() is probably a good idea.

Food for thought: maybe make it even more complex and add an option() to fall back to old behavior? Or do we think that is a bad idea, and that we should enforce the same length now?

With this commit, if after a push_back the row count of the DataFrame would
be invalid, a warning is generated. The type is then not coerced to a
DataFrame, but left as a list.

In future, this could instead `stop()`.
Copy link
Member

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Edit: Whoops, mean to comment/approve the last commit only. Oh well ;-)

@waltersom
Copy link
Contributor Author

I have implemented a function set_type_after_push(), which checks if any columns have a length that is different from 1, the maximum number of rows, or an integer factor of the max. In such a case, it doesn't call set__() (so wouldn't trigger an error), and does call warning(). It might be preferable at some point to change to stop().

But, thinking about it, let me check what happens if one of the columns has zero length...

@eddelbuettel
Copy link
Member

Actually, using warning() is good because there is an existing hok for 'warnings-as-errors' that already gives us the stop().

@eddelbuettel
Copy link
Member

I guess you noticed

 Error: ----- FAILED[data]: test_dataframe.R<112--112>
   call| expect_equal(DataFrame_PushReplicateLength(), df)
   diff| Names: 2 string mismatches 
  Execution halted

ChangeLog Outdated
@@ -17,6 +17,13 @@
* inst/NEWS.Rd: Idem
* vignettes/rmd/Rcpp.bib: Idem
* inst/bib/Rcpp.bib: Idem
2020-07-01 Walter Somerville <waltersom@gmail.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

  • in chronological order
  • with proper empty lines

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just sent a trivial commit back to you for that so make sure you pull to your box.

This triggers the warning when adding a zero-length object to a
DataFrame. It is not a valid DataFrame, so calls `warning()`, and doesn't
call `as.data.frame()`.
@waltersom
Copy link
Contributor Author

Ok, I have corrected that failing test. Thanks for doing the changelog.

@Jean-Romain
Copy link

lidR maintainer here. The fix has been merged in master. The use of DataFrame was just a convenience to highlight the fact that all elements are the same length but the use of a List is valid as well.

Please keep me updated when you plan to release this change. I won't release a new version of lidR right now because I already made several releases recently. When you will be ready for a release of Rcpp I will release lidR ahead of time so this change won't break reverse dependency.

@eddelbuettel
Copy link
Member

Salut Jean-Romain

It works the other way around. We cannot upload until you upload as our now-stricter-in-this-PR code breaks your package, and CRAN will not admit "breaking" packages. We are not in a great hurry as we just released, and but the point also is to not have this linger. Could you update within, say, two weeks which is the common timeframe CRAN itself often gives?

@Jean-Romain
Copy link

Jean-Romain commented Jul 9, 2020

I know it works the other way. This is why I suggested to tell me when you plan to release Rcpp so I can release my package first. I believed that the common time-frame given by the CRAN was more like 1 month but I can release in two weeks if mandatory.

@eddelbuettel
Copy link
Member

eddelbuettel commented Jul 9, 2020

Rcpp will likely not have a release for many months [1], but that is no reason to push this off "forever". I apologize if we were unclear, we did not mean to put a gun to your head and scream "update NOW" but as we depend upon your cooperation here it would help us if you did it "sooner rather than later". Again, it does obviously not have to be this week. But everybody is busy and it helps to commit to a) action and b) a date. So if you could say "sure, but this is summer so give me three weeks" it would help us too.

[1] But planning is one thing. The possible need for a hot-fix bug release is always a possibility too.

@Jean-Romain
Copy link

No problem. It will definitively be done within a month maximum and I won't delay that forever.

@eddelbuettel
Copy link
Member

@waltersom the last commits of yours and mine got out of sync so you needed a merge -- I just cleaned that up with squash rebase and force-push. To catch you would need to reset your end a few commits and then pull.

@eddelbuettel
Copy link
Member

@Jean-Romain Thanks. If and when you do please let us know here so that we can proceed too.

@waltersom
Copy link
Contributor Author

waltersom commented Jul 10, 2020 via email

@eddelbuettel
Copy link
Member

eddelbuettel commented Jul 10, 2020

Just tested metacart 2.0-3, already on CRAN, and as expected it passes. Nice work!

@eddelbuettel
Copy link
Member

Hi @Jean-Romain

The last lidR release on CRAN was on July 5, which will be a month ago in a few days. Given that CRAN will close for the summer between August 14 and 24, I would really appreciate it if you could update lidR later this week. Respecting CRAN's wishes for updates only about once a month is important, but it would also be nice if you could help us let Rcpp move forward with the change from this PR.

@Jean-Romain
Copy link

The new version is now on CRAN.

@eddelbuettel
Copy link
Member

@Jean-Romain Thank you! I will proceed with Rcpp tests and changes then clearing the way for this PR.

@eddelbuettel
Copy link
Member

Thanks everybody for their patience, of course @waltersom for writing the PR in the first place, and to @jclaramunt for very promptly updating metacart, and to @Jean-Romain for doing the same with lidR (letting some time pass since its last update).

Reverse depends checks revealed no issue so this now goes into the main branch. Yay!!! And again, thanks to all for help!

@eddelbuettel eddelbuettel merged commit 05eea3d into RcppCore:master Aug 4, 2020
@eddelbuettel
Copy link
Member

Thanks again for everybody's patience. This is now 'live' in Rcpp 1.0.6 which just got to CRAN under our bi-annual release cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame converting to list on push_back with some c++ standards/compilers
4 participants