Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame.push_back() fails for zero-row columns #1232

Closed
jeroen opened this issue Oct 3, 2022 · 8 comments
Closed

DataFrame.push_back() fails for zero-row columns #1232

jeroen opened this issue Oct 3, 2022 · 8 comments

Comments

@jeroen
Copy link
Contributor

jeroen commented Oct 3, 2022

Downstream bug report: ropensci/pdftools#108
I think the bug is that DataFrame.push_back() gives an error when trying to add a column to a data frame with 0 rows:

DataFrame test (int n) {

  IntegerVector foo(n);
  CharacterVector bar(n);
  CharacterVector baz(n);

  Rcpp::DataFrame df = DataFrame::create(
    _["foo"] = foo,
    _["bar"] = bar
  );
  df.push_back(baz, "baz");
  return df;
}
test(0);
# Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
# In addition: Warning messages:
# 1: In poppler_pdf_data(loadfile(pdf), font_info, opw, upw) :
#   Column sizes are not equal in DataFrame::push_back, object degrading to List
@eddelbuettel
Copy link
Member

Thanks, I'll try to take a look, should be possible to cover that case.

On the other hand ... I think that case is known and is simply a 15-year old design want meaning you should

  • keep everything in Rcpp::List or individual column til you are done
  • with number of columns known you can call Rcpp::DataFrame::create

@eddelbuettel
Copy link
Member

This (actually complete) example works:

Code

#include <Rcpp/Lightest>

// [[Rcpp::export]]
Rcpp::List listtest (int n) {

  Rcpp::IntegerVector foo(n);
  Rcpp::CharacterVector bar(n);
  Rcpp::CharacterVector baz(n);

  Rcpp::List ll = Rcpp::List::create(Rcpp::Named("foo") = foo,
                                     Rcpp::Named("bar") = bar);
  ll.push_back(baz);
  return ll;
}

/*** R
listtest(0)
*/

Output

> Rcpp::sourceCpp("/tmp/issue1232.cpp")

> listtest(0)
$foo
integer(0)

$bar
character(0)

[[3]]
character(0)

> 

There is a convenience converter (that IIRC actually calls back into R) that can turn that object into a df too:

Added code

// [[Rcpp::export]]
Rcpp::DataFrame dftest (int n) {

  Rcpp::IntegerVector foo(n);
  Rcpp::CharacterVector bar(n);
  Rcpp::CharacterVector baz(n);

  Rcpp::List ll = Rcpp::List::create(Rcpp::Named("foo") = foo,
                                     Rcpp::Named("bar") = bar);
  ll.push_back(baz);
  return Rcpp::DataFrame(ll);
}

Output

> Rcpp::sourceCpp("/tmp/issue1232.cpp")

> #listtest(0)
> dftest(0)
[1] foo          bar          character.0.
<0 rows> (or 0-length row.names)
> 

So maybe the matter is 'merely' the missing name?

Can you live with this 'list til I return' variant?

@jeroen
Copy link
Contributor Author

jeroen commented Oct 4, 2022

Can you live with this 'list til I return' variant?

Yes I can do that, but what is the best way to pass stringsAsFactors = false for this case? Previously I had this:

https://github.com/ropensci/pdftools/blob/cf2099a56b873b43e2f542d9b05185baad5eda99/src/bindings.cpp#L237-L245

But then if we first store a list, the _["stringsAsFactors"] = false part needs to move to Rcpp::DataFrame(ll). Should I use Rcpp::DataFrame::create(ll, _["stringsAsFactors"] = false) ?

@eddelbuettel
Copy link
Member

eddelbuettel commented Oct 4, 2022

Not having tested this, the second should work.

That echos what I wrote earlier about collecting your dimensions and calling the instantiation at the last step. I am aware of a few packages doing just that, and setting stringsAsFactors that way.

@eddelbuettel
Copy link
Member

I am not fully convinced we want to or need to support the use case you have here but the following diff achieves it:

modified   inst/include/Rcpp/DataFrame.h
@@ -140,10 +140,12 @@ namespace Rcpp{
                     max_rows = Rf_xlength(*it);
                 }
             }
-            for (it = Parent::begin(); it != Parent::end(); ++it) {
-                if (Rf_xlength(*it) == 0 || ( Rf_xlength(*it) > 1 && max_rows % Rf_xlength(*it) != 0 )) {
-                    // We have a column that is not an integer fraction of the largest
-                    invalid_column_size = true;
+            if (max_rows > 0) {
+                for (it = Parent::begin(); it != Parent::end(); ++it) {
+                    if (Rf_xlength(*it) == 0 || ( Rf_xlength(*it) > 1 && max_rows % Rf_xlength(*it) != 0 )) {
+                        // We have a column that is not an integer fraction of the largest
+                        invalid_column_size = true;
+                    }
                 }
             }
             if (invalid_column_size) {

@Enchufa2
Copy link
Member

Enchufa2 commented Oct 4, 2022

Seems like a reasonable change to me to resemble what happens in R, and therefore to apply the principle of least surprise:

> foo <- integer(0)
> bar <- character(0)
> baz <- character(0)
> df <- data.frame(foo, bar)
> cbind(df, baz)
[1] foo bar baz
<0 rows> (or 0-length row.names)

@eddelbuettel
Copy link
Member

I agree. I am inclined to test it. And when you run the example by @jeroen under the diff above you even get a reasonable filled-in name (as push_back does not set one):

edd@rob:~/git/rcpp(master)$ Rscript -e 'Rcpp::sourceCpp("/tmp/issue1232.cpp")'                           
                                                                                                         
> origissue(0)                                                                                           
[1] foo          bar          character.0.                                                               
<0 rows> (or 0-length row.names)                                                                         
edd@rob:~/git/rcpp(master)$

@eddelbuettel
Copy link
Member

As expected, no issues in a reverse-dependency check. I just pushed the branch. I will a new unit test for good measure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants