Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fwrite: avoid converting matrix input to data.table #3188

Open
mattdowle opened this issue Dec 6, 2018 · 7 comments · May be fixed by #4635
Open

fwrite: avoid converting matrix input to data.table #3188

mattdowle opened this issue Dec 6, 2018 · 7 comments · May be fixed by #4635
Labels
Milestone

Comments

@mattdowle
Copy link
Member

mattdowle commented Dec 6, 2018

PR #3125 implemented #2613 to enable to fwrite to accept matrix. This was merged in v1.12.0. However it converts matrix to data.table which can be costly.
The convert can be avoided as follows.
fwriteR.c contains this :

for (int j=0; j<args.ncol; j++) {
    SEXP column = VECTOR_ELT(DFcoerced, j);
    if (args.nrow != length(column))
      error("Column %d's length (%d) is not the same as column 1's length (%d)", j+1, length(column), args.nrow);
    int32_t wf = whichWriter(column);
    if (wf<0) {
      error("Column %d's type is '%s' - not yet implemented in fwrite.", j+1, type2char(TYPEOF(column)));
    }
    args.columns[j] = (wf==WF_CategString ? column : (void *)DATAPTR(column));
    args.whichFun[j] = (uint8_t)wf;
    if (TYPEOF(column)==VECSXP && firstListColumn==0) firstListColumn = j+1;
  }

This loop populates the pointers in arg.columns[] which are then passed to fwrite.c. That DATAPTR returns a pointer to where the data (e.g. int *, double *) for the R vector starts. In the case of matrix, this args.columns[] just needs to be populated with offsets into matrix vector. In R a matrix is a single very long vector just with dimension attribute attached. A matrix in R is columnar, just like a data.table, so there doesn't need to be a transpose. It should be fairly simple to do and work well.

Careful to add tests for a factor matrix as well as a character matrix.

I described how to do it in case @fparages or @MichaelChirico wanted to give it a go.

@mattdowle mattdowle added this to the 1.12.0 milestone Dec 6, 2018
@MichaelChirico
Copy link
Member

MichaelChirico commented Dec 6, 2018

Agree the as.data.table approach is a bit of a have and this is right way to do things.

How does one point to matrix column names in C API?

Don't think I'll have time before next year to work on something that for me & my lack of C experience will end up being quite time consuming (only mentioning because I see the 1.12.0 milestone and know you mentioned maybe pushing out soon given the memory fixes)

@mattdowle
Copy link
Member Author

mattdowle commented Dec 6, 2018

Sure thing.

args.columns[j] = (char *)DATAPTR(matrix) + j*size*nrow;

where size is 4 for int and 8 for double, for example.

@jangorecki
Copy link
Member

instead of DATAPTR isn't it more appropriate to use switch(TYPEOF(matrix)) ... INTEGER(matrix) ... REAL(matrix)?

@mattdowle
Copy link
Member Author

mattdowle commented Dec 6, 2018

It's more convenient and appropriate to use DATAPTR. But we're not supposed to is more the point because it's not part of the R API. I wish it were and I don't see why it isn't. It should be the switch() way to be compliant, yes.
We could create our own dataptr() function that does the switch. Seems silly to have to do so though when DATAPTR could just be added to the R API. There must be a good reason why it isnt.
Maybe now that INTEGER, REAL and DATAPTR are no longer ever macros, even under USE_RINTERNALS, the original reason for not exposing DATAPTR in R API might have gone away.

@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@raneameya
Copy link

I'd like to draw attention to the fact that matrices can have rownames which in the current version are stripped away in the as.data.table conversion. May I please ask that rownames be written as the first column by default if it is sensible? If not, maybe an additional argument to include rownames, like as.data.table?

@mattdowle
Copy link
Member Author

mattdowle commented Jan 23, 2019

Re rownames: sure thing. If rownames are the default 1:nrow then can we assume please they aren't really rownames and not write them by default? Internally R uses c(NA,INT_MIN) to represent default row names efficiently, so it's safe and efficient to test that attribute to see if row.names is a length 2 integer vector holding those values. With an argument to explicitly specify row names too of course; we're just talking about the default here.

@raneameya
Copy link

Thanks Matt!

@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@MichaelChirico MichaelChirico linked a pull request Jul 23, 2020 that will close this issue
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants