Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Naming support #197

Closed
wants to merge 1 commit into from
Closed

Naming support #197

wants to merge 1 commit into from

Conversation

yaccos
Copy link
Contributor

@yaccos yaccos commented Dec 10, 2020

As we dicussed yesterday at issue #190, you were worried that mimicing the behavior of base::rowSums, returning the rownames in the results would result in a performance overhead. I have now spent the last two days resolving the problem, and created an implementation touching only the C-code. At the time being, I have only implemented the feature for rowSums2 (and consequently for colSums2), but expanding the solution to the rest of the family should be straight-forward. I have tested the package and modified the tests to check whether the names are correct. I leave it to you to benchmark the modifications, but here is what I would expect to happen:

  • For unnamed arguments the modified version should barely give any overhead at all. This is due to the fact that the only overhead is a quick check of whether a relevant dimname of the input exists.
  • For named arguments which pick all indecies among the leading dimension (rows for rowSums2 and columns for colSums2), we can more or less copy the pointer for the name attribute directly, reducing the overhead to a small and constant number of instructions.
  • For named arguments which pick a subset of the indecies among the leading dimension, we will have overhead from iterating over these indecies and subsequent extraction and casting. Still, I think the overhead will we slight.

Also, I notice that the package installed size of the package exceeds the recommended limit of 10 MB and that some of the function factory macros are hard to read. You could migitate this problem by writing a custom version of validateIndecies (in the C version of course) which always coerse the data under ansNidxs to R_xlen_t in the case of SUBSETTED_INTEGER and SUBSETTED_REAL, leaving the subsettedType parameter superflous. For functions like rowSums2, the coersion to R_xlen_t must be done anyway. In the case of SUBSETTED_ALL, a NULL pointer is returned, making it easy to deal with this case separately.

@yaccos
Copy link
Contributor Author

yaccos commented Dec 10, 2020

Sorry, I forgot to remove some debugging print statements in naming.c, so you know were the strange output comes from.

@yaccos
Copy link
Contributor Author

yaccos commented Feb 3, 2021

Have you had time to review my solution yet? It was primarly meant as an inspiration to how the API could deal with naming in an efficient manner. Also if it is important to reduce overhead by calling R-code, I could advise that you replace dim. <- as.integer(dim.) and na.rm <- as.logical(na.rm) in the R-code with calls to coerceVector in the C-code.

@HenrikBengtsson
Copy link
Owner

Hi @yaccos, sorry for the radio silence. I've just got too many things on my plate and earth has this habit of doing one revolution every 24 hours - just too fast. Thank you so much for this proof-of-concept PR and for spending time on this problem.

I agree it would be useful if the matrixStats API could provide consistent handling of name attributes. I'm worried that rolling it out will disruptive and that's why this apparently "straightforward" task has been procrastinated for years now. I've spent a couple of days thinking about and writing up a plan for rolling it out safely; the gist is to roll it out with 100% backward compatibility to today's inconsistent behavior, while still making it available for users and developers to start making use of it. This way we can punt on the problem of switch the default behavior.

Coincidentally, I think it is a good candidate project for R-GSOC 2021, so I wrote up the plan with that in mind. See https://github.com/HenrikBengtsson/matrixStats/wiki/GSOC-2021-Proposal for how I think about approaching this problem. What do you think? If you think this is a good idea, let's have a chat offline (email or something).

@yaccos
Copy link
Contributor Author

yaccos commented Feb 13, 2021

Yes, I understand very well that time and effort is a limited resource, especially when being an academic and not being payed to develop and maintain R packages. As popular and useful as the matrixStats package is, I consider it sad that there are lacking sufficient resources to maintain it. I therefore support your suggestion of including the task of naming consitency for R-GSOC 2021, and will send you an email in a few days to clearify the details on my views.

@SebKrantz
Copy link

Hello both of you, I‘d like to know the status of this. I‘d like to move forward to publish a package that implements this very soon (sooner than end of the summer or end of the year). I think this solution is pretty good, but if it does not live up to your demands Henrik or you don't have time to swiftly review it if I apply this to the rest of the functions I‘ll be happy to publish matrixStats2. From yaccos I‘l like to inquire if you have made any further progress on this or thought about publishing a package in the meantime that implements it. Best regards, Sebastian.

@HenrikBengtsson HenrikBengtsson added the names-attribute Involves names(), rownames(), and colnames() of results label May 21, 2021
@HenrikBengtsson
Copy link
Owner

For the record: Support for useNames is now available in matrixStats (>= 0.60.0). This is part of a GSOC 2021 project. It is currently implemented in R and work on implementing this in C is on its way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
names-attribute Involves names(), rownames(), and colnames() of results
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants