Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review comments #5

Closed
LTLA opened this issue Oct 23, 2020 · 5 comments
Closed

Review comments #5

LTLA opened this issue Oct 23, 2020 · 5 comments

Comments

@LTLA
Copy link
Collaborator

LTLA commented Oct 23, 2020

As requested, I've gone through the package. Points in no particular order beyond when I encountered them.

Vignettes

  • Missing a Suggests: on ggpubr, used in the vignette.
  • "diagnosism" in MiDAS_tutorial.Rmd is presumably a typo.
  • A couple of blocks in the vignette have include=FALSE, which makes for weird reading in the compiled HTML as you are referring to code chunks and results that do not appear.
  • I don't understand the table for the conditional results from runMIDAS. The vignette says "for each step", but why is there only one table? The non-significant genes showed up in the previous table - where did they go?
  • Some words on "HLA association fine-mapping on amino acid level" would be helpful.
  • A bit of remaining work, e.g., "Add results summary and interpretation here." and a few author comments slipping through, e.g., "It should be relatively straightforward to add a more simple analysis type named hla_het, right? ".

R documentation

The @return is often very sparse:

#' @return Tibble containing analysis results.

This doesn't give a lot to go on. What are the rows? What are the columns?

R code

There is no clear reason for the MiDAS class to exist, given that it does not add any new slots to the MAE. Why not just use an MAE directly? This would make it easier to interoperate with other packages that produce MAEs; but more importantly, it would save you from having to explain to the user what your custom MiDAS class is.

I see you have overloaded as.data.frame for MiDAS class objects. I would not consider this a good idea; I don't have a good mental model for what it means to coerce an MAE-class object into a data.frame. If you need to make a wide DF, you should use a function that is explicitly named as such, see, for example, MultiAssayExperiment::wideFormat().

What's with all the S3 logic in MiDAS_transformationFunctions.R? Mixing S3 and S4 is generally not a good idea. Seems like you could make your life much simpler by just writing a regular function and having a little clause at the top to check whether you're dealing with a matrix or an SE, extracting the former from the assays of the latter if necessary. If you're not expecting other people to extend your methods, there's really no need to make them generics.

I would also say that most of your dependencies are unnecessary, in particular all the tidyverse stuff, and you could make the package leaner and more performant with base R.

Don't understand the point of global.R, your objects seem to be perfectly accessible without it.

Tests

Not all of them pass on my machine.

  Runningtestthat.R’ [452s/458s]
 ERROR
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  Component "statistic": 'is.NA' value mismatch: 0 in current 1 in target
  Component "p.value": 'is.NA' value mismatch: 0 in current 1 in target
  Component "conf.low": 'is.NA' value mismatch: 0 in current 1 in target
  Component "conf.high": 'is.NA' value mismatch: 0 in current 1 in target

  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 351 | SKIPPED: 0 | WARNINGS: 23 | FAILED: 5 ]
  1. Failure: filterExperimentByFrequency (@test_MiDAS_transformation.R#443)
  2. Failure: applyInheritanceModel (@test_MiDAS_transformation.R#688)
  3. Failure: LRTest (@test_MiDAS_utils.R#188)
  4. Failure: midasToWide (@test_MiDAS_utils.R#354)
  5. Failure: iterativeModel (@test_MiDAS_utils.R#432)
@lawremi
Copy link
Collaborator

lawremi commented Oct 23, 2020

Thanks a lot for your review @LTLA

@chammer80
Copy link
Collaborator

Thanks a lot for the review, @LTLA. @mcjmigdal will reply to the more technical comments, here's some feedback on the vignette:

  • Thanks for noticing the typos, comments, and not included code chunks. The only code chunks I don't echo are the ones to output generated tables and figures, to direct focus to the package functions. I've corrected all of that and added a bit more explanations where suggested.
  • The "conditional" flag runs step-wise conditional analyses, always adding the respective top associated variable to the results table for each iteration, until a defined P value threshold is reached. This is to find out how many statistically independent signals there are in the data. That said, there is also a "keep" flag, explained in the "runMiDAS" description, allowing you to keep the complete results tables for each iteration.

@mcjmigdal
Copy link
Collaborator

mcjmigdal commented Oct 28, 2020

@LTLA thank you for the revision and some insightful comments :) I've created issues to the most obvious ones.

Vignettes
Thank you for your suggestions and sorry for some unfinished parts. We will update the vignettes and the description #7, #8

I don't understand the table for the conditional results from runMIDAS. The vignette says "for each step", but why is there only one table? The non-significant genes showed up in the previous table - where did they go?

Answered by Christian

R documentation

👍 #9

R code

There is no clear reason for the MiDAS class to exist, given that it does not add any new slots to the MAE. Why not just use an MAE directly? This would make it easier to interoperate with other packages that produce MAEs; but more importantly, it would save you from having to explain to the user what your custom MiDAS class is.

That was to focus those methods more on the problem of HLA/KIR data. Although, most of the concepts we are using in our analyses could be easily extended to other types of data, we used some assumptions that holds in the frame of the package. One example is that experiments can have only 1 assay, or that the rownames cannot be duplicated. One of the functions that heavily takes advantage of those assumptions is the as.data.frame function, which you refereed to in your next comment.

I see you have overloaded as.data.frame for MiDAS class objects. I would not consider this a good idea; I don't have a good mental model for what it means to coerce an MAE-class object into a data.frame. If you need to make a wide DF, you should use a function that is explicitly named as such, see, for example, MultiAssayExperiment::wideFormat().

In the package the use of as.data.frame.MiDAS is dictated by what R's statistical models expects as an input data format. Conveniently enough most of the models (like lm) allow to pass any object as data argument and will coerce it to data.frame if there exist specific method. This allows us to do stuff like lm(Y ~ X, data=MiDAS_obj). Such approach allowed us to work with different models and settings.
You might have also noticed that as.data.frame.MiDAS is loosely based on MultiAssayExperiment::wideFormat(). Unfortunately, we were unsuccessful at using directly MultiAssayExperiment::wideFormat() due to it slownes.

What's with all the S3 logic in MiDAS_transformationFunctions.R? Mixing S3 and S4 is generally not a good idea. Seems like you could make your life much simpler by just writing a regular function and having a little clause at the top to check whether you're dealing with a matrix or an SE, extracting the former from the assays of the latter if necessary. If you're not expecting other people to extend your methods, there's really no need to make them generics.

Why it is not a good idea? 😅
I simply find it convenient, development wise it was also useful to be able to extend those methods easily.

I would also say that most of your dependencies are unnecessary, in particular all the tidyverse stuff, and you could make the package leaner and more performant with base R.

Hard to disagree, however at this point it would be very costly to go back to base R. EDIT: However, the use of tidy functions from broom packages is critical as we are using it to get standardized outputs from statistical models. This actually allows us to be flexible when it comes to model choice. As tidy output is tibble, the use of tidyverse stuff is to some extend explained. I still do agree that it could be much reduced, but it would be very costly at this point.

Don't understand the point of global.R, your objects seem to be perfectly accessible without it.

That is really only to prevent warnings on devtools::check like

distGrantham: no visible binding for global variable
‘dict_dist_grantham’
getFrequencies: no visible binding for global variable

@LTLA
Copy link
Collaborator Author

LTLA commented Oct 29, 2020

Although, most of the concepts we are using in our analyses could be easily extended to other types of data, we used some assumptions that holds in the frame of the package.

Sounds restrictive. A better approach would be for your functions to handle these situations gracefully rather than coercing all users to use a custom class. Typical behaviors would be to emit warnings on duplicated row names, and then taking the first instance; or providing options to choose the assay to be used from each experiment, either as a string or integer scalar.

Unnecessary subclassing really breaks interoperability. Imagine a situation where someone else writes another package for this type of dataset, and they make their own MAE subclass; then anyone who wants to use both packages will have to convert from one class to another, possibly lossily. If your package can accept an MAE, you'll never be the offending party in any exchange.

This allows us to do stuff like lm(Y ~ X, data=MiDAS_obj). Such approach allowed us to work with different models and settings.

Why not just lm(Y ~ X, data=midasToWide(mae))? Nice and explicit, no room for ambiguity, no need for readers to wonder what it really means for a MAE to be converted into a data.frame. (For example, to me, the most intuitive as.data.frame method for MAEs would just be equivalent to colData(mae), rather than any wide-format conversion.) A few extra characters but ease of code comprehension beats ease of writing any day of the week. And if ease of typing is really a major concern to you, you shouldn't have a package name where there's a fluctuation in the case, which is a real pain to type correctly.

Why it is not a good idea?

Because it'll make your maintenance programmer think that there's something else going on under the hood. I only mix S3 and S4 when I absolutely have to, e.g., because I want my class object to be used in functions that use S3 dispatch. Typical examples would be DelayedArray::as.matrix.Array or S4Vectors::as.data.frame.Hits. In your case, though, there is no need for this, and it introduces an unnecessary extra thing for a future programmer to worry about, namely, "there must be a reason why they're using S3 methods with S4 classes, rather than using pure S4 or ordinary functions. Why is that?"

You would make your maintainer's life easier by either using pure S4 or, TBH, just making an ordinary function with a check for the class of the incoming argument. If you're talking about developer simplicity, you can't really beat the latter.

However, the use of tidy functions from broom packages is critical as we are using it to get standardized outputs from statistical models

I doubt it's that critical, I and many others get along with base R just fine.

@mcjmigdal
Copy link
Collaborator

Although, most of the concepts we are using in our analyses could be easily extended to other types of data, we used some assumptions that holds in the frame of the package.

Sounds restrictive. A better approach would be for your functions to handle these situations gracefully rather than coercing all users to use a custom class. Typical behaviors would be to emit warnings on duplicated row names, and then taking the first instance; or providing options to choose the assay to be used from each experiment, either as a string or integer scalar.

Unnecessary subclassing really breaks interoperability. Imagine a situation where someone else writes another package for this type of dataset, and they make their own MAE subclass; then anyone who wants to use both packages will have to convert from one class to another, possibly lossily. If your package can accept an MAE, you'll never be the offending party in any exchange.

Except object creation and filtration, all of the functionalities outputs a data.frame or a list. Then after creating MiDAS, being MAE child it is still usable to functions handling MAE. So interoperability is maintained, perhaps a nice addition would be a method for MAE -> MiDAS conversion.

This allows us to do stuff like lm(Y ~ X, data=MiDAS_obj). Such approach allowed us to work with different models and settings.

Why not just lm(Y ~ X, data=midasToWide(mae))? Nice and explicit, no room for ambiguity, no need for readers to wonder what it really means for a MAE to be converted into a data.frame. (For example, to me, the most intuitive as.data.frame method for MAEs would just be equivalent to colData(mae), rather than any wide-format conversion.) A few extra characters but ease of code comprehension beats ease of writing any day of the week. And if ease of typing is really a major concern to you, you shouldn't have a package name where there's a fluctuation in the case, which is a real pain to type correctly.

I would say it is not the ease of typing we are after but readability of the code, as model definition is integral part of our workflow it seem natural to have as.data.frame method. What also differs this function from *toWide, is that the result data frame must contain more complete description of parent MAE (including information on experiments). This transformation is one of the assumptions we take in MiDAS class, so that we are able to easily create statistical models out of it.
Concept of lm(Y ~ X, data=toWide(MAE)) is interesting IMO, but practically applicable only to subset of MAE that have only one assay per experiment and no duplicated rownames (MiDAS assumptions).

Why it is not a good idea?

Because it'll make your maintenance programmer think that there's something else going on under the hood. I only mix S3 and S4 when I absolutely have to, e.g., because I want my class object to be used in functions that use S3 dispatch. Typical examples would be DelayedArray::as.matrix.Array or S4Vectors::as.data.frame.Hits. In your case, though, there is no need for this, and it introduces an unnecessary extra thing for a future programmer to worry about, namely, "there must be a reason why they're using S3 methods with S4 classes, rather than using pure S4 or ordinary functions. Why is that?"

You would make your maintainer's life easier by either using pure S4 or, TBH, just making an ordinary function with a check for the class of the incoming argument. If you're talking about developer simplicity, you can't really beat the latter.

I doubt it's that confusing for maintenance programmers. The as.data.frame method is the only S3 method I've defined for MiDAS, the other S3 methods are defined for matrixces and SummarizedExperiments. By using S3 mechanism it is now easy to accommodate if addition of new experiment would be better handled by say DelayedArray.

However, the use of tidy functions from broom packages is critical as we are using it to get standardized outputs from statistical models

I doubt it's that critical, I and many others get along with base R just fine.

In our use cases of MiDAS we are commonly using at least 4 statistical models (lm, glm, coxph, coxme). One way to handle this problem would be to handle each of these models separately, possibly in base R. The use of tidy gives us easy way to do it, at the same time we can take the advantage of the many other models that tidy handles. We could not beat that ourselves implementing dedicated methods to each different statistical model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants