Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced some unique() calls in power_prune() with mean() #556

Conversation

phageghost
Copy link
Contributor

This ensures that iv.se will be a scalar not a vector even when the outcome.samplesize varies within outcome subsets.

The current behavior assumes that the samplesize is the same within each outcome subset, and that unique will therefore return a scalar. When this assumption is violated, it returns a vector of samplesize values that is likely to differ in length from the total size of the subset and will therefore produce a size mismatch when assigning it as a column in that data.frame.

Removing the unique() call and allowing iv.se to be a vector of the size of the outcome subset will not produce an error but it will cause the outcome subsets to be selected according to the largest samplesize among any of the IV SNPs in that subset, which is not the desired behavior.

Using the mean() of the samplesize ensures that iv.se is the same for every SNP in the outcome subset (and that the selection logic will therefore work properly) while taking into account all the SNPs in the subset.

@phageghost
Copy link
Contributor Author

phageghost commented Aug 29, 2024

I also applied this reasoning to the n.cas and n.con variables, but have not tested that aspect since I am dealing with a continuous outcome. Might want to check into the "method 1" approach as well, didn't test that either.

Copy link
Contributor

@mightyphil2000 mightyphil2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks phageghost! This looks like a good improvement. Thanks for flagging and fixing.

Copy link

codecov bot commented Sep 2, 2024

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 38.70%. Comparing base (7976c28) to head (d7ecf52).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
R/format_mr_results2.R 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #556       +/-   ##
===========================================
- Coverage   48.81%   38.70%   -10.12%     
===========================================
  Files          31       31               
  Lines        5248     5247        -1     
===========================================
- Hits         2562     2031      -531     
- Misses       2686     3216      +530     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -341,14 +339,15 @@ power_prune <- function(dat,method=1,dist.outcome="binary")
r2sum<-sum(r2) # sum of the r-squares for each SNP in the instrument
# F<-r2sum*(n-1-k)/((1-r2sum*k )
if(dist.outcome == "continuous"){
iv.se<- 1/sqrt(unique(dat2$samplesize.outcome)*r2sum) #standard error of the IV should be proportional to this
iv.se<- 1/sqrt(mean(dat2$samplesize.outcome)*r2sum) #standard error of the IV should be proportional to this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would feel a bit safer to me if you could you add in na.rm = TRUE to the mean() call, i.e.,

iv.se<- 1/sqrt(mean(dat2$samplesize.outcome, na.rm = TRUE)*r2sum)

if(any(is.na(n.cas)) || any(is.na(n.con))) {
warning("dist.outcome set to binary but number of cases or controls is missing. Will try using total sample size instead but power pruning will be less accurate")
iv.se<- 1/sqrt(unique(dat2$samplesize.outcome)*r2sum)
}
iv.se<- 1/sqrt(mean(dat2$samplesize.outcome)*r2sum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I would add the na.rm = TRUE, also I wonder if we should round as well, i.e.,

iv.se<- 1/sqrt(round(mean(dat2$samplesize.outcome, na.rm = TRUE))*r2sum)

}
iv.se<- 1/sqrt(mean(dat2$samplesize.outcome)*r2sum)
} else {
iv.se<-1/sqrt(mean(n.cas)*mean(n.con)*r2sum) #standard error of the IV should be proportional to this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And same here, I would amend to

iv.se<-1/sqrt(round(mean(n.cas, na.rm = TRUE))*round(mean(n.con, na.rm = TRUE))*r2sum)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Tom! I believe I've accepted those changes now. Did it work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't work Philip - I think you probably ticked check boxes to say you'd viewed the comment (or something like) - so I have just made a new release with the na.rm = TRUEs in.

@remlapmot
Copy link
Contributor

Thanks for this.

Don't worry about the codecov failures (the overall codecov % just depends upon how busy the OpenGWAS server is).

If you could make my suggested amends I think the code will be a bit safer.

@mightyphil2000 mightyphil2000 merged commit d70ec00 into MRCIEU:master Sep 3, 2024
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants