Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

emptyDrops does not terminate #14

Closed
s2hui opened this issue May 27, 2019 · 12 comments
Closed

emptyDrops does not terminate #14

s2hui opened this issue May 27, 2019 · 12 comments

Comments

@s2hui
Copy link

s2hui commented May 27, 2019

Hello,

I just upgraded to Bioconductor version 3.9.
When I run emptyDrops() on my count data it does not terminate (i.e. endless loop? goes on for hours). I have tried changing niters to be lower (i.e 100) but nothing seems to work.

When I revert back to Bioconductor version 3.8, emptyDrops works and terminates fine.

There are few emptyDrops detected overall (i.e. 5) but I would still like to know why it doesn't terminate when using the latest version. My count data is ~33K genes x ~1500 cells

Thanks
shui

@LTLA
Copy link
Collaborator

LTLA commented May 27, 2019

You don't provide the call to emptyDrops(), or even the sessionInfo(). There's not enough information for me to make an educated diagnosis. So instead I'll just use my psychic powers guess as to what the problem might be:

  • The only major change in emptyDrops() between DropletUtils 1.2.0 and 1.4.0 was to switch the random number generator. This should not be responsible for an infinite loop.
  • If you are on a Mac, running out of memory will typically cause a process to run very slowly without affecting anything else. By comparison, insufficient memory is quite obvious on Linux where the entire computer usually freezes.
  • Again, if you are on a Mac and you are using BPPARAM=MulticoreParam(), this can occasionally result in stalling.

@s2hui
Copy link
Author

s2hui commented May 27, 2019

The output for sessionInfo() currently is:

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/Current/Resources/lib/libRlapack.dylib

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DropletUtils_1.4.0 Matrix_1.2-17 dplyr_0.8.1
[4] Seurat_3.0.1 scran_1.12.0 EnsDb.Hsapiens.v86_2.99.0
[7] ensembldb_2.8.0 AnnotationFilter_1.8.0 GenomicFeatures_1.36.1
[10] AnnotationDbi_1.46.0 scater_1.12.2 ggplot2_3.1.1
[13] SingleCellExperiment_1.6.0 SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[16] BiocParallel_1.18.0 matrixStats_0.54.0 Biobase_2.44.0
[19] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.18.0
[22] S4Vectors_0.22.0 BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] plyr_1.8.4 igraph_1.2.4.1 lazyeval_0.2.2
[4] splines_3.6.0 listenv_0.7.0 digest_0.6.19
[7] htmltools_0.3.6 viridis_0.5.1 gdata_2.18.0
[10] magrittr_1.5 memoise_1.1.0 cluster_2.0.9
[13] ROCR_1.0-7 limma_3.40.2 globals_0.12.4
[16] Biostrings_2.52.0 R.utils_2.8.0 prettyunits_1.0.2
[19] colorspace_1.4-1 blob_1.1.1 ggrepel_0.8.1
[22] crayon_1.3.4 RCurl_1.95-4.12 jsonlite_1.6
[25] survival_2.44-1.1 zoo_1.8-5 ape_5.3
[28] glue_1.3.1 gtable_0.3.0 zlibbioc_1.30.0
[31] XVector_0.24.0 BiocSingular_1.0.0 Rhdf5lib_1.6.0
[34] future.apply_1.2.0 HDF5Array_1.12.1 scales_1.0.0
[37] DBI_1.0.0 edgeR_3.26.3 bibtex_0.4.2
[40] Rcpp_1.0.1 metap_1.1 viridisLite_0.3.0
[43] progress_1.2.2 reticulate_1.12 dqrng_0.2.1
[46] bit_1.1-14 rsvd_1.0.0 SDMTools_1.1-221.1
[49] tsne_0.1-3 htmlwidgets_1.3 httr_1.4.0
[52] gplots_3.0.1.1 RColorBrewer_1.1-2 ica_1.0-2
[55] pkgconfig_2.0.2 XML_3.98-1.19 R.methodsS3_1.7.1
[58] locfit_1.5-9.1 dynamicTreeCut_1.63-1 labeling_0.3
[61] reshape2_1.4.3 tidyselect_0.2.5 rlang_0.3.4
[64] munsell_0.5.0 tools_3.6.0 RSQLite_2.1.1
[67] ggridges_0.5.1 stringr_1.4.0 npsurv_0.4-0
[70] bit64_0.9-7 fitdistrplus_1.0-14 caTools_1.17.1.2
[73] purrr_0.3.2 RANN_2.6.1 pbapply_1.4-0
[76] future_1.13.0 nlme_3.1-140 R.oo_1.22.0
[79] biomaRt_2.40.0 compiler_3.6.0 beeswarm_0.2.3
[82] plotly_4.9.0 curl_3.3 png_0.1-7
[85] lsei_1.2-0 tibble_2.1.1 statmod_1.4.30
[88] stringi_1.4.3 lattice_0.20-38 ProtGenerics_1.16.0
[91] pillar_1.4.0 Rdpack_0.11-0 lmtest_0.9-37
[94] BiocNeighbors_1.2.0 data.table_1.12.2 cowplot_0.9.4
[97] bitops_1.0-6 irlba_2.3.3 gbRd_0.4-11
[100] rtracklayer_1.44.0 R6_2.4.0 KernSmooth_2.23-15
[103] gridExtra_2.3 vipor_0.4.5 codetools_0.2-16
[106] MASS_7.3-51.4 gtools_3.8.1 assertthat_0.2.1
[109] rhdf5_2.28.0 withr_2.1.2 sctransform_0.2.0
[112] GenomicAlignments_1.20.0 Rsamtools_2.0.0 GenomeInfoDbData_1.2.1
[115] hms_0.4.2 grid_3.6.0 tidyr_0.8.3
[118] DelayedMatrixStats_1.6.0 Rtsne_0.15 ggbeeswarm_0.6.0

@s2hui
Copy link
Author

s2hui commented May 27, 2019

And the call to emptyDrops is:

emptyDrops(counts(sce))

@LTLA
Copy link
Collaborator

LTLA commented May 27, 2019

I don't have much insight to give here. Your call and session information look fine to me. I have no problems running emptyDrops on a Mac, and it works fine on the Bioconductor build system.

@s2hui
Copy link
Author

s2hui commented May 28, 2019 via email

@LTLA
Copy link
Collaborator

LTLA commented May 29, 2019

I just noticed that your count matrix only has 1500 cells. You should really be running it on the raw count matrix prior to any cell calling; for a 10X experiment, this should have ~750,000 barcodes. emptyDrops() will not yield sensible results if you're calling it on barcodes that have already been filtered!

So the real solution is to use the raw count matrix as input into emptyDrops(). Nonetheless, the lack of termination may be a symptom of another problem, so if you can send me the offending count matrix, I can have a look at it. (Best to put it on dropbox or somewhere, rather than trying to email a large file.)

@s2hui
Copy link
Author

s2hui commented May 30, 2019

Hello,

Yes it has a small number of cells as I am QCing per batch and batch 5 only has 1588 cells?? The call to emptyDrops is the first call I make in QCing so prior to the call it is simply the raw count data loaded in.

The RData object is here:
https://drive.google.com/open?id=1w3LVafFU8Bw3NYSLVhHsGuKD_VBx0ktl

When I run the following with Bioconductor 3.8 (in 3.9 the call to emptyDrops hangs)

set.seed(100)
fdrCutoff = 0.01
e.out <- emptyDrops(counts(sce.i))
sum(e.out$FDR <= fdrCutoff, na.rm=TRUE)
print(paste0("Number empty drops: ",length(which(e.out$FDR > fdrCutoff)), " out of " , dim(sce.i)[2]))

[1] "Number empty drops: 1212 out of 1588"

counts(sce.i)
33694 x 1588 sparse Matrix of class "dgCMatrix"

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] DropletUtils_1.2.2 dplyr_0.8.1
[3] scran_1.10.2 EnsDb.Hsapiens.v86_2.99.0
[5] ensembldb_2.6.8 AnnotationFilter_1.6.0
[7] GenomicFeatures_1.34.8 AnnotationDbi_1.44.0
[9] scater_1.10.1 Seurat_2.3.4
[11] Matrix_1.2-17 cowplot_0.9.4
[13] ggplot2_3.1.1 SingleCellExperiment_1.4.1
[15] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[17] BiocParallel_1.16.6 matrixStats_0.54.0
[19] Biobase_2.42.0 GenomicRanges_1.34.0
[21] GenomeInfoDb_1.18.2 IRanges_2.16.0
[23] S4Vectors_0.20.1 BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] snow_0.4-3 backports_1.1.4 Hmisc_4.2-0
[4] plyr_1.8.4 igraph_1.2.4.1 lazyeval_0.2.2
[7] splines_3.5.0 digest_0.6.19 foreach_1.4.4
[10] htmltools_0.3.6 viridis_0.5.1 lars_1.2
[13] gdata_2.18.0 memoise_1.1.0 magrittr_1.5
[16] checkmate_1.9.3 cluster_2.0.9 mixtools_1.1.0
[19] ROCR_1.0-7 limma_3.38.3 Biostrings_2.50.2
[22] R.utils_2.8.0 prettyunits_1.0.2 colorspace_1.4-1
[25] blob_1.1.1 xfun_0.7 crayon_1.3.4
[28] RCurl_1.95-4.12 jsonlite_1.6 survival_2.44-1.1
[31] zoo_1.8-6 iterators_1.0.10 ape_5.3
[34] glue_1.3.1 gtable_0.3.0 zlibbioc_1.28.0
[37] XVector_0.22.0 kernlab_0.9-27 Rhdf5lib_1.4.3
[40] prabclus_2.2-7.1 DEoptimR_1.0-8 HDF5Array_1.10.1
[43] scales_1.0.0 mvtnorm_1.0-10 edgeR_3.24.3
[46] DBI_1.0.0 bibtex_0.4.2 Rcpp_1.0.1
[49] metap_1.1 dtw_1.20-1 progress_1.2.2
[52] viridisLite_0.3.0 htmlTable_1.13.1 reticulate_1.12
[55] foreign_0.8-71 bit_1.1-14 proxy_0.4-23
[58] mclust_5.4.3 SDMTools_1.1-221.1 Formula_1.2-3
[61] tsne_0.1-3 htmlwidgets_1.3 httr_1.4.0
[64] gplots_3.0.1.1 RColorBrewer_1.1-2 fpc_2.2-1
[67] acepack_1.4.1 modeltools_0.2-22 ica_1.0-2
[70] XML_3.98-1.19 pkgconfig_2.0.2 R.methodsS3_1.7.1
[73] flexmix_2.3-15 nnet_7.3-12 locfit_1.5-9.1
[76] dynamicTreeCut_1.63-1 tidyselect_0.2.5 rlang_0.3.4
[79] reshape2_1.4.3 munsell_0.5.0 tools_3.5.0
[82] RSQLite_2.1.1 ggridges_0.5.1 stringr_1.4.0
[85] yaml_2.2.0 npsurv_0.4-0 knitr_1.23
[88] bit64_0.9-7 fitdistrplus_1.0-14 robustbase_0.93-5
[91] caTools_1.17.1.2 purrr_0.3.2 RANN_2.6.1
[94] pbapply_1.4-0 nlme_3.1-140 R.oo_1.22.0
[97] biomaRt_2.38.0 hdf5r_1.2.0 compiler_3.5.0
[100] rstudioapi_0.10 curl_3.3 beeswarm_0.2.3
[103] png_0.1-7 lsei_1.2-0 statmod_1.4.32
[106] tibble_2.1.2 stringi_1.4.3 lattice_0.20-38
[109] ProtGenerics_1.14.0 pillar_1.4.1 BiocManager_1.30.4
[112] Rdpack_0.11-0 lmtest_0.9-37 BiocNeighbors_1.0.0
[115] data.table_1.12.2 bitops_1.0-6 irlba_2.3.3
[118] gbRd_0.4-11 rtracklayer_1.42.2 R6_2.4.0
[121] latticeExtra_0.6-28 KernSmooth_2.23-15 gridExtra_2.3
[124] vipor_0.4.5 codetools_0.2-16 MASS_7.3-51.4
[127] gtools_3.8.1 assertthat_0.2.1 rhdf5_2.26.2
[130] withr_2.1.2 GenomicAlignments_1.18.1 Rsamtools_1.34.1
[133] GenomeInfoDbData_1.2.0 hms_0.4.2 diptest_0.75-7
[136] doSNOW_1.0.16 grid_3.5.0 rpart_4.1-15
[139] tidyr_0.8.3 class_7.3-15 DelayedMatrixStats_1.4.0
[142] segmented_0.5-4.0 Rtsne_0.15 base64enc_0.1-3
[145] ggbeeswarm_0.6.0

@LTLA
Copy link
Collaborator

LTLA commented May 31, 2019

I just want the count matrix that you're putting into emptyDrops. Surely the count matrix is not 2.3 GB?

@LTLA
Copy link
Collaborator

LTLA commented May 31, 2019

Well, your Rdata file doesn't even have an sce.i object in it. It's just got a giant whopping Seurat object named so. I'll assume you're trying to run emptyDrops on so@raw_data, and I can more-or-less reproduce the lack of termination with:

load("so.qc.normsize.lognorm.scale.pca.tsne.res1.demarkers.12746.RCCTumour.RData")
library(DropletUtils)
system.time(out <- emptyDrops(so@raw.data)) 

The cause of this phenomenon is technically complex. For those who are interested, it seems that Boost's uniform_distribution class really doesn't like it when the maximum and minimum values are the same. This causes an infinite loop, presumably somewhere inside the class. I don't know whether this is a known behaviour or not, certainly the docs suggest that max can be equal to min.

However, the exact cause is irrelevant. Regardless of whether it terminates or not, what you are doing is almost certainly wrong. Assuming this is 10X data, you need to be giving emptyDrops the RAW, UNFILTERED count matrix that comes out of CellRanger. This should have about 750000 columns, not just 12000. You are currently trying to detect empty droplets in a count matrix that has already been filtered to remove empty droplets; the results will be nonsensical.

In fact, the real question is why the function ever worked in the first place - because it shouldn't have. (The problem with Boost only arises because the sum of ambient counts is zero.) So, I will add extra error checks to provide some defense against this, whereby the function will simply quit with an error if there are no low-count libraries available to estimate the ambient profile.

@s2hui
Copy link
Author

s2hui commented May 31, 2019

My apologies, the RData is actually here (I sent the wrong link):
https://drive.google.com/open?id=15fSAvOA4rPGY8WOj-Ibvx1yz8aZb7Sms

It is 7-8MB. Sorry about that. Please find a single cell experiment object (sce) that is ~33K x 1588. It has not been QCd or normalized and contains raw counts.

shui

@LTLA
Copy link
Collaborator

LTLA commented Jun 1, 2019

Perhaps I was less than clear.

emptyDrops() needs the unfiltered matrix. If you only have 1588 cells, your matrix has already been filtered by CellRanger. You need the UNFILTERED matrix. Please, read the description HERE.

If you are trying to run emptyDrops() on this 1588-cell matrix, what you are doing is WRONG. Even if the function were to return something, the results would be NONSENSE.

I have now modified the function so that you CANNOT do what you are trying to do. The function will simply fail with an error message. This is the only way to protect you from yourself.

@s2hui
Copy link
Author

s2hui commented Jun 1, 2019

Yes you were less than clear to me but thanks for your help anyways.

shui

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants