
segfault, caught bus error, non-existent physical address #188

Open
ahwanpandey opened this issue May 13, 2024 · 6 comments

@ahwanpandey

Hello,

Thanks for this tool.

I submitted some "run_numbat" jobs to our cluster. They seem to have produced all the result files, and the plots and data files are all there.

(screenshot of the output folder contents attached)

But the std err of the job output has a bunch of errors. The job State says "OUT_OF_MEMORY", yet the exit code is 0, meaning it is reported as successful. Also, the memory utilised is 470.93 GB.

Job ID: 19079833
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:19:41
CPU Efficiency: 53.46% of 2-18:04:48 core-walltime
Job Wall-clock time: 04:07:48
Memory Utilized: 470.93 GB
Memory Efficiency: 470.93% of 100.00 GB

I've attached the log and the std err as follows

log.txt
Numbat.AOCS_055_2_0.Step2_run_numbat.19079833.papr-res-compute01.err.txt

Here is my R sessionInfo()

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4        data.table_1.15.4  sp_2.1-3           SeuratObject_4.1.0 Seurat_4.1.1       numbat_1.4.0       Matrix_1.5-4      

loaded via a namespace (and not attached):
  [1] Rtsne_0.17            colorspace_2.1-0      ggtree_3.4.0          deldir_2.0-4          scistreer_1.2.0       ggridges_0.5.6        fs_1.6.4              aplot_0.2.2          
  [9] spatstat.data_3.0-4   rstudioapi_0.16.0     leiden_0.4.3.1        listenv_0.9.1         farver_2.1.1          graphlayouts_1.1.1    ggrepel_0.9.5         fansi_1.0.6          
 [17] hahmmr_1.0.0          codetools_0.2-20      splines_4.2.0         cachem_1.0.8          polyclip_1.10-6       jsonlite_1.8.8        RhpcBLASctl_0.23-42   ica_1.0-3            
 [25] cluster_2.1.6         png_0.1-8             rgeos_0.5-9           uwot_0.2.2            spatstat.sparse_3.0-3 sctransform_0.4.1     ggforce_0.4.2         shiny_1.8.1.1        
 [33] compiler_4.2.0        httr_1.4.7            fastmap_1.1.1         lazyeval_0.2.2        cli_3.6.2             later_1.3.2           tweenr_2.0.3          htmltools_0.5.8.1    
 [41] tools_4.2.0           igraph_2.0.3          gtable_0.3.5          glue_1.7.0            reshape2_1.4.4        RANN_2.6.1            fastmatch_1.1-4       Rcpp_1.0.12          
 [49] scattermore_1.2       vctrs_0.6.5           ape_5.8               nlme_3.1-164          progressr_0.14.0      ggraph_2.2.1          lmtest_0.9-40         spatstat.random_3.2-3
 [57] stringr_1.5.1         globals_0.16.3        mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.4       irlba_2.3.5.1         phangorn_2.11.1       goftest_1.2-3        
 [65] future_1.33.2         MASS_7.3-57           zoo_1.8-12            scales_1.3.0          tidygraph_1.3.1       spatstat.core_2.4-4   spatstat.utils_3.0-4  promises_1.3.0       
 [73] parallel_4.2.0        RColorBrewer_1.1-3    pbapply_1.7-2         memoise_2.0.1         reticulate_1.36.1     gridExtra_2.3         ggplot2_3.5.1         ggfun_0.1.4          
 [81] yulab.utils_0.1.4     rpart_4.1.23          stringi_1.8.3         tidytree_0.4.6        rlang_1.1.3           pkgconfig_2.0.3       matrixStats_1.3.0     parallelDist_0.2.6   
 [89] lattice_0.22-6        tensor_1.5            ROCR_1.0-11           purrr_1.0.2           htmlwidgets_1.6.4     treeio_1.20.0         patchwork_1.2.0       cowplot_1.1.3        
 [97] tidyselect_1.2.1      parallelly_1.37.1     RcppAnnoy_0.0.22      plyr_1.8.9            logger_0.3.0          magrittr_2.0.3        R6_2.5.1              generics_0.1.3       
[105] DBI_1.2.2             mgcv_1.9-1            pillar_1.9.0          withr_3.0.0           fitdistrplus_1.1-11   abind_1.4-5           survival_3.6-4        tibble_3.2.1         
[113] future.apply_1.11.2   KernSmooth_2.23-22    utf8_1.2.4            spatstat.geom_3.2-9   plotly_4.10.4         viridis_0.6.5         grid_4.2.0            digest_0.6.35        
[121] xtable_1.8-4          tidyr_1.3.1           httpuv_1.6.15         gridGraphics_0.5-1    RcppParallel_5.1.7    munsell_0.5.1         viridisLite_0.4.2     ggplotify_0.1.2      
[129] quadprog_1.5-8       

This happens with all the samples I have run so far (about 20). I am just attaching the output of one sample as a reference. The samples have anywhere from 6000 - 22000 cells. For example here is another sample's std err and log:

log.txt
Numbat.AOCS_060_2_9.Step2_run_numbat.19079835.papr-res-compute02.err.txt
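
For context, the jobs call run_numbat roughly like the sketch below (input objects and paths are placeholders rather than my exact script; ncores matches the 16 cores requested per job):

```r
library(numbat)

# Placeholder inputs: gene x cell counts and the allele dataframe from the
# pileup/phasing preprocessing step (paths are hypothetical).
count_mat <- readRDS("AOCS_055_2_0_counts.rds")
df_allele <- data.table::fread("AOCS_055_2_0_allele_counts.tsv.gz")

out <- run_numbat(
    count_mat,
    lambdas_ref = ref_hca,      # default built-in HCA expression reference
    df_allele   = df_allele,
    genome      = "hg38",
    ncores      = 16,           # matches the 16 cores requested per job
    out_dir     = "numbat_out/AOCS_055_2_0"  # hypothetical output path
)
```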

I'm not sure whether all of this is normal behaviour of the tool or whether something is wrong.

Thanks so much,
Ahwan

@teng-gao
Collaborator

Hmm. The error message below is suspicious. I would guess it's a problem related to general memory management on your jobs/cluster.

slurmstepd: error: Detected 167 oom_kill events in StepId=19079835.batch. Some of the step tasks have been OOM Killed.

@ahwanpandey
Author

Hi @teng-gao. Thanks for the reply. I will try to run one sample with just 1 thread/core and see what that looks like. Is there anything in particular you think I could ask the cluster folks about regarding their memory management? Numbat is the only software/tool I have used where I have seen this type of error on our cluster.

Thanks,
Ahwan

@ahwanpandey
Author

To be clear, I have had segfaults and out-of-memory issues before that were fixed by providing more memory, but this seems different. Also, the memory utilised in the job status is far too high for all the jobs:

Job ID: 19079833
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:19:41
CPU Efficiency: 53.46% of 2-18:04:48 core-walltime
Job Wall-clock time: 04:07:48
Memory Utilized: 470.93 GB
Memory Efficiency: 470.93% of 100.00 GB

Below are some jobs with their States and Memory Utilised. Strangely, one of them says State: COMPLETED and its std err has none of the errors mentioned above, even though it is also using a lot more memory than I asked for.

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 692.82 GB
Memory Efficiency: 692.82% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 724.67 GB
Memory Efficiency: 724.67% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 433.26 GB
Memory Efficiency: 433.26% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 488.84 GB
Memory Efficiency: 488.84% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 594.87 GB
Memory Efficiency: 594.87% of 100.00 GB

State: COMPLETED (exit code 0)
Memory Utilized: 301.44 GB
Memory Efficiency: 301.44% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 561.10 GB
Memory Efficiency: 561.10% of 100.00 GB

@ahwanpandey
Author

ahwanpandey commented May 16, 2024

OK, running with just one thread has no issues. Note that I am just using the default "run_numbat" with "ref_hca" as the reference, but I will also try a custom reference. The results are vastly different from the multi-threaded run, which probably makes sense since a lot of the threads were internally killed by Slurm.

Do you think Numbat could benefit from some form of error handling for these multi-threaded memory issues, so that a run doesn't look like it completed successfully when it didn't? I was testing initially in an interactive session and didn't realise this was happening in the background. There was no hint in the R terminal about memory issues or threads being killed; the output folder just looks like everything completed without issues, until I submitted the script as a job and checked the std err. I'm not saying this is happening to others, but it is possible some users have this happening without their knowledge, hence the suggestion that Numbat notify the user or error out.

But again I might be totally wrong and this could just be a very specific issue with the cluster I am using! I'll talk to the cluster admins about this but would love to hear if you have any specific thoughts on what they could look at as a start.

single thread stats and results/logs/err for a sample

Job ID: 19083767
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 14:47:00
CPU Efficiency: 99.99% of 14:47:06 core-walltime
Job Wall-clock time: 14:47:06
Memory Utilized: 32.64 GB
Memory Efficiency: 32.64% of 100.00 GB

bulk_clones_final.png
log.txt
Numbat.AOCS_080_2_2.Step2_run_numbat.19083767.papr-res-compute215.err.txt

multi-thread stats and results/logs/err for the same sample

Job ID: 19079838
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 11:45:45
CPU Efficiency: 43.14% of 1-03:16:00 core-walltime
Job Wall-clock time: 01:42:15
Memory Utilized: 366.78 GB
Memory Efficiency: 366.78% of 100.00 GB

bulk_clones_final.png
log.txt
Numbat.AOCS_080_2_2.Step2_run_numbat.19079838.papr-res-compute215.err.txt

@ahwanpandey
Author

I had a chat with our cluster admin and just wanted to share some thoughts with you.

It seems Numbat's memory usage behaves as follows:

  • A sample completes successfully with a peak memory usage of ~32 GB when run with 1 thread
  • The same sample needs roughly 32 GB × 16 = 512 GB when run with 16 threads

Am I understanding this right?
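
Back-of-the-envelope, that lines up with what the multi-threaded jobs report:

```r
per_thread_peak_gb <- 32   # peak memory of the successful single-thread run above
threads            <- 16
per_thread_peak_gb * threads
#> [1] 512   # roughly in line with the 366-725 GB reported by the 16-thread jobs
```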

It seems to be a similar thing to the one being described below:
MonashBioinformaticsPlatform/RNAsik-pipe#39

So if the above is correct, do you think Numbat should stop the run if any thread fails, and then exit with an overall error exit code? Something like a consensus exit code: 0 if all threads succeeded, non-zero otherwise. And also write some indication in the log file that an error occurred during the run?
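
Just to illustrate the kind of check I mean, here is a generic sketch of a "consensus exit" after a parallel step (not Numbat's actual code; chunks, process_chunk and ncores are placeholders):

```r
# mclapply() returns NULL (with a warning) for workers that die, e.g. when the
# kernel OOM-kills a forked process, and a "try-error" object when a worker errors.
results <- parallel::mclapply(chunks, process_chunk, mc.cores = ncores)

failed <- vapply(
    results,
    function(r) is.null(r) || inherits(r, "try-error"),
    logical(1)
)

if (any(failed)) {
    message(sum(failed), " of ", length(failed), " workers failed (possibly OOM-killed)")
    quit(save = "no", status = 1)  # non-zero exit so the scheduler marks the job as failed
}
```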

Thanks!
Ahwan

@ahwanpandey
Author

OK, so running with 4 threads and allocating 160 GB let me run the Numbat jobs successfully; I checked the std err for each and there are no memory issues. Also, the SLURM config on our cluster is set up to allow a job to go a little over its memory request, depending on the requests/usage of other jobs on the node. Using 16 threads goes way over and starts killing threads, as mentioned in the original issue. (A sketch of the call I'm now using follows the job summaries below.)

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 210.72 GB
Memory Efficiency: 131.70% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 203.70 GB
Memory Efficiency: 127.31% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 170.53 GB
Memory Efficiency: 106.58% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 181.04 GB
Memory Efficiency: 113.15% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 159.93 GB
Memory Efficiency: 99.96% of 160.00 GB
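
For reference, the only change to the run_numbat call from my earlier sketch is the thread count; each Slurm job now requests 160 GB, i.e. roughly 40 GB per thread:

```r
# Same placeholder inputs as before; only ncores (and the Slurm memory request) changed.
out <- run_numbat(
    count_mat,
    lambdas_ref = ref_hca,
    df_allele   = df_allele,
    genome      = "hg38",
    ncores      = 4,                          # 4 threads, ~40 GB of the 160 GB request each
    out_dir     = "numbat_out/AOCS_055_2_0"   # hypothetical output path
)
```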
