
Clustering datasets with >50 million cells #185

Closed
AbbeyFigliomeni opened this issue Mar 5, 2024 · 10 comments
@AbbeyFigliomeni

Hi,

Does anyone have experience clustering datasets in excess of 50 million cells? Mine has 59 million cells (an average of 1 million per person), and I keep getting the error "cannot allocate vector of size xx Gb/Mb". FYI, the dataset contains 14 clustering markers of interest.

I understand the FlowSOM algorithm is not designed to handle datasets this large, but I would prefer not to subset my data prior to clustering, to avoid any loss of data or effects due to random sampling.

Any suggestions would be greatly appreciated! :)

@AbbeyFigliomeni
Author

FYI, this data has already been pre-processed, with dead cells, doublets, and debris excluded.

@tomashhurst
Member

tomashhurst commented Mar 5, 2024 via email

@AbbeyFigliomeni
Author

Hi Tom,
Thanks for your super prompt response!
The issue appears when running the "run.flowsom" command. I use a script that extracts my data directly from my FlowJo workspace using CytoML, which I have used many times in the past without issues. I also just ran the exact same dataset using the same script, but with my GatingSet object (CytoML) subsetted on a downstream gate containing far fewer cells (I can provide code if it helps, but essentially the data has exactly the same parameter scaling/structure, just 800,000 events instead of 59 million), and it completed the clustering and dimensionality reduction without issues. Could it be my processor only being a Ryzen 5? This same computer took 10 hours to cluster a previous dataset with 40 million events. What are some work-arounds that have worked in the past?

@SamGG

SamGG commented Mar 5, 2024

@AbbeyFigliomeni Did you get a numerical value where the message says "xx"? Do you know how much RAM this computer has?
Alternatively, since I am using RStudio on my Windows 10 computer, the Environment tab shows a pie chart of the memory in use; if I click on it and request a memory usage report, I can see how much RAM is used by RStudio and how much is available on the computer.
Alternatively, the command sum(sapply(ls(), function(x) object.size(get(x))))/1024^3 reports the amount of RAM (in GiB) currently used by objects in the workspace.
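An expanded version of that check, which lists the largest objects first so you can see where the memory is going:

    # Size of every object in the global environment, largest first (in GiB)
    obj.sizes <- sapply(ls(), function(x) object.size(get(x)))
    sort(obj.sizes, decreasing = TRUE) / 1024^3
    sum(obj.sizes) / 1024^3   # total GiB held by R objects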
My two cents...

@AbbeyFigliomeni
Author

AbbeyFigliomeni commented Mar 5, 2024 via email

@RoryCostell

I have run into a similar issue with the spatial package, using the stars method to generate the polygons and outlines for quite large IMC images (2000 x 2000 pixels). The error shows "vector memory exhausted (limit reached?)".

I've used the fix described here: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached

R was using around 55 GB of RAM, so this fix extended the memory available to R, including virtual memory.
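For reference, that fix boils down to raising R's vector heap limit via the R_MAX_VSIZE variable in ~/.Renviron and restarting R; the 100Gb value below is only an example and should be chosen to suit your machine:

    # Append the limit to ~/.Renviron (macOS); R_MAX_VSIZE is read at startup,
    # so restart R/RStudio afterwards for it to take effect.
    cat("R_MAX_VSIZE=100Gb\n", file = "~/.Renviron", append = TRUE)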

Hope this helps.

@AbbeyFigliomeni
Author

Hi All,
Thanks for your feedback. I am working with the hypothesis that it is simply a consequence of my desktop having insufficient RAM (16 GB, only 15 GB of which is available to RStudio) to complete the task...
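For a rough sense of scale (assuming double-precision storage), a single copy of the raw expression matrix is already a sizeable fraction of that:

    # 59 million cells x 14 markers x 8 bytes per double
    59e6 * 14 * 8 / 1024^3   # ~6.2 GiB for one copy, before any working copies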

@ghar1821
Member

ghar1821 commented Mar 20, 2024

I just tested running FlowSOM on around 50 million cells on my Mac with 24 GB of RAM, and I ran into the same issue. I'll look into the run.flowsom function and see if I can reduce its memory usage.

In the meantime, if you don't have access to a computer with more RAM, as an alternative you can subsample the cells, cluster them, and map the rest onto the clusters (a rough sketch is below). This is not ideal, as the subsampling may miss some cell types and cause them to be merged into other cell types that were included in the subsample.
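A rough sketch of that subsample-then-map approach, using the FlowSOM package directly rather than the run.flowsom wrapper (object and column names such as cell.dat and marker.cols are placeholders):

    library(data.table)
    library(FlowSOM)

    set.seed(42)
    sub.idx <- sample(nrow(cell.dat), 2e6)   # e.g. train on 2 million cells

    # Train the SOM and metaclustering on the subsample only
    fsom <- FlowSOM(as.matrix(cell.dat[sub.idx, ..marker.cols]),
                    colsToUse = marker.cols, nClus = 20, seed = 42)

    # Map every cell (including those left out of training) onto the trained SOM
    fsom.all <- NewData(fsom, as.matrix(cell.dat[, ..marker.cols]))
    cell.dat[, cluster := GetClusters(fsom.all)]
    cell.dat[, metacluster := GetMetaclusters(fsom.all)]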

Or you can compress your data into supercells using SuperCellCyto (https://github.com/phipsonlab/SuperCellCyto) and run FlowSOM on those supercells (see the sketch below). Afterwards, you can expand the supercells back and assign each cell the cluster of the supercell it belongs to. Disclaimer: I'm the author of SuperCellCyto.
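A very rough sketch of the supercell route; the argument names below are illustrative, so please check them against the SuperCellCyto vignette:

    library(SuperCellCyto)

    # SuperCellCyto expects a cell ID column and a sample column in the data.table
    cell.dat[, cell_id := paste0("cell_", .I)]
    sc <- runSuperCellCyto(dt = cell.dat, markers = marker.cols,
                           sample_colname = "sample", cell_id_colname = "cell_id")
    # Cluster the (much smaller) supercell expression matrix with FlowSOM, then
    # propagate each supercell's cluster back to its member cells via the cell map.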

@AbbeyFigliomeni
Author

Hi all, thanks to everyone who weighed in re: the RAM issue, and to @ghar1821 for your helpful feedback.

Just an update for anyone who is interested or facing the same issue: I managed to substantially decrease the size of my data table by deleting all phenodata columns except patient identifiers, and by removing all other irrelevant objects from my workspace prior to clustering. I successfully clustered (although it took 4 hours on my trusty 16 GB machine!), and then re-added my phenodata columns. The rest of the workflow proceeded as normal.
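For anyone wanting to do the same, a minimal sketch of that trimming step, assuming a data.table called cell.dat and a vector of the 14 marker names called marker.cols:

    library(data.table)

    keep.cols <- c("PatientID", marker.cols)   # "PatientID" is a placeholder name
    cell.dat <- cell.dat[, ..keep.cols]        # drop all other phenodata columns
    rm(list = setdiff(ls(), c("cell.dat", "marker.cols", "keep.cols")))
    gc()                                       # release freed memory before clustering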

@tomashhurst
Member

@AbbeyFigliomeni nice solution! We'll keep this in mind for when it comes up in the future.
