
Clustering datasets with >50 million cells #185

Closed
AbbeyFigliomeni opened this issue Mar 5, 2024 · 10 comments
@AbbeyFigliomeni

Hi,

Does anyone have experience clustering datasets in excess of 50 million cells? Mine has 59 million cells (an average of 1 million per person), and I keep getting the error "cannot allocate vector of size xx Gb/Mb". FYI, the dataset contains 14 clustering markers of interest.

I understand the FlowSOM algorithm is not designed to handle datasets this large, but I would prefer not to subset my data prior to clustering, to avoid any loss of data or effects due to random sampling.

Any suggestions would be greatly appreciated! :)

@AbbeyFigliomeni
Author

FYI, this data has already been pre-processed, with dead cells, doublets, and debris excluded.

@tomashhurst
Member

tomashhurst commented Mar 5, 2024 via email

@AbbeyFigliomeni
Author

Hi Tom,
Thanks for your super prompt response!
The issue appears when running the "run.flowsom" command. I use a script that extracts my data directly from my FlowJo workspace using CytoML, which I have used many times in the past without issues. I also just ran the exact same dataset using the same script, but with my GatingSet object (CytoML) subsetted on a downstream gate containing far fewer cells (I can provide code if it helps, but essentially the data has exactly the same parameter scaling/structure, just 800,000 events instead of 59 million), and it completed the clustering and dimensionality reduction without issues. Could it be my processor only being a Ryzen 5? This same computer took 10 hours to cluster a previous dataset with 40 million events. What are some work-arounds that have worked in the past?

@SamGG

SamGG commented Mar 5, 2024

@AbbeyFigliomeni Did you get a numerical value where the message says "xx"? Do you know how much RAM this computer has?
Alternatively, since I am using RStudio on my Windows 10 computer, the Environment tab shows a pie chart of the memory in use; if I click on it and request a memory usage report, I can see how much RAM is used by RStudio and how much is available on the computer.
Alternatively, the command sum(sapply(ls(), function(x) object.size(get(x))))/1024^3 reports the amount of RAM (in GiB) currently used by objects in the workspace.
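An expanded version of that check, which lists the largest objects first so you can see where the memory is going:

    # Size of every object in the global environment, largest first (in GiB)
    obj.sizes <- sapply(ls(), function(x) object.size(get(x)))
    sort(obj.sizes, decreasing = TRUE) / 1024^3
    sum(obj.sizes) / 1024^3   # total GiB held by R objects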
My two cents...

@AbbeyFigliomeni
Author

AbbeyFigliomeni commented Mar 5, 2024 via email

@RoryCostell

I have run into a similar issue with the spatial package, using the stars method to generate the polygons and outlines for quite large IMC images (2000 x 2000 pixels). The error shows "vector memory exhausted (limit reached?)".

I've used the fix described here: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached

R was using around 55 GB of RAM, so this fix extended the memory available to R, including virtual memory.
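For reference, that fix boils down to raising R's vector heap limit via the R_MAX_VSIZE variable in ~/.Renviron and restarting R; the 100Gb value below is only an example and should be chosen to suit your machine:

    # Append the limit to ~/.Renviron (macOS); R_MAX_VSIZE is read at startup,
    # so restart R/RStudio afterwards for it to take effect.
    cat("R_MAX_VSIZE=100Gb\n", file = "~/.Renviron", append = TRUE)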

Hope this helps.

@AbbeyFigliomeni
Author

Hi All,
Thanks for your feedback. I am working with the hypothesis that it is simply a consequence of my desktop having insufficient RAM (16 GB, only 15 GB of which is available to RStudio) to complete the task...
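For a rough sense of scale (assuming double-precision storage), a single copy of the raw expression matrix is already a sizeable fraction of that:

    # 59 million cells x 14 markers x 8 bytes per double
    59e6 * 14 * 8 / 1024^3   # ~6.2 GiB for one copy, before any working copies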

@ghar1821
Member

ghar1821 commented Mar 20, 2024

I just tested running FlowSOM on around 50 million cells on my Mac with 24 GB of RAM, and I ran into the same issue. I'll look into the run.flowsom function and see if I can reduce its memory usage.

In the meantime, if you don't have access to a computer with more RAM, as an alternative you can subsample the cells, cluster them, and map the rest onto the clusters (a rough sketch is below). This is not ideal, as the subsampling may miss some cell types and cause them to be merged into other cell types that were included in the subsample.
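A rough sketch of that subsample-then-map approach, using the FlowSOM package directly rather than the run.flowsom wrapper (object and column names such as cell.dat and marker.cols are placeholders):

    library(data.table)
    library(FlowSOM)

    set.seed(42)
    sub.idx <- sample(nrow(cell.dat), 2e6)   # e.g. train on 2 million cells

    # Train the SOM and metaclustering on the subsample only
    fsom <- FlowSOM(as.matrix(cell.dat[sub.idx, ..marker.cols]),
                    colsToUse = marker.cols, nClus = 20, seed = 42)

    # Map every cell (including those left out of training) onto the trained SOM
    fsom.all <- NewData(fsom, as.matrix(cell.dat[, ..marker.cols]))
    cell.dat[, cluster := GetClusters(fsom.all)]
    cell.dat[, metacluster := GetMetaclusters(fsom.all)]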

Or you can compress your data into supercells using SuperCellCyto (https://github.com/phipsonlab/SuperCellCyto) and run FlowSOM on those supercells (see the sketch below). Afterwards, you can expand the supercells back and assign each cell the cluster of the supercell it belongs to. Disclaimer: I'm the author of SuperCellCyto.
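A very rough sketch of the supercell route; the argument names below are illustrative, so please check them against the SuperCellCyto vignette:

    library(SuperCellCyto)

    # SuperCellCyto expects a cell ID column and a sample column in the data.table
    cell.dat[, cell_id := paste0("cell_", .I)]
    sc <- runSuperCellCyto(dt = cell.dat, markers = marker.cols,
                           sample_colname = "sample", cell_id_colname = "cell_id")
    # Cluster the (much smaller) supercell expression matrix with FlowSOM, then
    # propagate each supercell's cluster back to its member cells via the cell map.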

@AbbeyFigliomeni
Author

Hi all, thanks to everyone who weighed in re: the RAM issue, and to @ghar1821 for your helpful feedback.

Just an update for anyone who is interested or facing the same issue: I managed to substantially decrease the size of my data table by deleting all phenodata columns except patient identifiers, and by removing all other irrelevant objects from my workspace prior to clustering. I successfully clustered (although it took 4 hours on my trusty 16 GB machine!), and then re-added my phenodata columns. The rest of the workflow proceeded as normal.
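For anyone wanting to do the same, a minimal sketch of that trimming step, assuming a data.table called cell.dat and a vector of the 14 marker names called marker.cols:

    library(data.table)

    keep.cols <- c("PatientID", marker.cols)   # "PatientID" is a placeholder name
    cell.dat <- cell.dat[, ..keep.cols]        # drop all other phenodata columns
    rm(list = setdiff(ls(), c("cell.dat", "marker.cols", "keep.cols")))
    gc()                                       # release freed memory before clustering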

@tomashhurst
Member

@AbbeyFigliomeni nice solution! We'll keep this in mind for when it comes up in the future.
