Cell2Location on large datasets #356

LinearParadox · 2024-03-19T23:25:14Z

Is there any way to manage memory usage on large datasets? For example, when you're approaching ~40000 spots and ~10000 genes, memory use becomes huge. Is there a way to train separate conditions and then somehow merge the results? I'm not sure if this would be a simple merge between anndata objects or if it's a more involved process.
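For a rough sense of scale, the sketch below computes the size of the count matrix alone if it were stored densely as float32. This is only a lower bound under that assumption: actual peak usage during training is higher, because the model also holds per-location parameters, gradients and optimizer state. The helper name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(n_obs, n_vars, bytes_per_value=4):
    """Size of a dense n_obs x n_vars matrix in GiB (float32 by default)."""
    return n_obs * n_vars * bytes_per_value / 1024**3


# the dataset size mentioned in the question: ~40000 spots x ~10000 genes
size = dense_matrix_gib(40_000, 10_000)
print(f"dense float32 count matrix: {size:.2f} GiB")
```

The size grows linearly with the number of locations, which is why splitting the locations into several training batches reduces the memory needed per model.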

vitkl · 2024-03-20T00:40:52Z

This is indeed a common problem. An approach we started using recently is splitting the data randomly (stratified by batch and possibly by anatomy annotation), training several models, then merging results.
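A minimal sketch of such a stratified random split, using only the Python standard library. The function name `assign_training_batches` and the toy slide labels are made up for illustration. This variant shuffles each stratum and deals locations out round-robin so that training batches are balanced within each stratum; that is one possible choice, not necessarily the scheme used by the cell2location authors.

```python
import random
from collections import Counter, defaultdict


def assign_training_batches(strata, n_batches, seed=0):
    """Return one training-batch label per location, balanced within each stratum."""
    rng = random.Random(seed)

    # group location indices by stratum (e.g. slide / experimental batch)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)

    assignment = [None] * len(strata)
    for indices in by_stratum.values():
        rng.shuffle(indices)                   # random order within the stratum
        for pos, idx in enumerate(indices):
            assignment[idx] = pos % n_batches  # deal out round-robin
    return assignment


# toy example: three slides of unequal size, split into two training batches
strata = ["slide_A"] * 7 + ["slide_B"] * 5 + ["slide_C"] * 4
batches = assign_training_batches(strata, n_batches=2)

# count locations per (slide, training batch) pair
per_batch = Counter(zip(strata, batches))
print(sorted(per_batch.items()))
```

Because every training batch receives locations from every slide, each model still sees all experimental batches during training.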

LinearParadox · 2024-03-20T00:43:45Z

Is there any specific function you use to merge the data? Or is merging the anndata objects good enough? I'm not sure how I would merge several models, especially for some of the downstream analysis functions.

vitkl · 2024-03-20T04:51:03Z

You can concatenate the anndata objects. There are two important parts to consider: cell abundance in adata.obsm, and all other model parameters (plus an essential record of the input data) in adata.uns['mod']. Merging adata.uns['mod'] correctly is more complex, but you can save the per-model records in separate adata.uns slots.

As a technical note, merging adata.uns['mod'] would require some way to properly combine location-independent parameters, which is straightforward for technical gene-specific effects but can be less straightforward for the prior factorisation of cell abundance.

Rafael-Silva-Oliveira · 2024-04-16T08:30:22Z

> This is indeed a common problem. An approach we started using recently is splitting the data randomly (stratified by batch and possibly by anatomy annotation), training several models, then merging results.

Are there any improvements coming soon in cell2location to take in the full data? The approach of splitting into batches removes any spatial information that might be used for cell deconvolution.

With the new VisiumHD becoming more prominent, the number of spots or bins can range anywhere from 150k to 650k+, so when I test with cell2location, I always get memory errors.

And running it in batches also doesn't work:

Are there any near-future plans to support datasets such as VisiumHD?

Li-ZhiD · 2024-05-30T02:12:49Z

Hi,
I am using cell2location to deconvolve large Stereo-seq data.
1. Could you show how to merge the models trained on split spatial data?
2. How much are the results affected by reducing max_epochs from 30000 (which takes a long time in CPU mode) to 300 when training on the spatial data?
Thank you!

vitkl · 2024-07-13T16:45:15Z

@Li-ZhiD @Dillon214

Split the object into batches

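import numpy as np

# For every "sample", draw a chunk label for each location uniformly at random
# (with replacement), so that every training batch receives some locations
# from every sample.
chunk_size = 72_000
chunks = list(range(int(np.ceil(adata_vis.n_obs / chunk_size))))

adata_vis.obs['training_batch'] = 0
for sample in adata_vis.obs['sample'].unique():
    ind = adata_vis.obs['sample'].isin([sample])
    adata_vis.obs.loc[ind, 'training_batch'] = np.random.choice(
        chunks, size=ind.sum(), replace=True, p=None
    )

# Keep the full object and pre-allocate the output arrays.
# The keys must match add_to_obsm used in export_posterior (q05 and q50),
# otherwise the exported estimates are not copied back to the full object.
# inf_aver holds the reference signatures (its columns are the cell types).
adata_vis_full = adata_vis.copy()
for k in ['q05', 'q50']:
    adata_vis_full.obsm[f"{k}_cell_abundance_w_sf"] = np.zeros((adata_vis_full.n_obs, inf_aver.shape[1]))

# check the number of locations per training batch
adata_vis.obs['training_batch'].value_counts()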

Train

import cell2location
import scvi

seed = 0
scvi.settings.seed = seed
np.random.seed(seed)

# each iteration of this loop can be submitted as a separate job
for batch in adata_vis_full.obs['training_batch'].unique():
    # name of this run (run_name_global is defined by the user)
    scvi_run_name = f'{run_name_global}_batch{batch}_seed{seed}'
    print(scvi_run_name)

    # select the locations assigned to this training batch
    training_batch_index = adata_vis_full.obs['training_batch'].isin([batch])
    adata_vis = adata_vis_full[training_batch_index, :].copy()

    # prepare anndata for the cell2location model
    cell2location.models.Cell2location.setup_anndata(
        adata=adata_vis, batch_key="sample"
    )

    # create and train the model as normal (not shown here);
    # this step must define `mod`

    # export the estimated cell abundance (summary of the posterior distribution)
    adata_vis = mod.export_posterior(
        adata_vis, sample_kwargs={
            'batch_size': int(np.ceil(adata_vis.n_obs / 4)), 'accelerator': 'gpu',
            "return_observed": False,
        },
        add_to_obsm=['q05', 'q50'],
        use_quantiles=True,
    )

Complete the full object with cell abundance estimates from batched analysis

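    # copy cell2location results to the main object
    # (still inside the loop over training batches; if the batches were run
    # as separate jobs, load each saved batch object before this step)
    for k in adata_vis_full.obsm.keys():
        # write this batch's rows into the matching rows of the full array
        adata_vis_full.obsm[k][training_batch_index.values, :] = adata_vis.obsm[k].copy()
    # keep the model parameters of each batch in a separate uns slot
    adata_vis_full.uns[f'mod_{batch}'] = adata_vis.uns['mod'].copy()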
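As a sanity check on the copy-back step above, this toy NumPy sketch (no cell2location involved; the array shapes and the `fake_fit` stand-in are made up for illustration) shows that writing each batch's result into a pre-allocated array through its boolean mask fills every row exactly once and preserves the original row order.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_cell_types, n_batches = 10, 3, 2

# random training-batch label per location, as in the splitting step
training_batch = rng.integers(0, n_batches, size=n_obs)


def fake_fit(row_ids):
    """Stand-in for per-location estimates: row i is filled with the value i + 1."""
    return np.repeat((row_ids + 1)[:, None], n_cell_types, axis=1).astype(float)


full = np.zeros((n_obs, n_cell_types))  # pre-allocated output array
filled = np.zeros(n_obs, dtype=int)     # how many times each row was written

for batch in np.unique(training_batch):
    mask = training_batch == batch
    row_ids = np.flatnonzero(mask)
    full[mask, :] = fake_fit(row_ids)   # same pattern as obsm[k][mask, :] = ...
    filled[mask] += 1

# every location was written exactly once, in its original position
assert (filled == 1).all()
assert np.array_equal(full[:, 0], np.arange(1, n_obs + 1))
```

The same two checks (each row written once, no rows left at zero) can be applied to the real per-batch masks before relying on the merged abundance estimates.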

floriankreten · 2024-09-20T14:21:54Z

I have a cohort of 10x Visium slides that give 20k spots and 10k genes in total (after filtering). Would you recommend using the individual slides as batches and simplifying the above approach for this special use case?

vitkl · 2024-10-06T04:24:25Z

> 20k spots and 10k genes in total

You should be able to analyse this on one A100 40GB GPU.

Li-ZhiD · 2024-10-14T08:27:23Z

Hi,
My GPU has only 12 GB of memory, which may fall far short of what is required for a Stereo-seq chip with 100K spots.
I tried CPU mode by setting

os.environ["THEANO_FLAGS"] = 'device=cpu,floatX=float32,openmp=True,force_device=True'

but it doesn't work; the GPU is still used.

I also tried setting use_gpu=False:

mod.train(max_epochs=30000,
          # train using full data (batch_size=None)
          batch_size=None,
          # use all data points in training because
          # we need to estimate cell abundance at all locations
          train_size=1,
          use_gpu=False
         )

This raised an error:

Trainer.__init__() got an unexpected keyword argument 'use_gpu'

What can I try if I want to use cell2location (1.2.0)? Thanks!

vitkl · 2024-10-31T12:36:44Z

Cell2location doesn't use theano, and your code suggests that you are using the older version that did (2019). I would recommend using the latest version.

mod.train(
          ...
          device='cpu'
         )

Li-ZhiD · 2024-10-31T12:49:48Z

Thank you!

vitkl · 2024-10-31T14:02:40Z

I would recommend getting access to GPUs with more memory (e.g. on various cloud platforms), because training on CPU is much slower and also more expensive: you need a lot of CPU resources for a very long time.

LinearParadox added the enhancement (New feature or request) label on Mar 19, 2024

Rafael-Silva-Oliveira mentioned this issue on Apr 16, 2024: Compatibility with VisiumHD #358 (Closed)

vitkl mentioned this issue on May 29, 2024: "OutOfMemoryError: CUDA out of memory." in CPU mode? #366 (Open)

vitkl mentioned this issue on Jul 13, 2024: Performance issues on a large (120724) location dataset #375 (Open)