Cell2Location on large datasets #356
Is there any way to manage memory usage on large datasets? For example, when you're approaching ~40,000 spots and ~10,000 genes, memory use becomes huge. Is there a way to train separate conditions and then somehow merge the results? I'm not sure if this would be a simple merge between anndata objects or a more involved process.
This is indeed a common problem. An approach we started using recently is splitting the data randomly (stratified by batch and possibly by anatomy annotation), training several models, then merging the results.
Is there any specific function you use to merge the data? Or is merging the anndata objects good enough? I'm not sure how I would merge several models, especially for some of the downstream analysis functions.
You can concatenate anndata. There are two important parts to consider - the cell abundance estimates stored in adata.obsm, which can simply be concatenated, and, as a technical note, the trained model summaries stored in adata.uns['mod'], which are kept separately for each batch rather than merged.
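As an illustration (not from the original reply), here is a minimal sketch of concatenating per-chunk results with anndata.concat, assuming each chunk has already been through export_posterior so its abundance estimates sit in .obsm; the chunk variable names are hypothetical:

```python
import anndata as ad

# hypothetical per-chunk AnnData objects, each already containing
# e.g. adata.obsm['q05_cell_abundance_w_sf'] from export_posterior
adata_chunks = [adata_chunk_0, adata_chunk_1]

# .obsm arrays with matching keys are concatenated together with X and .obs;
# .uns (including the per-chunk model summary in adata.uns['mod']) is not
# merged automatically and has to be carried over manually if needed
adata_merged = ad.concat(
    adata_chunks,
    join="outer",            # keep the union of genes across chunks
    label="training_batch",  # record which chunk each location came from
)
```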
Are there any improvements coming soon in cell2location to take in the full data? The approach of splitting into batches removes any spatial information that might be used for cell deconvolution. With the new VisiumHD becoming more prominent, the number of spots or bins can range anywhere from 150k to 650k+, so when I test with cell2location I always get memory errors, and doing it in batches also doesn't work. Are there any near-future plans to support datasets such as VisiumHD?
Hi, here is the batched workflow we use:
Split the object into batches

```python
import numpy as np
import scvi
import cell2location

# for every "sample", sample with replacement from chunks to allocate
# some locations from each batch to all training batches
chunk_size = 72_000
chunks = [i for i in range(int(np.ceil(adata_vis.n_obs / chunk_size)))]

adata_vis.obs['training_batch'] = 0
for sample in adata_vis.obs['sample'].unique():
    ind = adata_vis.obs['sample'].isin([sample])
    adata_vis.obs.loc[ind, 'training_batch'] = np.random.choice(
        chunks, size=ind.sum(), replace=True, p=None
    )

# keep the full object and pre-allocate arrays for the merged cell abundance
# estimates (keys must match add_to_obsm in export_posterior below)
adata_vis_full = adata_vis.copy()
for k in ['q05', 'q50']:
    adata_vis_full.obsm[f"{k}_cell_abundance_w_sf"] = np.zeros((adata_vis_full.n_obs, inf_aver.shape[1]))

# check how many locations ended up in each training batch
adata_vis.obs['training_batch'].value_counts()
```

Train

```python
seed = 0
scvi.settings.seed = seed
np.random.seed(seed)

# submit this chunk as separate jobs
for batch in adata_vis.obs['training_batch'].unique():
    # create and train the model
    scvi_run_name = f'{run_name_global}_batch{batch}_seed{seed}'
    print(scvi_run_name)
    training_batch_index = adata_vis_full.obs['training_batch'].isin([batch])
    adata_vis = adata_vis_full[training_batch_index, :].copy()

    # prepare anndata for scVI model
    cell2location.models.Cell2location.setup_anndata(
        adata=adata_vis, batch_key="sample"
    )

    # train as normal
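    # illustrative sketch, not part of the original comment: the standard
    # model creation and training step from the cell2location tutorial;
    # N_cells_per_location, detection_alpha and max_epochs are assumed
    # example values and should be chosen for your tissue and dataset
    mod = cell2location.models.Cell2location(
        adata_vis, cell_state_df=inf_aver,
        N_cells_per_location=30,
        detection_alpha=20,
    )
    mod.train(max_epochs=30000, batch_size=None, train_size=1)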
    # export posterior
    import pyro
    # In this section, we export the estimated cell abundance (summary of the posterior distribution).
    adata_vis = mod.export_posterior(
        adata_vis, sample_kwargs={
            'batch_size': int(np.ceil(adata_vis.n_obs / 4)), 'accelerator': 'gpu',
            "return_observed": False,
        },
        add_to_obsm=['q05', 'q50'],
        use_quantiles=True,
    )
```

Complete the full object with cell abundance estimates from batched analysis

```python
# run this for each trained batch (using that batch's adata_vis, batch and training_batch_index)
# copy cell2location results to the main object
for k in adata_vis_full.obsm.keys():
    adata_vis_full.obsm[k][training_batch_index, :] = adata_vis.obsm[k].copy()
adata_vis_full.uns[f'mod_{batch}'] = adata_vis.uns['mod'].copy()
```
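As a possible follow-up (not part of the original comment), the merged abundance estimates can be exposed as plain obs columns for plotting; this assumes the cell type names are taken from one of the per-batch model summaries stored above (e.g. 'mod_0'), which share the same factor names:

```python
# hypothetical downstream step: add the merged 5% quantile cell abundances as
# obs columns so that standard scanpy/squidpy plotting functions can use them
factor_names = adata_vis_full.uns['mod_0']['factor_names']
adata_vis_full.obs[factor_names] = adata_vis_full.obsm['q05_cell_abundance_w_sf']
```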
I have a cohort of 10x Visium slides that gives 20k spots and 10k genes in total (after filtering). Would you recommend using the individual slides as batches and simplifying the above approach for this special use case?
You should be able to analyse this on one A100 40GB GPU.
Hi,
it doesn't work - it still uses the GPU. I also tried setting "use_gpu=False", but an error was raised. What can I try if I want to use cell2location (1.2.0)? Thanks!
Cell2location doesn't use theano, and your code suggests that you are using the older version that did (2019). I would recommend using the latest version.

```python
mod.train(
    ...
    device='cpu'
)
```
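Depending on the installed scvi-tools backend, the Lightning-style accelerator argument may be the relevant one instead; a minimal hedged sketch, assuming a recent cell2location/scvi-tools release:

```python
# force CPU training; argument names follow the current scvi-tools convention,
# while older releases used use_gpu - check the version you have installed
mod.train(
    max_epochs=30000,  # example value only
    accelerator="cpu",
)
```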
Thank you!
I would recommend getting access to GPUs with more memory (e.g. on various cloud platforms), because training on CPU is much slower and also more expensive: you need a lot of CPU resources for a very long time.