RAM usage by converting from sparse to dense #29

Closed

chbeltz opened this issue Jun 20, 2022 · 8 comments

chbeltz commented Jun 20, 2022

Is there a reason why the input data is converted to an np.array rather than accepted as a sparse matrix when running .train? Skimming the rest of the code, I cannot find anything that would not also work with sparse matrices. I am asking because this conversion to a dense array appears to be the reason I frequently run out of RAM when working with larger datasets.

Thanks

prete (Collaborator) commented Jun 20, 2022

Hi @chbeltz, thank you for using CellTypist!

I think the issue you're seeing has to do with the scaling technique that's used. Have a look at classifier.py > celltype, particularly at this:

        logger.info(f"⚖️ Scaling input data")
        means_ = self.model.scaler.mean_[lr_idx]
        sds_ = self.model.scaler.scale_[lr_idx]
        self.indata = (self.indata[:, k_x_idx] - means_) / sds_
        self.indata[self.indata > 10] = 10

Unfortunately, self.indata[:, k_x_idx] - means_ densifies the matrix, which is most likely why you run out of RAM with large datasets.
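
For illustration (not CellTypist code), a minimal sketch of the densification with scipy: subtracting a dense vector from a sparse matrix falls back to a fully dense result.

    import numpy as np
    from scipy import sparse

    # Toy expression matrix: mostly zeros, stored as CSR
    X = sparse.random(10_000, 2_000, density=0.05, format="csr", dtype=np.float32)
    means = np.asarray(X.mean(axis=0), dtype=np.float32).ravel()  # per-gene means, shape (2000,)

    # Subtracting a dense vector turns every implicit zero into a stored value,
    # so scipy returns a dense matrix the size of the whole array.
    X_centered = X - means
    print(type(X_centered))               # <class 'numpy.matrix'> (dense)
    print(X_centered.nbytes / 1e6, "MB")  # ~80 MB here; far more for real datasets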

chbeltz commented Jun 20, 2022

Ah, I wasn't aware that sparse matrices were densified upon subtraction of a vector. That's unfortunate.

Thanks!

ChuanXu1 (Collaborator) commented

@chbeltz, adding to this point: during training, scaling will also densify the matrix. You can skip the feature selection step, which otherwise runs on all genes present in your data. That is, subset your data to a set of useful genes (e.g., HVGs) beforehand, and disable the feature selection and expression check options during training; this will also reduce RAM consumption.
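
A rough sketch of that workflow, assuming an AnnData input that is already normalised and log1p-transformed, with the training labels in a hypothetical .obs column cell_type (the scanpy HVG call is just one way to pick the gene subset):

    import scanpy as sc
    import celltypist

    adata = sc.read_h5ad("my_data.h5ad")   # hypothetical input file

    # Subset to a manageable set of informative genes up front (HVGs are one choice)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata_hvg = adata[:, adata.var.highly_variable].copy()

    # Train on the reduced, still-sparse matrix and skip the steps mentioned above
    model = celltypist.train(
        adata_hvg,
        labels="cell_type",        # column in adata.obs holding the training labels
        check_expression=False,    # skip the expression check
        feature_selection=False,   # skip CellTypist's own feature selection pass
    )
    model.write("model_hvg.pkl")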

chbeltz commented Jun 21, 2022

Could you imagine implementing a use_sparse switch that preserves sparsity by skipping the mean subtraction during scaling? Same principle as the with_mean option of sklearn.preprocessing.StandardScaler.
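
For reference, a small demo (not CellTypist code) of how that sklearn option behaves on sparse input:

    from scipy import sparse
    from sklearn.preprocessing import StandardScaler

    X = sparse.random(1_000, 500, density=0.05, format="csr")

    # The default (with_mean=True) refuses sparse input, because centering would densify it
    try:
        StandardScaler().fit_transform(X)
    except ValueError as err:
        print(err)

    # Scaling by the standard deviation only keeps the CSR structure intact
    X_scaled = StandardScaler(with_mean=False).fit_transform(X)
    print(type(X_scaled), X_scaled.nnz)   # still sparse, same sparsity pattern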

ChuanXu1 (Collaborator) commented

@chbeltz, for SGD logistic regression, the sklearn documentation says: "For best results using the default learning rate schedule, the data should have zero mean and unit variance." Skipping the mean subtraction may therefore not be good practice, wdyt?

chbeltz commented Jun 26, 2022

@ChuanXu1 I have not been able to find much empirical data on how non-zero-centered input distributions affect the performance of SGD, so I'm having a hard time weighing the pros and cons. However, if the alternative is that people with limited computing resources decide not to use the software at all, it may be preferable to offer an option that gives less than optimal results, but results nonetheless.

ChuanXu1 (Collaborator) commented

@chbeltz, that sounds reasonable. I have added these changes (a with_mean parameter in celltypist.train) to optimize RAM usage during training, at a possible cost of reduced performance: dfb11e0

This parameter will be available in the next version of CellTypist. Thx!
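
For reference, usage would look something like this (reusing the names from the HVG sketch above; treat it as a sketch until the release):

    import celltypist

    # with_mean=False skips the mean subtraction during scaling, so a sparse input
    # stays sparse; this trades some potential accuracy for lower RAM usage.
    model = celltypist.train(
        adata_hvg,                 # sparse, HVG-subset AnnData from the earlier sketch
        labels="cell_type",
        with_mean=False,
        check_expression=False,
        feature_selection=False,
    )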

chbeltz commented Jun 30, 2022

Much appreciated, thank you!!

chbeltz closed this as completed Jun 30, 2022