RAM usage from converting sparse to dense #29
Comments
Hi @chbeltz, thank you for using CellTypist! I think the issue you're seeing has to do with the scaling step. Have a look at `classifier.py` > `celltype`, particularly at this:

```python
logger.info(f"⚖️ Scaling input data")
means_ = self.model.scaler.mean_[lr_idx]
sds_ = self.model.scaler.scale_[lr_idx]
self.indata = (self.indata[:, k_x_idx] - means_) / sds_
self.indata[self.indata > 10] = 10
```

Unfortunately, subtracting the mean vector densifies a sparse matrix, so the input ends up dense at this step.
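A minimal sketch (with synthetic data, not CellTypist's actual pipeline) of the densification behavior described above — subtracting a dense vector from a scipy sparse matrix yields a dense result, because every implicit zero becomes a non-zero value:

```python
import numpy as np
from scipy import sparse

# Toy cells-x-genes matrix, ~1% non-zero entries
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
means = np.full(500, 0.5)

# Broadcasting a dense vector over a sparse matrix densifies it:
# every implicit zero becomes an explicit -0.5
Y = X - means

print(sparse.issparse(X), sparse.issparse(Y))  # True False
```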
Ah, I wasn't aware that sparse matrices were densified upon subtraction of a vector. That's unfortunate. Thanks!
@chbeltz, adding to this point, during training, scaling will also densify the matrix. You can skip the feature selection step, which uses all genes present in your data: subset your data to a set of informative genes (e.g., HVGs) beforehand, and disable the feature selection and expression check options during training. This would also reduce the RAM consumption.
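A rough sketch of the gene-subsetting idea above, using plain numpy/scipy. The variance ranking here is a simple stand-in for a proper HVG selection (e.g., scanpy's `highly_variable_genes`), and all names (`hvg_idx`, the matrix sizes) are illustrative:

```python
import numpy as np
from scipy import sparse

# Hypothetical cells-x-genes expression matrix, kept sparse throughout
X = sparse.random(500, 1000, density=0.05, format="csr", random_state=0)

# Rank genes by variance without densifying: E[x^2] - E[x]^2
mean = np.asarray(X.mean(axis=0)).ravel()
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
var = mean_sq - mean ** 2

# Keep the 200 most variable genes; column subsetting preserves sparsity
hvg_idx = np.argsort(var)[::-1][:200]
X_sub = X[:, hvg_idx]

print(sparse.issparse(X_sub), X_sub.shape)  # True (500, 200)
```

The subset matrix can then be passed to training with feature selection and the expression check disabled, as suggested above.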
Could you imagine implementing a `use_sparse` switch that preserves sparsity by skipping the subtraction of the mean during scaling? Same principle as the `with_mean` option of `sklearn.preprocessing.StandardScaler`.
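For reference, a minimal sketch of the `with_mean=False` behavior mentioned above: `StandardScaler` accepts sparse input in this mode and only divides by the per-feature scale, so zeros stay zero and the sparse structure survives:

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# with_mean=False skips centering; scaling by sds alone maps 0 -> 0,
# so the result remains sparse
scaler = StandardScaler(with_mean=False)
Xs = scaler.fit_transform(X)

print(sparse.issparse(Xs))  # True
```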
@chbeltz, for SGD logistic regression, according to the sklearn documentation, SGD is sensitive to feature scaling, and scaling the data to zero mean and unit variance is recommended.
Skipping the subtraction of the mean may therefore not be good practice, wdyt?
@ChuanXu1 I have not been able to find much empirical data on the effect of non-zero-centered input distributions on SGD performance, so I'm having a hard time weighing the pros and cons. However, if the alternative is that people with limited computing resources decide not to use the software at all, I feel it is preferable to provide an option that may yield less than optimal results, but results nonetheless.
@chbeltz, that sounds reasonable. I added these changes. This parameter will be available in the next version of CellTypist. Thx!
Much appreciated, thank you!! |
Is there a reason why the input data is converted to an np.array rather than accepting sparse matrices when running .train? Skimming the rest of the code, I cannot find anything that would not also work with sparse matrices. I am asking because this conversion to a dense array seems to be the reason I run out of RAM quite frequently when working with larger datasets.
Thanks
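To illustrate the scale of the problem described in the issue, here is a rough back-of-the-envelope comparison (synthetic matrix, sizes chosen arbitrarily; real scRNA-seq datasets are often larger):

```python
from scipy import sparse

# 10,000 cells x 2,000 genes at 1% density -- modest by scRNA-seq standards
X = sparse.random(10000, 2000, density=0.01, format="csr", random_state=0)

# CSR storage: non-zero values plus their column indices and row pointers
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# Size after conversion to a dense np.array: every entry stored explicitly
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(sparse_bytes // 1024, "KiB sparse vs", dense_bytes // (1024 * 1024), "MiB dense")
```

The dense representation is tens of times larger here, and the gap grows as the matrix gets sparser, which is why the conversion exhausts RAM on large datasets.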