RAM usage by converting from sparse to dense #29

Closed

chbeltz opened this issue Jun 20, 2022 · 8 comments

chbeltz commented Jun 20, 2022

Is there a reason why the input data is converted to an np.array rather than accepted as a sparse matrix when running .train? Skimming the rest of the code, I cannot find anything that would not also work with sparse matrices. I am asking because this conversion to a dense array appears to be the reason I frequently run out of RAM when working with larger datasets.

Thanks

prete (Collaborator) commented Jun 20, 2022

Hi @chbeltz, thank you for using CellTypist!

I think the issue you're seeing has to do with the scaling technique that's used. Have a look at classifier.py > celltype, particularly at this:

        logger.info(f"⚖️ Scaling input data")
        means_ = self.model.scaler.mean_[lr_idx]
        sds_ = self.model.scaler.scale_[lr_idx]
        self.indata = (self.indata[:, k_x_idx] - means_) / sds_
        self.indata[self.indata > 10] = 10

Unfortunately, self.indata[:, k_x_idx] - means_ densifies the matrix, which is most likely why you run out of RAM with large datasets.
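
For illustration (not CellTypist code), a minimal sketch of the densification with scipy: subtracting a dense vector from a sparse matrix falls back to a fully dense result.

    import numpy as np
    from scipy import sparse

    # Toy expression matrix: mostly zeros, stored as CSR
    X = sparse.random(10_000, 2_000, density=0.05, format="csr", dtype=np.float32)
    means = np.asarray(X.mean(axis=0), dtype=np.float32).ravel()  # per-gene means, shape (2000,)

    # Subtracting a dense vector turns every implicit zero into a stored value,
    # so scipy returns a dense matrix the size of the whole array.
    X_centered = X - means
    print(type(X_centered))               # <class 'numpy.matrix'> (dense)
    print(X_centered.nbytes / 1e6, "MB")  # ~80 MB here; far more for real datasets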

chbeltz commented Jun 20, 2022

Ah, I wasn't aware that sparse matrices were densified upon subtraction of a vector. That's unfortunate.

Thanks!

ChuanXu1 (Collaborator) commented

@chbeltz, adding to this point: during training, scaling will also densify the matrix. You can skip the feature selection step, which otherwise runs on all genes present in your data. That is, subset your data to a set of useful genes (e.g., HVGs) beforehand, and disable the feature selection and expression check options during training; this will also reduce RAM consumption.
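
A rough sketch of that workflow, assuming an AnnData input that is already normalised and log1p-transformed, with the training labels in a hypothetical .obs column cell_type (the scanpy HVG call is just one way to pick the gene subset):

    import scanpy as sc
    import celltypist

    adata = sc.read_h5ad("my_data.h5ad")   # hypothetical input file

    # Subset to a manageable set of informative genes up front (HVGs are one choice)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata_hvg = adata[:, adata.var.highly_variable].copy()

    # Train on the reduced, still-sparse matrix and skip the steps mentioned above
    model = celltypist.train(
        adata_hvg,
        labels="cell_type",        # column in adata.obs holding the training labels
        check_expression=False,    # skip the expression check
        feature_selection=False,   # skip CellTypist's own feature selection pass
    )
    model.write("model_hvg.pkl")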

chbeltz commented Jun 21, 2022

Could you imagine implementing a use_sparse switch that preserves sparsity by skipping the mean subtraction during scaling? Same principle as the with_mean option of sklearn.preprocessing.StandardScaler.
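
For reference, a small demo (not CellTypist code) of how that sklearn option behaves on sparse input:

    from scipy import sparse
    from sklearn.preprocessing import StandardScaler

    X = sparse.random(1_000, 500, density=0.05, format="csr")

    # The default (with_mean=True) refuses sparse input, because centering would densify it
    try:
        StandardScaler().fit_transform(X)
    except ValueError as err:
        print(err)

    # Scaling by the standard deviation only keeps the CSR structure intact
    X_scaled = StandardScaler(with_mean=False).fit_transform(X)
    print(type(X_scaled), X_scaled.nnz)   # still sparse, same sparsity pattern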

ChuanXu1 (Collaborator) commented

@chbeltz, for SGD logistic regression, the sklearn documentation says: "For best results using the default learning rate schedule, the data should have zero mean and unit variance." Skipping the mean subtraction may therefore not be good practice, wdyt?

chbeltz commented Jun 26, 2022

@ChuanXu1 I have not been able to find much empirical data on how non-zero-centered input distributions affect the performance of SGD, so I'm having a hard time weighing the pros and cons. However, if the alternative is that people with limited computing resources decide not to use the software at all, it may be preferable to offer an option that gives less than optimal results, but results nonetheless.

ChuanXu1 (Collaborator) commented

@chbeltz, that sounds reasonable. I have added these changes (a with_mean parameter in celltypist.train) to optimize RAM usage during training, at a possible cost of reduced performance: dfb11e0

This parameter will be available in the next version of CellTypist. Thx!
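
For reference, usage would look something like this (reusing the names from the HVG sketch above; treat it as a sketch until the release):

    import celltypist

    # with_mean=False skips the mean subtraction during scaling, so a sparse input
    # stays sparse; this trades some potential accuracy for lower RAM usage.
    model = celltypist.train(
        adata_hvg,                 # sparse, HVG-subset AnnData from the earlier sketch
        labels="cell_type",
        with_mean=False,
        check_expression=False,
        feature_selection=False,
    )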

chbeltz commented Jun 30, 2022

Much appreciated, thank you!!

chbeltz closed this as completed Jun 30, 2022