Background distribution handling #74

Open
biodataganache opened this issue Aug 28, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@biodataganache
Collaborator

To avoid loading all kmer matrices into memory in the model pipeline (both the score and model rules do this), we can create background distributions from all kmers in a dataset. These could be constructed ahead of time in a dedicated 'build-dist' pipeline, or built on the fly for each individual family in its own thread. The score and model rules would then load these background distributions, each of which will only be a bit bigger than the length of the kmers, combine them, and use the combined result to score and model. The model step is the part I'm not as clear about how to do.
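A rough sketch of the per-family idea, assuming each background is just a table of kmer counts that can be summed and then normalized. The function names (`family_background`, `combine_backgrounds`) and the toy sequences are made up for illustration and are not existing code in this repo:

```python
from collections import Counter
from typing import Dict, Iterable


def family_background(sequences: Iterable[str], k: int) -> Counter:
    """Count every k-mer in one family's sequences (hypothetical helper)."""
    counts: Counter = Counter()
    for seq in sequences:
        # Sequences shorter than k yield an empty range and add nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def combine_backgrounds(backgrounds: Iterable[Counter]) -> Dict[str, float]:
    """Sum per-family counts, then normalize to relative frequencies."""
    total: Counter = Counter()
    for bg in backgrounds:
        total.update(bg)
    n = sum(total.values())
    return {kmer: c / n for kmer, c in total.items()} if n else {}


# Toy input: each family could be processed in its own thread or job,
# since only the small count tables need to be held and combined.
families = {"famA": ["ACDA", "CDA"], "famB": ["ACD"]}
per_family = {name: family_background(seqs, k=2) for name, seqs in families.items()}
background = combine_backgrounds(per_family.values())
```

Only the count tables are kept per family here, not the full kmer matrices, which is the point of the proposal. How the model rule would consume `background` is left open, as noted above.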

@biodataganache
Collaborator Author

This would allow the creation of generalized kmer background distribution files, pre-constructed for particular k/alphabet combinations. The user then wouldn't have to worry about supplying a background and could train a model using one of these files. They could be included in the repo.
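One possible shape for such files, as a sketch only: one JSON file per k/alphabet combination, looked up by name. The file naming scheme, the JSON layout, and the helper names are assumptions for illustration, not an existing format in this repo:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def background_path(root: Path, alphabet: str, k: int) -> Path:
    # Hypothetical naming scheme: one file per k/alphabet combination.
    return root / f"background_{alphabet}_k{k}.json"


def save_background(root: Path, alphabet: str, k: int, freqs: Dict[str, float]) -> Path:
    """Write a pre-constructed background so it can be shipped and reused."""
    path = background_path(root, alphabet, k)
    payload = {"alphabet": alphabet, "k": k, "frequencies": freqs}
    path.write_text(json.dumps(payload, sort_keys=True))
    return path


def load_background(root: Path, alphabet: str, k: int) -> Dict[str, float]:
    """Load the background for a k/alphabet combination, if one was shipped."""
    path = background_path(root, alphabet, k)
    if not path.exists():
        raise FileNotFoundError(f"no background for alphabet={alphabet!r}, k={k}")
    return json.loads(path.read_text())["frequencies"]


# Round trip in a temporary directory standing in for a directory in the repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_background(root, "protein", 2, {"AC": 0.5, "CD": 0.5})
    loaded = load_background(root, "protein", 2)
```

With something like this, a user-supplied background would become optional: the pipeline could fall back to the shipped file matching the configured k and alphabet.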
