Verify integrity and identity of databases #2

Open
dfornika opened this issue Oct 12, 2018 · 1 comment

@dfornika

The pipeline has several external data dependencies (kraken2 databases, mash sketches, etc.). There should be a way to verify whether those databases are in an expected state or have changed. For example, two 'standard' kraken2 databases built on different dates may have different contents, because the contents of RefSeq change continually.

We may be able to use pre-computed checksums to verify database integrity. We may not want to verify on every pipeline run, because calculating the hashes can be slow. Instead, we could provide a separate 'database verification' script that could be run periodically, or once before a set of pipeline runs is submitted.
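
As a rough illustration of the checksum idea, here is a minimal sketch of what such a verification script might look like. It is only a sketch under stated assumptions: the function names (`file_sha256`, `build_manifest`, `write_manifest`, `load_manifest`, `verify_database`), the JSON manifest format, and the choice of SHA-256 are all hypothetical and not part of the pipeline. It assumes a database is a directory of files and that the manifest is stored outside that directory.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large database files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(db_dir):
    """Map each file's path (relative to db_dir) to its SHA-256 digest."""
    db_dir = Path(db_dir)
    return {
        path.relative_to(db_dir).as_posix(): file_sha256(path)
        for path in sorted(db_dir.rglob("*"))
        if path.is_file()
    }


def write_manifest(manifest, manifest_path):
    """Save the manifest as JSON (hypothetical format, kept outside db_dir)."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))


def load_manifest(manifest_path):
    """Load a manifest previously saved by write_manifest."""
    return json.loads(Path(manifest_path).read_text())


def verify_database(db_dir, manifest):
    """Compare the files on disk to the manifest.

    Returns a dict of relative path -> "missing", "unexpected", or "changed".
    An empty dict means the database matches the recorded state.
    """
    current = build_manifest(db_dir)
    problems = {}
    for name in sorted(set(manifest) | set(current)):
        if name not in current:
            problems[name] = "missing"
        elif name not in manifest:
            problems[name] = "unexpected"
        elif current[name] != manifest[name]:
            problems[name] = "changed"
    return problems
```

The manifest would be generated once, when a database is built or downloaded, with `write_manifest(build_manifest(db_dir), manifest_path)`. A later check is then `verify_database(db_dir, load_manifest(manifest_path))`, which can be run periodically or before a batch of runs rather than on every run. Hashing is the slow part, so a faster hash could be swapped in if SHA-256 turns out to be too slow for large databases.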

@ddooley

ddooley commented Oct 12, 2018

Take a peek at Kive http://cfe-lab.github.io/Kive/ - and ask Don Kirkby about what they did in the hashing department. Kive had great foresight in hashing all inputs and using that to be able to stop/continue jobs and know which parts had to be rerun. Not sure if that included reference databases but I wouldn't be surprised if so. They may have some quick hashing tips.
