-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental import, plus bug fix and doc/code cleanup #28
Conversation
…ering of ID field in vcf
…isting callset. added unit test as well
… warning fixes too
Codecov Report
@@ Coverage Diff @@
## develop #28 +/- ##
===========================================
+ Coverage 74.22% 76.12% +1.89%
===========================================
Files 113 113
Lines 16128 16223 +95
Branches 257 267 +10
===========================================
+ Hits 11971 12349 +378
+ Misses 4016 3714 -302
- Partials 141 160 +19
Continue to review full report at Codecov.
|
if (value != null) { | ||
throw new GenomicsDBException("Duplicate sample name found: "+sampleName+". Sample "+ | ||
"was originally in "+value); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be useful to get a list of all duplicates before throwing the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm following what GATK does here for duplicates within an import - they throw exception on the first duplicate. Do you think a (potentially long) list of duplicates would be useful to the users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could stop at some predetermined number. But, only if this feature is useful in the first place. Your call.
Is it possible to save the original callset file and maybe a list with original fragment names before overwriting, basically the state of the filesystem starting at the workspace, even if we don't offer rollback? |
Could save the original callset and fragment names - but if we're doing that, we should probably provide a tool that uses that to recover the original workspace as well... |
Agreed, we need a tool to recover the original workspace, but we can start by saving the original callset and fragment names. We should open an issue for writing the tool to recover the original workspace based on what can be gathered from the saved artifacts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Incremental import support for GenomicsDBImporter.
This dragged on for a while, so I also pulled in a bug fix and some other cleanup. This will currently error out if duplicate callsets/samples are passed in (specifically, duplicates between previously imported samples and current ones).
Caveat emptor: We'll overwrite the existing callset file as part of this, and offer no guarantees as to the integrity of the workspace. That is, if incremental import fails for whatever reason, some of the arrays, callset files, etc might be updated while others may not. We don't offer rollback either.