Incremental import, plus bug fix and doc/code cleanup #28

mlathara · 2019-04-12T04:13:51Z

Incremental import support for GenomicsDBImporter.

This dragged on for a while, so I also pulled in a bug fix and some other cleanup. This will currently error out if duplicate callsets/samples are passed in (specifically, duplicates between previously imported samples and current ones).

Caveat emptor: We'll overwrite the existing callset file as part of this, and offer no guarantees as to the integrity of the workspace. That is, if incremental import fails for whatever reason, some of the arrays, callset files, etc. might be updated while others might not. We don't offer rollback either.

codecov · 2019-04-12T04:51:38Z

Codecov Report

Merging #28 into develop will increase coverage by 1.89%.
The diff coverage is 75.75%.

@@             Coverage Diff             @@
##           develop      #28      +/-   ##
===========================================
+ Coverage    74.22%   76.12%   +1.89%     
===========================================
  Files          113      113              
  Lines        16128    16223      +95     
  Branches       257      267      +10     
===========================================
+ Hits         11971    12349     +378     
+ Misses        4016     3714     -302     
- Partials       141      160      +19

Impacted Files	Coverage Δ
...c/main/java/org/genomicsdb/GenomicsDBUtilsJni.java	`42.85% <ø> (ø)`	⬆️
src/main/java/org/genomicsdb/GenomicsDBUtils.java	`54.54% <100%> (+17.04%)`	⬆️
...nomicsdb/importer/extensions/VidMapExtensions.java	`85.96% <100%> (+0.51%)`	⬆️
...c/main/java/org/genomicsdb/model/ImportConfig.java	`77.67% <100%> (+1.04%)`	⬆️
...ain/java/org/genomicsdb/spark/GenomicsDBInput.java	`73.91% <46.42%> (+38.83%)`	⬆️
...va/org/genomicsdb/importer/GenomicsDBImporter.java	`76.74% <61.76%> (-1.52%)`	⬇️
.../org/genomicsdb/model/CommandLineImportConfig.java	`80.48% <88.88%> (+0.74%)`	⬆️
...csdb/importer/extensions/CallSetMapExtensions.java	`61.33% <88.88%> (+10.45%)`	⬆️
src/main/jni/src/genomicsdb_GenomicsDBUtils.cc	`71.25% <96.66%> (+13.25%)`	⬆️
src/main/cpp/include/loader/load_operators.h	`80.39% <0%> (-7.85%)`	⬇️
... and 18 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0f7db32...25bb181.

.travis/scripts/install_spark.sh

src/main/java/org/genomicsdb/importer/GenomicsDBImporter.java

nalinigans · 2019-04-13T04:26:09Z

src/main/java/org/genomicsdb/importer/extensions/CallSetMapExtensions.java

+            if (value != null) {
+                throw new GenomicsDBException("Duplicate sample name found: "+sampleName+". Sample "+
+                        "was originally in "+value);
+            }


It would be useful to get a list of all duplicates before throwing the exception.

mlathara · 2019-04-14

I'm following what GATK does here for duplicates within an import: it throws an exception on the first duplicate. Do you think a (potentially long) list of duplicates would be useful to users?

nalinigans · 2019-04-14

We could stop at some predetermined number, but only if this feature is useful in the first place. Your call.
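For reference, the "stop at some predetermined number" idea could look roughly like the sketch below. It is not the code in this PR: `DuplicateSampleCheck`, `checkNoDuplicates`, and `maxReported` are hypothetical names, the two maps stand in for whatever the callset extensions actually hold, and `IllegalStateException` is used in place of `GenomicsDBException` so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GenomicsDB: gathers duplicate sample names
// (up to a cap) and reports them in a single exception.
final class DuplicateSampleCheck {
    private DuplicateSampleCheck() {}

    /**
     * @param existing    sample name -> where it was originally imported from
     * @param incoming    sample name -> source in the current import
     * @param maxReported cap on how many duplicates are listed in the message
     */
    static void checkNoDuplicates(Map<String, String> existing,
                                  Map<String, String> incoming,
                                  int maxReported) {
        List<String> listed = new ArrayList<>();
        int total = 0;
        for (String sampleName : incoming.keySet()) {
            String value = existing.get(sampleName);
            if (value != null) {
                total++;
                // Keep counting past the cap, but stop growing the message.
                if (listed.size() < maxReported) {
                    listed.add(sampleName + " (originally in " + value + ")");
                }
            }
        }
        if (total > 0) {
            // Stand-in for GenomicsDBException in this self-contained sketch.
            throw new IllegalStateException("Found " + total
                    + " duplicate sample name(s); listing " + listed.size()
                    + ": " + String.join(", ", listed));
        }
    }
}
```

The trade-off is one extra pass over the incoming names before failing, in exchange for a message that tells the user how many samples need fixing rather than only the first one.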

src/main/java/org/genomicsdb/model/ImportConfig.java

nalinigans · 2019-04-13T22:26:31Z

Caveat emptor: We'll overwrite the existing callset file as part of this, and offer no guarantees as to the integrity of the workspace. That is, if incremental import fails for whatever reason, some of the arrays, callset files, etc. might be updated while others might not. We don't offer rollback either.

Is it possible to save the original callset file and maybe a list with original fragment names before overwriting, basically the state of the filesystem starting at the workspace, even if we don't offer rollback?

mlathara · 2019-04-14T20:13:55Z

Could save the original callset and fragment names - but if we're doing that, we should probably provide a tool that uses that to recover the original workspace as well...

nalinigans · 2019-04-14T21:31:58Z

Agreed, we need a tool to recover the original workspace, but we can start by saving the original callset and fragment names. We should open an issue for writing the tool to recover the original workspace based on what can be gathered from the saved artifacts.
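As a starting point for that follow-up issue, saving the pre-import state could be sketched as below. This is only an illustration under stated assumptions: the workspace is on a local filesystem (no HDFS or cloud storage), fragments are the directories two levels below the workspace (`workspace/array/fragment`), and `WorkspaceSnapshot`, `backupCallset`, `listFragments`, and the `.bak` suffix are hypothetical names, not existing GenomicsDB API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of GenomicsDB: records what the workspace
// looked like before an incremental import touches it.
final class WorkspaceSnapshot {
    private WorkspaceSnapshot() {}

    /** Copies the callset file next to itself with a ".bak" suffix. */
    static Path backupCallset(Path callsetFile) throws IOException {
        Path backup = callsetFile.resolveSibling(callsetFile.getFileName() + ".bak");
        return Files.copy(callsetFile, backup, StandardCopyOption.REPLACE_EXISTING);
    }

    /**
     * Lists directories exactly two levels below the workspace, as paths
     * relative to the workspace. Assumes a workspace/array/fragment layout.
     */
    static List<String> listFragments(Path workspace) throws IOException {
        try (Stream<Path> paths = Files.walk(workspace, 2)) {
            return paths.filter(Files::isDirectory)
                    .map(workspace::relativize)
                    .filter(p -> p.getNameCount() == 2)
                    .map(Path::toString)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Writes the fragment list, one entry per line, to the given file. */
    static void saveFragmentList(Path workspace, Path out) throws IOException {
        Files.write(out, listFragments(workspace));
    }
}
```

A recovery tool could later compare the saved list against the current fragment directories and treat anything not in the list as belonging to the failed import; that comparison is deliberately left out here.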

nalinigans

LGTM

mlathara added 15 commits April 1, 2019 10:41

incremental import changes for genomicsdbimporter

a977e0d

getting rid of legacy spark rdd stuff

8e3ce6f

unit test changes for genomicsdbimporter changes

ce39ddf

changes to make run.py use vcfdiff and avoid test failures due to ordering of ID field in vcf

394de67

small fixes and cleanup. added additional unit tests

a118018

updating vcfs to fix small typo. header had ### instead of ## in one location

44dbdbe

cleanup and removing dead code

393cd10

CI tests and json to for genomicsdbimporter incremental import

1bc1b49

code to catch case where incremental import duplicate callset from existing callset. added unit test as well

c47b550

bug fix for spark getsplits, added regression test. bunch of java doc warning fixes too

ad53b86

switching to apache archive for spark, try to parallelize junits

4a4315d

disable junits for os x travis...takes too long

094fca3

one more try for osx travis

4bb75ad

merge latest develop. split osx travis runs into two to avoid timeout

650bb2e

adding retry logic to mitigate move to apache archive for spark

28b39cc

mlathara requested review from kgururaj and nalinigans April 12, 2019 04:13

nalinigans reviewed Apr 13, 2019

.travis/scripts/install_spark.sh (outdated, resolved)

nalinigans reviewed Apr 13, 2019

src/main/java/org/genomicsdb/importer/GenomicsDBImporter.java (outdated, resolved)

nalinigans reviewed Apr 13, 2019

src/main/java/org/genomicsdb/importer/GenomicsDBImporter.java (resolved)

nalinigans reviewed Apr 13, 2019

nalinigans mentioned this pull request Apr 13, 2019

Move to 1.1.0 snapshots and support for azure data lake v2 storage #29

Merged

mlathara added 4 commits April 16, 2019 14:37

resolving PR comments

61b8048

adding fq name for class

f737ef8

Merge branch 'develop' of https://github.com/GenomicsDB/GenomicsDB into incremental

e9da938

trying to rebalance osx tests

9039e93

one more rebalance

25bb181

nalinigans approved these changes Apr 18, 2019

nalinigans merged commit efa1a37 into develop Apr 18, 2019

nalinigans deleted the incremental branch April 18, 2019 17:49

mlathara mentioned this pull request Jul 31, 2019

Add incremental support to GenomicsDBImport #18

Closed
