In [1]:
using JuliaDB
using IndexedTables
using Dagger

using VCFTool

In [24]:
input_dir = "../input/"

vcf_738_1000_file_path = joinpath(input_dir, "738_variants_1000.vcf.gz")

vcf_file_path_to_use = vcf_738_1000_file_path;

In [23]:
vcf_table = make_vcf_indexedtable(vcf_file_path_to_use);

In [15]:
vcf_ndsparse = make_vcf_ndsparse(vcf_file_path_to_use);

In [16]:
Dagger.save(vcf_table, "../input/738_variants_1000.vcf.dagger");

In [17]:
using Dagger

vcf_table_dagger = Dagger.load("../input/738_variants_1000.vcf.dagger");

# Dagger load vs. load from file

Here we compare how long it takes to access a single variant and a chromosomal region in a regular JuliaDB IndexedTable built directly from the VCF file versus the same table saved with `Dagger.save` and reloaded here with `Dagger.load`. Saving and loading through Dagger adds another step and another dependency to data prep, so this speed test shows whether it's worth it.
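One caveat about the measurements: every `@time filter(i -> ..., tbl)` call passes a freshly written anonymous function, so each reported time also includes compiling that function. Below is a minimal sketch of a timing pattern that separates the two, using only base Julia. The `rows` vector is stand-in data rather than the real VCF table, and `is_variant`, `query`, and `min_elapsed` are hypothetical helper names introduced for illustration.

```julia
# Stand-in data: a vector of named tuples, NOT the real VCF table (assumption).
rows = [(CHROM = 1, POS = p) for p in 10_000:10:1_000_000]

# A named predicate is compiled once and reused across calls.
is_variant(row, chrom, pos) = row.CHROM == chrom && row.POS == pos

# The query under test, taking the data as an argument.
query(data) = filter(r -> is_variant(r, 1, 13_870), data)

# Hypothetical helper: run `f` once to absorb compilation,
# then report the fastest of `n` timed runs.
function min_elapsed(f; n = 5)
    f()  # warm-up run
    return minimum([@elapsed(f()) for _ in 1:n])
end

t = min_elapsed(() -> query(rows))
println("fastest of 5 runs: ", t, " s")
```

The same pattern could be applied to `vcf_table` and `vcf_table_dagger` by passing them to `query` in place of `rows`, provided the predicate is adapted to the table's row type.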

In [8]:
benchmark_variant = [1, 13868]

benchmark_region = [1, 10000, 100000]

3-element Array{Int64,1}:
      1
  10000
 100000

## Variant

#### IndexedTable from file

In [18]:
@time filter(i -> (i.CHROM == benchmark_variant[1]) && (i.POS == benchmark_variant[2]), vcf_table);

  0.102539 seconds (81.61 k allocations: 4.294 MiB)


#### Dagger file

In [19]:
@time filter(i -> (i.CHROM == benchmark_variant[1]) && (i.POS == benchmark_variant[2]), vcf_table_dagger);

  0.102654 seconds (81.62 k allocations: 4.295 MiB)


## Region

#### IndexedTable from file

In [28]:
@time filter(i -> (i.CHROM == benchmark_region[1]) && (i.POS > benchmark_region[2]) && (i.POS < benchmark_region[3]), vcf_table);

  0.105188 seconds (84.73 k allocations: 4.462 MiB)


#### Dagger file

In [30]:
@time filter(i -> (i.CHROM == benchmark_region[1]) && (i.POS > benchmark_region[2]) && (i.POS < benchmark_region[3]), vcf_table_dagger);

  0.105930 seconds (84.73 k allocations: 4.461 MiB)


# Access VCF data from NDSparse vs. IndexedTable

We expect the IndexedTable to be faster.
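A row-wise `filter` predicate tests every row, so it cannot take advantage of any sort order on the key columns, whichever structure holds the data. The sketch below illustrates the difference between a linear scan and a binary search over sorted keys in base Julia. The `positions` vector is a stand-in for a sorted POS column on one chromosome, which is an assumption and not the real VCF data.

```julia
# Stand-in for a sorted POS key column on a single chromosome (assumption).
positions = collect(10_000:7:1_000_000)

# Open interval, matching the strict > and < used in the region predicate.
lo, hi = 500_000, 800_000

# Linear scan: tests every element, like a row-wise filter predicate.
scan_hits = filter(p -> lo < p < hi, positions)

# Binary search: uses the sort order to locate the interval bounds.
first_idx = searchsortedfirst(positions, lo + 1)  # first position > lo
last_idx  = searchsortedlast(positions, hi - 1)   # last position < hi
range_hits = positions[first_idx:last_idx]

# Both approaches select the same positions.
println(scan_hits == range_hits)
```

Whether the table or NDSparse types expose such a lookup for these queries depends on how the tables were built, so this is only an illustration of the underlying idea.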

In [20]:
benchmark_variant_2 = [1, 19322]

benchmark_region_2 = [1, 500000, 800000]

3-element Array{Int64,1}:
      1
 500000
 800000

## Variant

#### NDSparse

In [42]:
@time filter(i -> (i.CHROM == benchmark_variant_2[1]) && (i.POS == benchmark_variant_2[2]), vcf_ndsparse)

  0.114877 seconds (81.05 k allocations: 4.354 MiB)


2-d NDSparse with 0 values (10 field named tuples):
CHROM  POS │ CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO
───────────┼─────────────────────────────────────────────

#### IndexedTable

In [43]:
@time filter(i -> (i.CHROM == benchmark_variant_2[1]) && (i.POS == benchmark_variant_2[2]), vcf_table)

  0.102852 seconds (81.61 k allocations: 4.301 MiB)


Table with 1 rows, 10 columns:
CHROM  POS    ID   REF  ALT  QUAL  FILTER                       INFO               FORMAT                                  GERM
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1      19322  "."  "C"  "T"  "0"   "LowGQX;NoPassedVariantGTs"  "SNVHPOL=3;MQ=13"  "GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL"  "0/1:15:0:17:0:14,3:10,1:4,2:0.0:LowGQX:17,0,146"

## Region

#### NDSparse

In [46]:
@time filter(i -> (i.CHROM == benchmark_region[1]) && (i.POS > benchmark_region[2]) && (i.POS < benchmark_region[3]), vcf_ndsparse)

  0.108096 seconds (82.02 k allocations: 4.396 MiB)


2-d NDSparse with 0 values (10 field named tuples):
CHROM  POS │ CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO
───────────┼─────────────────────────────────────────────

#### IndexedTable

In [47]:
@time filter(i -> (i.CHROM == benchmark_region[1]) && (i.POS > benchmark_region[2]) && (i.POS < benchmark_region[3]), vcf_table)

  0.105148 seconds (84.73 k allocations: 4.467 MiB)


Table with 95 rows, 10 columns:
CHROM  POS    ID   REF   ALT      QUAL   FILTER                                INFO                                           FORMAT                                  GERM
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1      10439  "."  "AC"  "A"      "72"   "PASS"                                "CIGAR=1M1D;RU=C;REFREP=4;IDREP=3;MQ=9"        "GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL"        "0/1:31:5:7:2,5:0,1:2,4:PASS:108,0,28"
1      13284  "."  "G"   "A"      "60"   "LowGQX;NoPassedVariantGTs"           "SNVHPOL=4;MQ=17"                              "GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL"  "0/1:93:1:30:1:19,11:12,8:7,3:-9.2:LowGQX:95,0,157"
1      13868  "."  "A"   "G"      "1"    "LowGQX;LowDepth;NoPassedVariantGTs"  "SNVHPOL=3;MQ=4"                               "GT:GQ: