In [1]:
using JuliaDB
using IndexedTables
using Dagger

using VCFTool

In [6]:
input_dir = "../input/"

vcf_738_1000_file_path = joinpath(input_dir, "738_variants_1000.vcf.gz")

vcf_file_path_to_use = vcf_738_1000_file_path

"../input/738_variants_1000.vcf.gz"

In [7]:
vcf_table = make_vcf_indexedtable(vcf_file_path_to_use)

Table with 999 rows, 10 columns:
Columns:
#   colname  type
───────────────────
1   CHROM    Any
2   POS      Any
3   ID       String
4   REF      String
5   ALT      String
6   QUAL     String
7   FILTER   String
8   INFO     String
9   FORMAT   String
10  GERM     String

In [11]:
vcf_ndsparse = make_vcf_ndsparse(vcf_file_path_to_use)

1-d NDSparse with 999 values (10 field named tuples):
    Dimensions
#  colname  type
─────────────────
1  1        Int64
    Values
#   colname  type
───────────────────
2   CHROM    String
3   POS      String
4   ID       String
5   REF      String
6   ALT      String
7   QUAL     String
8   FILTER   String
9   INFO     String
10  FORMAT   String
11  GERM     String

In [4]:
Dagger.save(vcf_table, "../input/738_variants_1000.vcf.dagger")

Table with 999 rows, 10 columns:
Columns:
#   colname  type
───────────────────
1   CHROM    Any
2   POS      Any
3   ID       String
4   REF      String
5   ALT      String
6   QUAL     String
7   FILTER   String
8   INFO     String
9   FORMAT   String
10  GERM     String

In [5]:
using Dagger

vcf_table_dagger = Dagger.load("../input/738_variants_1000.vcf.dagger")

Table with 999 rows, 10 columns:
Columns:
#   colname  type
───────────────────
1   CHROM    Any
2   POS      Any
3   ID       String
4   REF      String
5   ALT      String
6   QUAL     String
7   FILTER   String
8   INFO     String
9   FORMAT   String
10  GERM     String

# Dagger load vs. load from file

Here we compare how long it takes to access a particular variant and a chromosomal region in two tables: a regular JuliaDB IndexedTable built directly from the VCF file, and a table that was written with `Dagger.save` and read back here with `Dagger.load`. Saving and loading a VCF through Dagger adds another step and another dependency to data prep, so this speed test should tell us whether it's worth it.

In [8]:
benchmark_variant = [1, 13868]          # [chromosome, position]

benchmark_region = [1, 10000, 100000]   # [chromosome, region start, region end]

3-element Array{Int64,1}:
      1
  10000
 100000

#### IndexedTable from file

In [9]:
@time filter(i -> (i.CHROM == benchmark_variant[1]) && (i.POS == benchmark_variant[2]), vcf_table)

  4.418017 seconds (6.68 M allocations: 377.287 MiB, 2.02% gc time)


Table with 1 rows, 10 columns:
Columns:
#   colname  type
───────────────────
1   CHROM    Any
2   POS      Any
3   ID       String
4   REF      String
5   ALT      String
6   QUAL     String
7   FILTER   String
8   INFO     String
9   FORMAT   String
10  GERM     String

#### Dagger file

In [10]:
@time filter(i -> (i.CHROM == benchmark_variant[1]) && (i.POS == benchmark_variant[2]), vcf_table_dagger)

  0.090228 seconds (81.61 k allocations: 4.295 MiB)


Table with 1 rows, 10 columns:
Columns:
#   colname  type
───────────────────
1   CHROM    Any
2   POS      Any
3   ID       String
4   REF      String
5   ALT      String
6   QUAL     String
7   FILTER   String
8   INFO     String
9   FORMAT   String
10  GERM     String
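
A caveat when reading the two timings above: each is a single `@time` call, and in Julia the first call to a function with new argument types includes JIT compilation. The IndexedTable cell ran first, so some of its 4.4 seconds and 6.68 M allocations may be compilation rather than lookup cost. The sketch below shows one way to separate the two by timing a warm-up call and a second call. It uses a plain vector of named tuples as a stand-in for the table, and `find_variant` is a hypothetical helper, so it illustrates the timing pattern only and says nothing about the actual speed of either table.

```julia
# Stand-in data: 100,000 rows on chromosome 1 (an assumption for illustration).
rows = [(CHROM = 1, POS = p) for p in 1:100_000]
variant = (1, 13_868)   # (chromosome, position)

# Hypothetical helper wrapping the same predicate used in the notebook cells.
find_variant(rows, variant) =
    filter(r -> r.CHROM == variant[1] && r.POS == variant[2], rows)

# The first call pays for compilation; the second reuses the compiled code.
first_call  = @elapsed find_variant(rows, variant)
second_call = @elapsed find_variant(rows, variant)

println("first call:  ", first_call, " s (includes compilation)")
println("second call: ", second_call, " s")
```

Running each lookup once before timing it, or using the `@btime` macro from BenchmarkTools.jl, would make the two tables easier to compare on equal terms.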

# Access VCF data from NDSparse vs. IndexedTable

The expectation is that IndexedTable will be faster.
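
One thing to check before trusting this comparison: the column listing for `vcf_ndsparse` above reports `CHROM` and `POS` as `String`, while the benchmark variant is a vector of integers. In Julia, `==` between a `String` and an `Int` returns `false`, so if those columns really hold strings the predicate would match no rows and the timing would not measure a successful lookup. The sketch below demonstrates the mismatch on a single stand-in row and one possible workaround using `tryparse`; the row values are made up for illustration.

```julia
# Stand-in row with string-typed fields, as the NDSparse listing suggests.
row = (CHROM = "1", POS = "19322")
variant = [1, 19322]   # [chromosome, position] as integers

# Direct comparison: String vs Int is never equal.
naive_match = row.CHROM == variant[1] && row.POS == variant[2]

# Parse before comparing. tryparse returns nothing for non-numeric
# chromosome names such as "X", which then simply fails to match.
parsed_match = tryparse(Int, row.CHROM) == variant[1] &&
               tryparse(Int, row.POS) == variant[2]

println("naive match:  ", naive_match)
println("parsed match: ", parsed_match)
```

Converting the columns once when the table is built would avoid parsing inside every predicate call.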

In [None]:
benchmark_variant_2 = [1, 19322]            # [chromosome, position]

benchmark_region_2 = [1, 500000, 800000]    # [chromosome, region start, region end]

#### NDSparse

In [None]:
@time filter(i -> (i.CHROM == benchmark_variant_2[1]) && (i.POS == benchmark_variant_2[2]), vcf_ndsparse)

#### IndexedTable

In [None]:
@time filter(i -> (i.CHROM == benchmark_variant_2[1]) && (i.POS == benchmark_variant_2[2]), vcf_table)