![](https://juliacomputing.com/assets/img/new/JuliaDB_logo2.svg)

# Intro to JuliaDB

JuliaDB is an analytical database. Among its highlights are its abilities to:

- **Load multi-file datasets**
- **Index the data and perform filter, aggregate, sort and join operations**
- **Compile queries**
- **Store any data type**
- **Save results and load them efficiently later**
- **Use Julia's built-in parallelism to fully utilize any machine or cluster**
- **Integrate with OnlineStats for analytics**

# Helpful links

For additional documentation, please refer to

- [JuliaDB API Reference](http://juliadb.org/latest/api/index.html)
- [OnlineStats Docs](http://joshday.github.io/OnlineStats.jl/stable/)

# Let's Get Started

In this notebook, we'll introduce you to JuliaDB's two main data structures, `Table` and `NDSparse`, while

1. Creating tables from Julia Vectors
1. Accessing data from `Table` and `NDSparse`
1. Loading tables from CSVs
1. Saving tables into binary format
1. Reloading a saved table
1. Leveraging selectors

The first time you work with JuliaDB (off JuliaBox), you'll want to run `Pkg.add("JuliaDB")`.

For today, we can get started by loading JuliaDB via

In [1]:
using JuliaDB

# Creating tables from Julia Vectors

In this section, we'll use Julia Vectors to generate JuliaDB's two main data structures,

- `Table`
- `NDSparse`

`Table` and `NDSparse` perform similarly, so we'll discuss how to choose between them.

Let's create some vectors `x`, `y`, and `z` that we can use to generate `Table` and `NDSparse` data structures.

In [2]:
x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = randn(4);

### `Table`: sorted by primary key(s)

We can create a table in JuliaDB using the `table` function

In [3]:
t1 = table(x, y, z)

Table with 4 rows, 3 columns:
1      2    3
─────────────────────
false  'B'  1.56161
true   'B'  -0.228114
false  'A'  0.304639
true   'A'  -1.61879

Here the vectors `x`, `y`, and `z` became columns of the output table. By default, these columns are numbered `1`, `2`, and `3`. 

We can use the keyword argument `names` to label the columns.

In [4]:
t2 = table(x,  y, z, names = [:x, :y, :z])

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
false  'B'  1.56161
true   'B'  -0.228114
false  'A'  0.304639
true   'A'  -1.61879

Furthermore, we can **sort** a table using the keyword argument `pkey`. `pkey` specifies the primary keys by which the output table will be sorted.

In [5]:
t3 = table(x, y, z, names = [:x, :y, :z], pkey=(:x, :y))

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
false  'A'  0.304639
false  'B'  1.56161
true   'A'  -1.61879
true   'B'  -0.228114

In this example, the rows of `t3` are sorted first by the values in column x and second by the values in column y.
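
To make the ordering concrete, here is a plain-Julia sketch (it does not use JuliaDB) that sorts the same rows by `x` and then by `y`. It assumes Julia 1.x; the `z` values are fixed copies of the ones printed above rather than fresh `randn` draws, and the names `rows` and `sorted_rows` are only for illustration.

```julia
# Plain-Julia sketch (no JuliaDB): order rows by x first, then by y.
x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = [1.56161, -0.228114, 0.304639, -1.61879]   # copied from the printed table

rows = collect(zip(x, y, z))                    # one (x, y, z) tuple per row

# Tuples compare lexicographically, so (x, y) acts as a two-level sort key.
sorted_rows = sort(rows, by = r -> (r[1], r[2]))

for r in sorted_rows
    println(r)
end
```

The resulting row order matches the one shown for `t3`.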

In this next example, `t4` is sorted by the values in column z.

In [6]:
t4 = table(x, y, z, names = [:x, :y, :z], pkey=:z)

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
true   'A'  -1.61879
true   'B'  -0.228114
false  'A'  0.304639
false  'B'  1.56161

We can also build a `table` from a **"named tuple"**.

In regular tuples, we can grab the entries of a tuple by indexing

In [7]:
tuple1 = ("Josh", "Day")
tuple1[1]

"Josh"

We can create a named tuple using the `@NT` macro. It lets us bind the elements of our tuple to names that we can then use to access those elements.

In [8]:
tuple2 = @NT(firstname = "Josh", lastname = "Day")

(firstname = "Josh", lastname = "Day")

In `tuple2`, the first element, "Josh", is now named "firstname". We can now use this name to access "Josh".

In [9]:
tuple2.firstname

"Josh"

In [10]:
tuple2[:firstname]

"Josh"
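
The `@NT` macro belongs to the Julia 0.6 era, where named tuples came from the NamedTuples.jl package. From Julia 0.7 onward, named tuples are built into the language, so the same value can be written without a macro. A short sketch, assuming Julia 1.x:

```julia
# Built-in named tuple syntax (Julia 0.7 and later); no macro needed.
tuple2 = (firstname = "Josh", lastname = "Day")

tuple2.firstname     # access by field name
tuple2[:firstname]   # access by symbol
tuple2[1]            # positional access still works
keys(tuple2)         # the field names, as a tuple of symbols
```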

If we use a named tuple to create a `table` from the vectors `x`, `y`, and `z`, we don't need to explicitly name the columns of the resulting table with the `names` keyword argument:

In [11]:
t5 = table(@NT(x=x, y=y, z=z), pkey = [:x, :y])

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
false  'A'  0.304639
false  'B'  1.56161
true   'A'  -1.61879
true   'B'  -0.228114

### `NDSparse`: N-dimensional Sparse Array

The function `ndsparse` can be used to create an `NDSparse` object, which is designed to store data values that are sparse over the domain of index values.

Like `Table`s, `NDSparse` objects can be created with or without the use of named tuples.

In [12]:
nd1 = ndsparse((x, y), z)

2-d NDSparse with 4 values (Float64):
1      2   │
───────────┼──────────
false  'A' │ 0.304639
false  'B' │ 1.56161
true   'A' │ -1.61879
true   'B' │ -0.228114

NDSparse maps tuples of indices of arbitrary types to values, just like an Array maps tuples of integer indices to values. 

In `nd1` above and `nd2` below, the indices are shown to the left of the vertical line, while the values they map to are to the right.

In [13]:
nd2 = ndsparse(@NT(x=x, y=y), @NT(z=z))

2-d NDSparse with 4 values (1 field named tuples):
x      y   │ z
───────────┼──────────
false  'A' │ 0.304639
false  'B' │ 1.56161
true   'A' │ -1.61879
true   'B' │ -0.228114

# Accessing data from `Table` and `NDSparse`

In the last section, we created the `Table`, `t3`:

In [14]:
t3

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
false  'A'  0.304639
false  'B'  1.56161
true   'A'  -1.61879
true   'B'  -0.228114

We can grab the first row of `t3` by indexing into `t3`

In [15]:
t3[1]

(x = false, y = 'A', z = 0.3046387127952048)

and we can grab any of the elements in that row using the name of the associated columns

In [16]:
t3[1].z

0.3046387127952048

The syntax for performing a lookup in an `NDSparse` object is slightly different.

For example, if we want to look up all the rows of `nd2` where `x` is `false`, we would write

In [17]:
nd2[false, :]

1-d NDSparse with 2 values (1 field named tuples):
y   │ z
────┼─────────
'A' │ 0.304639
'B' │ 1.56161

If we want to look up the entry of `nd2` where `x` is `false` and `y` is `'A'`, we would write

In [18]:
nd2[false, 'A']

(z = 0.3046387127952048)

and we can grab the value of `z` associated with this entry via

In [19]:
nd2[false, 'A'].z

0.3046387127952048
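
As a rough analogy for the lookups above, a plain-Julia `Dict` keyed by index tuples behaves in a similar way. This is only a sketch and not how JuliaDB stores data: the names `nd_like` and `x_false` are made up, the values are rounded copies of the ones printed earlier, and Julia 1.x is assumed.

```julia
# Plain-Julia analogy (not JuliaDB): a Dict keyed by (x, y) index tuples.
nd_like = Dict(
    (false, 'A') => 0.304639,
    (false, 'B') => 1.56161,
    (true,  'A') => -1.61879,
    (true,  'B') => -0.228114,
)

# Full index: returns the single value stored under that key.
nd_like[(false, 'A')]

# Fix the first index and keep the second, similar in spirit to slicing
# one dimension and getting back a lower-dimensional mapping.
x_false = Dict(k[2] => v for (k, v) in nd_like if k[1] == false)
```

Unlike this `Dict`, the `NDSparse` outputs shown in this notebook keep their entries sorted by index.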

The entries in a `Table` are the rows of that table, whereas the entries in an `NDSparse` are the values to which its indices map.

Note the differences in the next two examples

In [20]:
# Table: iterate over NamedTuples of rows
for row in t3
    println(row)
end

(x = false, y = 'A', z = 0.3046387127952048)
(x = false, y = 'B', z = 1.561606495251595)
(x = true, y = 'A', z = -1.618791109249178)
(x = true, y = 'B', z = -0.22811446805049151)


In [21]:
# NDSparse: iterate over NamedTuples of values
for item in nd2
    println(item)
end

(z = 0.3046387127952048)
(z = 1.561606495251595)
(z = -1.618791109249178)
(z = -0.22811446805049151)


# Loading tables from CSVs

In the next example, we'll load data from a CSV file into a table.

To do this, we'll work with 8 different stocks' OHLC data up to November 2017.

We can start by looking at all the available files in our current directory. To do this, we can type `;ls`, where `;` allows us to execute shell commands from within a Jupyter notebook or the Julia REPL.

In [22]:
;ls

1. JuliaDB Basics.ipynb
2. Table Usage.ipynb
3. NDSparse Usage.ipynb
4. Distributed Data.ipynb
5. OnlineStats Integration.ipynb
diamonds.csv
languages.png
stocks
stocksample


The data we want are the files in the stocksample directory

In [23]:
;ls stocksample

aapl.us.txt
amzn.us.txt
dis.us.txt
googl.us.txt
ibm.us.txt
msft.us.txt
nflx.us.txt
tsla.us.txt


all of which have the same structure

In [24]:
;head stocksample/aapl.us.txt

Date,Open,High,Low,Close,Volume,OpenInt
1984-09-07,0.42388,0.42902,0.41874,0.42388,23220030,0
1984-09-10,0.42388,0.42516,0.41366,0.42134,18022532,0
1984-09-11,0.42516,0.43668,0.42516,0.42902,42498199,0
1984-09-12,0.42902,0.43157,0.41618,0.41618,37125801,0
1984-09-13,0.43927,0.44052,0.43927,0.43927,57822062,0
1984-09-14,0.44052,0.45589,0.44052,0.44566,68847968,0
1984-09-17,0.45718,0.46357,0.45718,0.45718,53755262,0
1984-09-18,0.45718,0.46103,0.44052,0.44052,27136886,0
1984-09-19,0.44052,0.44566,0.43157,0.43157,29641922,0


We can now use the `loadtable` function to load all files from the directory "stocksample" into one table

In [25]:
stocks = loadtable("stocksample")

Table with 56023 rows, 7 columns:
Date        Open     High     Low      Close    Volume    OpenInt
─────────────────────────────────────────────────────────────────
1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
1984-09-21  0.43286  0.44566  0.42388  0.42902  27842780  0
1984-09-24  0.42902  0.43157  0.42516  0.42516  22033109  0
⋮
2017-10-27  319.75   324.59   316.66   320.87   6970118   0
2017-10-30  319.18   323.78   317.25   320.08   4254

Now the data from all 8 files has been combined into one table, but nothing in the table records which stock (that is, which file) each row came from.

To fix this, we can instead create `stocks` with an extra column called `Ticker` that holds the name of the source file, via

In [26]:
stocks = loadtable("stocksample"; filenamecol = :Ticker)

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64
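
Conceptually, a file-name column just tags every row with the file it was read from. The following plain-Julia sketch shows that idea on two tiny temporary files. It is not JuliaDB's implementation: the file names `first.txt` and `second.txt` are hypothetical, the three data rows reuse values from the `aapl.us.txt` preview above, and Julia 1.x is assumed.

```julia
# Plain-Julia sketch (not JuliaDB's implementation): tag rows with their source file.
dir = mktempdir()
write(joinpath(dir, "first.txt"),  "Date,Close\n1984-09-07,0.42388\n1984-09-10,0.42134\n")
write(joinpath(dir, "second.txt"), "Date,Close\n1984-09-11,0.42902\n")

rows = NamedTuple[]
for file in sort(readdir(dir))
    lines = readlines(joinpath(dir, file))
    for line in lines[2:end]                     # skip the header line
        date, closing = split(line, ',')
        # The file name becomes an ordinary column value on every row.
        push!(rows, (Ticker = file, Date = String(date), Close = parse(Float64, closing)))
    end
end

foreach(println, rows)
```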

Furthermore, we can choose how data is sorted within our `Table` by specifying the keyword argument `indexcols`

In [27]:
stocks = loadtable("stocksample"; filenamecol = :Ticker, indexcols = [:Ticker, :Date])

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

JuliaDB prints a column summary if the table would print wider than the display (hardcoded in Jupyter as 80 characters).  However, we can override this:

In [28]:
IndexedTables.set_show_compact!(false)

stocks

Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

Now the entries of `stocks` are arranged by file.

Let's look at the first entry of `stocks`:

In [29]:
stocks[1]

(Ticker = "aapl.us.txt", Date = 1984-09-07, Open = 0.42388, High = 0.42902, Low = 0.41874, Close = 0.42388, Volume = 23220030, OpenInt = 0)

Using some shell commands, we can see that "aapl.us.txt" has 8365 lines (including a header), 

In [30]:
;wc -l stocksample/aapl.us.txt

    8365 stocksample/aapl.us.txt


so we would expect the first 8364 entries of `stocks` to contain data from "aapl.us.txt":

In [31]:
stocks[8364].Ticker

"aapl.us.txt"

In [32]:
stocks[8365].Ticker

"amzn.us.txt"

Note that the secondary index by which entries are sorted is "Date". Therefore, for entries from the same file, `stocks[n + 1]` refers to an entry with a date after that of `stocks[n]` for integer `n`.

In [33]:
stocks[1].Date

1984-09-07

In [34]:
stocks[2].Date

1984-09-10

In contrast, we could have chosen to arrange entries by the file from which they came and then by their "High" values.

In [35]:
stocks2 = loadtable("stocksample"; filenamecol = :Ticker, indexcols = [:Ticker, :High])

Table with 56023 rows, 8 columns:
Ticker         High     Date        Open     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  0.23564  1985-08-15  0.23305  0.23051  0.23051  29335821  0
"aapl.us.txt"  0.2369   1985-08-16  0.23305  0.23305  0.23305  23357459  0
"aapl.us.txt"  0.23949  1985-06-17  0.2369   0.2369   0.2369   65911889  0
"aapl.us.txt"  0.23949  1985-08-23  0.2369   0.23564  0.23564  12275315  0
"aapl.us.txt"  0.23949  1985-08-30  0.23949  0.23949  0.23949  11956721  0
"aapl.us.txt"  0.23949  1985-09-03  0.23949  0.23564  0.23564  10444953  0
"aapl.us.txt"  0.23949  1985-09-05  0.2369   0.2369   0.2369   9151833   0
"aapl.us.txt"  0.23949  1985-09-06  0.23949  0.23949  0.23949  25881231  0
"aapl.us.txt"  0.24202  1985-08-26  0.24202  0.24202  0.24202  9945186   0
"aapl.us.txt"  0.24202  1985-09-04  0.2369   0.2369   0.2369   13262346  0
"aapl.us.txt"  0.24202  1985-10-08  

For `stocks2`, the first 8364 entries are sorted by their "High" values rather than by date:

In [36]:
stocks2[11].High

0.24202

In [37]:
stocks2[12].High

0.24333

In [38]:
stocks2[11].Date

1985-10-08

In [39]:
stocks2[12].Date

1985-08-08

### What was Apple's closing price on 1986-02-10?

- For a `Table` object, this requires a query
- With an `NDSparse` object, we can just `getindex`

In [40]:
stocksnd = loadndsparse("stocksample", filenamecol=:Ticker, indexcols = [:Ticker, :Date])

stocksnd["aapl.us.txt", Date(1986, 2, 10)]

(Open = 0.38289, High = 0.39186, Low = 0.37906, Close = 0.38164, Volume = 31191161, OpenInt = 0)
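
The difference between the two access styles can be sketched in plain Julia, without JuliaDB. The two rows below are copied from outputs in this notebook; the names `rows`, `hits`, and `by_index` are illustrative only, and Julia 1.x is assumed.

```julia
using Dates

# Plain-Julia analogy (not JuliaDB). Rows copied from outputs shown in this notebook.
rows = [
    (Ticker = "aapl.us.txt", Date = Date(1984, 9, 7),  Close = 0.42388),
    (Ticker = "aapl.us.txt", Date = Date(1986, 2, 10), Close = 0.38164),
]

# Table-like access: run a query, here a predicate applied to every row.
hits = filter(r -> r.Ticker == "aapl.us.txt" && r.Date == Date(1986, 2, 10), rows)
println(hits[1].Close)

# NDSparse-like access: the index columns form the key, so lookup is direct.
by_index = Dict((r.Ticker, r.Date) => (Close = r.Close,) for r in rows)
println(by_index[("aapl.us.txt", Date(1986, 2, 10))].Close)
```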

# Saving tables into binary format

To save a table in binary format, we can simply use the `save` function.

`save` takes the table we'd like to save as its first input, and the saving destination as its second input.

In [41]:
save(stocks, "stocks")

Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

# Reloading a saved table

To load a saved table, we use the `load` function (rather than `loadtable`, which we saw earlier).

In [42]:
reloaded_stocks = load("stocks")

Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

In [43]:
stocks == reloaded_stocks

true
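
This notebook does not describe the on-disk format that `save` uses. As a loose analogy in plain Julia, the `Serialization` standard library shows the same save-then-reload round trip on an ordinary object. The file name `data.bin` and the toy `data` value are hypothetical, and Julia 1.1 or later is assumed.

```julia
using Serialization

# Toy data standing in for a table; not a JuliaDB object.
data = (x = [false, true], y = ['B', 'A'], z = [1.56161, -1.61879])

path = joinpath(mktempdir(), "data.bin")   # hypothetical file name
serialize(path, data)                       # write the object in binary form
restored = deserialize(path)                # read it back

println(restored == data)
```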

Note that reloading a table saved in binary format is fast:

In [44]:
# This timing is more dramatic with larger datasets
@time load("stocks")

  0.016667 seconds (279.94 k allocations: 6.434 MiB)


Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

# Leveraging Selectors

### Selectors are powerful ways to select and manipulate data

1. `Integer`: column at position
2. `Symbol`: column by name
3. `Array`: itself
4. `Pair{Selection => Function}`: function mapped to selection
5. `Tuple` of selections: table of each selection
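
The five rules above can be mimicked with a small plain-Julia helper. This is a sketch of the semantics only, not JuliaDB's code: `pick` and `cols` are hypothetical names, the columns are toy data, and the `Tuple` case returns a tuple of columns here, whereas the list above says JuliaDB returns a table. Julia 1.x is assumed.

```julia
# Plain-Julia sketch of the selector rules (hypothetical helper, not JuliaDB's code).
cols = (x = [false, true], y = ['B', 'A'], z = [1.5, -2.0])   # toy columns

pick(c, i::Integer)       = c[i]                              # column at position
pick(c, s::Symbol)        = c[s]                              # column by name
pick(c, a::AbstractArray) = a                                 # an array selects itself
pick(c, p::Pair)          = map(p.second, pick(c, p.first))   # function mapped over a selection
pick(c, t::Tuple)         = map(s -> pick(c, s), t)           # one result per selection

println(pick(cols, 3))
println(pick(cols, :z => abs))
println(pick(cols, (1, :y)))
```

Because the `Pair` and `Tuple` methods call `pick` recursively, selectors nest: `(1, :z => abs)` is itself a valid selection in this sketch.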

### Selectors show up everywhere
<code>select(t, <span style="color: green">which</span>)
map(f, t; <span style="color: green">select</span>)
reduce(f, t; <span style="color: green">select</span>)
filter(f, t; <span style="color: green">select</span>)
groupby(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
groupreduce(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
join(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
groupjoin(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
</code>

In [45]:
# Try selecting with Integer, Symbol, Pair, Tuple
select(stocks, 1)

56023-element Array{String,1}:
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 ⋮            
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"
 "tsla.us.txt"

In [46]:
select(stocks, :Date)

56023-element Array{Date,1}:
 1984-09-07
 1984-09-10
 1984-09-11
 1984-09-12
 1984-09-13
 1984-09-14
 1984-09-17
 1984-09-18
 1984-09-19
 1984-09-20
 1984-09-21
 1984-09-24
 1984-09-25
 ⋮         
 2017-10-26
 2017-10-27
 2017-10-30
 2017-10-31
 2017-11-01
 2017-11-02
 2017-11-03
 2017-11-06
 2017-11-07
 2017-11-08
 2017-11-09
 2017-11-10

In [47]:
select(stocks, (1, :Date))

Table with 56023 rows, 2 columns:
Ticker         Date
─────────────────────────
"aapl.us.txt"  1984-09-07
"aapl.us.txt"  1984-09-10
"aapl.us.txt"  1984-09-11
"aapl.us.txt"  1984-09-12
"aapl.us.txt"  1984-09-13
"aapl.us.txt"  1984-09-14
"aapl.us.txt"  1984-09-17
"aapl.us.txt"  1984-09-18
"aapl.us.txt"  1984-09-19
"aapl.us.txt"  1984-09-20
"aapl.us.txt"  1984-09-21
"aapl.us.txt"  1984-09-24
⋮
"tsla.us.txt"  2017-10-27
"tsla.us.txt"  2017-10-30
"tsla.us.txt"  2017-10-31
"tsla.us.txt"  2017-11-01
"tsla.us.txt"  2017-11-02
"tsla.us.txt"  2017-11-03
"tsla.us.txt"  2017-11-06
"tsla.us.txt"  2017-11-07
"tsla.us.txt"  2017-11-08
"tsla.us.txt"  2017-11-09
"tsla.us.txt"  2017-11-10