<img src="https://juliacomputing.com/assets/img/new/JuliaDB_logo2.svg" width="200">

# JuliaDB is an analytical database that...

- **Loads multi-file datasets**
- **Sorts data by index variables for fast filter, aggregation, sort, and join operations**
- **Compiles queries** (it's Julia all the way down)
- **Stores ANY data type** (again, it's Julia)
- **Saves tables to disk for fast reloading**
- **Uses Julia's built-in parallel computing features to fully utilize any machine or cluster**
- **Integrates with [OnlineStats](https://github.com/joshday/OnlineStats.jl) for big (or small) data analytics**

# Helpful links

For additional documentation, please refer to

- [JuliaDB Docs](http://juliadb.org/latest/api/index.html)
- [OnlineStats Docs](http://joshday.github.io/OnlineStats.jl/stable/)

# Getting Started

In this notebook, we'll introduce **JuliaDB**'s two main data structures (`Table` and `NDSparse`) by

1. Creating tables from vectors
1. Accessing data from `Table` and `NDSparse`
1. Loading tables from CSVs
1. Saving tables into binary format
1. Reloading a saved table
1. Using selectors

In [1]:
using JuliaDB

[1m[36mINFO: [39m[22m[36mRecompiling stale cache file /Users/joshday/.julia/lib/v0.6/CodecZlib.ji for module CodecZlib.
[39m[1m[36mINFO: [39m[22m[36mRecompiling stale cache file /Users/joshday/.julia/lib/v0.6/JuliaDB.ji for module JuliaDB.
[39m

# Creating tables from Julia Vectors

- First we'll make some data vectors:

In [2]:
x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = [.1, .3, .2, .4]
x, y, z

(Bool[false, true, false, true], ['B', 'B', 'A', 'A'], [0.1, 0.3, 0.2, 0.4])

## `Table`: Tabular Data Sorted by Primary Key(s)

- We can create a table in **JuliaDB** using the `table` function.

In [3]:
# x, y, z become columns of the table, with default numbering 1, 2, 3.

t1 = table(x, y, z)

Table with 4 rows, 3 columns:
1      2    3
───────────────
false  'B'  0.1
true   'B'  0.3
false  'A'  0.2
true   'A'  0.4

In [4]:
# The keyword argument `names` lets us label the columns

t2 = table(x,  y, z, names = [:x, :y, :z])

Table with 4 rows, 3 columns:
x      y    z
───────────────
false  'B'  0.1
true   'B'  0.3
false  'A'  0.2
true   'A'  0.4

### Sorting by Index Variables
- Furthermore, we can **sort** a table by a primary key (or keys), which can be set with the keyword argument `pkey`.
- Below, the rows of `t3` are sorted first by the values in column `x` and second by the values in column `y`.

In [5]:
t3 = table(x, y, z, names = [:x, :y, :z], pkey = (:x, :y))

Table with 4 rows, 3 columns:
x      y    z
───────────────
false  'A'  0.2
false  'B'  0.1
true   'A'  0.4
true   'B'  0.3

### Tuple vs. NamedTuple

- We can also build a `table` from a `NamedTuple` of vectors.
- A `NamedTuple` is created with the `@NT` macro.  Values can be accessed by position or name.

In [6]:
# Tuple

a = ("John", "Doe")

@show a[1];

a[1] = "John"


In [7]:
# NamedTuple

b = @NT(firstname = "John", lastname = "Doe")

@show b[1]
@show b[:firstname]
@show b.firstname;

b[1] = "John"
b[:firstname] = "John"
b.firstname = "John"


In [8]:
# Column names are taken from the NamedTuple

t4 = table(@NT(x=x, y=y, z=z), pkey = [:x, :y])

Table with 4 rows, 3 columns:
x      y    z
───────────────
false  'A'  0.2
false  'B'  0.1
true   'A'  0.4
true   'B'  0.3

## `NDSparse`: N-dimensional Sparse Array with Arbitrary Indexes

- Compare the following two data structures:

In [9]:
# maps (tuple of integers) -> value

sparse(reshape(z, 2, 2))

2×2 SparseMatrixCSC{Float64,Int64} with 4 stored entries:
  [1, 1]  =  0.1
  [2, 1]  =  0.3
  [1, 2]  =  0.2
  [2, 2]  =  0.4

In [10]:
# maps (tuple of arbitrary index types) -> value

ndsparse((x, y), z)

2-d NDSparse with 4 values (Float64):
1      2   │
───────────┼────
false  'A' │ 0.2
false  'B' │ 0.1
true   'A' │ 0.4
true   'B' │ 0.3

- Like `table`, we can use `ndsparse` with NamedTuples:

In [11]:
nd2 = ndsparse(@NT(x=x, y=y), z)

2-d NDSparse with 4 values (Float64):
x      y   │
───────────┼────
false  'A' │ 0.2
false  'B' │ 0.1
true   'A' │ 0.4
true   'B' │ 0.3

---

# Accessing data from `Table` and `NDSparse`

## Index into `Table`

- In the last section, we created the `Table`, `t3`:

In [12]:
t3

Table with 4 rows, 3 columns:
x      y    z
───────────────
false  'A'  0.2
false  'B'  0.1
true   'A'  0.4
true   'B'  0.3

- We can get the first row of `t3` by indexing:

In [13]:
t3[1]

(x = false, y = 'A', z = 0.2)

- We can get an individual element in the row by using the column name:

In [14]:
t3[1].z

0.2

## Index into `NDSparse`

- Since `NDSparse` acts like an array, accessing data is slightly different.
- We must look up a value by using the index columns:

In [15]:
nd2[false, 'A']

0.2

- As with an array, a colon selects everything along a dimension, so we can look up all the rows where `x` is `false`:

In [16]:
nd2[false, :]

1-d NDSparse with 2 values (Float64):
y   │
────┼────
'A' │ 0.2
'B' │ 0.1

## Iteration

- `Table` and `NDSparse` also differ in how they iterate over the data. 

In [17]:
# Table: iterate over Tuples/NamedTuples of rows

for row in t3
    println(row)
end

(x = false, y = 'A', z = 0.2)
(x = false, y = 'B', z = 0.1)
(x = true, y = 'A', z = 0.4)
(x = true, y = 'B', z = 0.3)


In [18]:
# NDSparse: iterate over values

for item in nd2
    println(item)
end

0.2
0.1
0.4
0.3


---

# Loading tables from CSVs

- Now we'll load data from multiple CSV files into a table.
- Our example data is OHLC (open, high, low, close) price data for 8 different stocks, with each stock in a separate file.
    - We can start by looking at all the available files in the `stocksample` directory.
    - `;` allows us to run shell commands from Jupyter or a Julia REPL.
    

In [19]:
;ls stocksample

aapl.us.txt
amzn.us.txt
dis.us.txt
googl.us.txt
ibm.us.txt
msft.us.txt
nflx.us.txt
tsla.us.txt


- Each txt file has the structure:

In [20]:
;head stocksample/aapl.us.txt

Date,Open,High,Low,Close,Volume,OpenInt
1984-09-07,0.42388,0.42902,0.41874,0.42388,23220030,0
1984-09-10,0.42388,0.42516,0.41366,0.42134,18022532,0
1984-09-11,0.42516,0.43668,0.42516,0.42902,42498199,0
1984-09-12,0.42902,0.43157,0.41618,0.41618,37125801,0
1984-09-13,0.43927,0.44052,0.43927,0.43927,57822062,0
1984-09-14,0.44052,0.45589,0.44052,0.44566,68847968,0
1984-09-17,0.45718,0.46357,0.45718,0.45718,53755262,0
1984-09-18,0.45718,0.46103,0.44052,0.44052,27136886,0
1984-09-19,0.44052,0.44566,0.43157,0.43157,29641922,0


- We can now use the `loadtable` function to load all files in the `stocksample` directory into one table:

In [21]:
stocks = loadtable("stocksample")

Table with 56023 rows, 7 columns:
Date        Open     High     Low      Close    Volume    OpenInt
─────────────────────────────────────────────────────────────────
1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
1984-09-21  0.43286  0.44566  0.42388  0.42902  27842780  0
1984-09-24  0.42902  0.43157  0.42516  0.42516  22033109  0
⋮
2017-10-27  319.75   324.59   316.66   320.87   6970118   0
2017-10-30  319.18   323.78   317.25   320.08   4254

- All data from the 8 files is now in one table, but we don't have the stock ticker information!
- `loadtable` has a `filenamecol` option that will add a column for us, which we will call `:Ticker`:

In [22]:
stocks = loadtable("stocksample"; filenamecol = :Ticker)

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

- Furthermore, we can use the `indexcols` option to specify how the data is sorted.
  - First we'll sort by `:Ticker`
  - Then sort by `:Date`

In [23]:
stocks = loadtable("stocksample"; filenamecol = :Ticker, indexcols = [:Ticker, :Date])

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

- Notice that the printing style changed once the table had 8 columns. A summary is printed when the display is too narrow to show every column. In Jupyter, the width is hardcoded as 80 characters, so the table actually fits in this case. We can override this behavior with:

In [24]:
IndexedTables.set_show_compact!(false)

stocks

Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  
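To experiment with the `filenamecol` and `indexcols` options without the stock files, the sketch below writes two tiny CSV files to a temporary directory and loads them. The file names, the two-column layout, and the variable names `dir` and `prices` are made up for illustration; only `loadtable` and its two options come from the cells above.

```julia
using JuliaDB

# Hypothetical sample data: two small CSV files in a temporary directory.
dir = mktempdir()
write(joinpath(dir, "aaa.csv"), "Date,Close\n2017-01-03,1.0\n2017-01-04,2.0\n")
write(joinpath(dir, "bbb.csv"), "Date,Close\n2017-01-03,3.0\n2017-01-04,4.0\n")

# Load every file in the directory into one table, record the source file
# in a :Ticker column, and sort by :Ticker first, then :Date.
prices = loadtable(dir; filenamecol = :Ticker, indexcols = [:Ticker, :Date])
```

Each file contributes two rows, so `prices` should hold four rows in total.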

## Motivation for `NDSparse`

### What was Apple's closing price on 1986-02-10?

- For a `Table`, this requires a query
- With `NDSparse`, this is just `getindex`
    - Load data as `NDSparse` and use arbitrary indexes (`String` and `Date`) to look up the closing price:

In [25]:
# Load data as NDSparse:

stocksnd = loadndsparse("stocksample", filenamecol=:Ticker, indexcols = [:Ticker, :Date])

# Get the value associated with Apple and 1986-02-10:

stocksnd["aapl.us.txt", Date(1986, 2, 10)].Close

0.38164
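The same contrast can be sketched on the toy vectors from the first section, so it runs without the stock files. This is a minimal sketch under a few assumptions: the vectors are rebuilt here so the block stands alone, the index tuple passed to `ndsparse` is unnamed to avoid the `@NT` macro, and the names `t`, `nd`, and `hit` are only illustrative.

```julia
using JuliaDB

x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = [0.1, 0.3, 0.2, 0.4]

t  = table(x, y, z, names = [:x, :y, :z], pkey = (:x, :y))
nd = ndsparse((x, y), z)

# Table: a query. Filter the rows, then pull out the column we want.
hit = filter(r -> r.x == false && r.y == 'A', t)
z_from_table = select(hit, :z)

# NDSparse: a direct lookup by index values.
z_from_nd = nd[false, 'A']
```

The table query returns a one-element vector, while the `NDSparse` lookup returns the value itself.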

# Saving Tables to Disk

- **JuliaDB** can save tables into a binary format that can be loaded efficiently in future Julia sessions.
    - `save(table, destination)`

In [26]:
save(stocks, "stocks.jdb")

Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

# Reloading a Saved Table

- To load a saved table, we use the `load` function rather than `loadtable`.
- This is typically **much** faster than reloading from the CSVs.

In [27]:
@time reloaded_stocks = load("stocks.jdb")

  0.025132 seconds (279.94 k allocations: 6.434 MiB)


Table with 56023 rows, 8 columns:
Ticker         Date        Open     High     Low      Close    Volume    OpenInt
────────────────────────────────────────────────────────────────────────────────
"aapl.us.txt"  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030  0
"aapl.us.txt"  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532  0
"aapl.us.txt"  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199  0
"aapl.us.txt"  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801  0
"aapl.us.txt"  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062  0
"aapl.us.txt"  1984-09-14  0.44052  0.45589  0.44052  0.44566  68847968  0
"aapl.us.txt"  1984-09-17  0.45718  0.46357  0.45718  0.45718  53755262  0
"aapl.us.txt"  1984-09-18  0.45718  0.46103  0.44052  0.44052  27136886  0
"aapl.us.txt"  1984-09-19  0.44052  0.44566  0.43157  0.43157  29641922  0
"aapl.us.txt"  1984-09-20  0.43286  0.43668  0.43286  0.43286  18453585  0
"aapl.us.txt"  1984-09-21  0.43286  

In [28]:
stocks == reloaded_stocks

true
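The save-and-reload round trip can also be tried on a small table. The sketch below assumes a temporary directory is an acceptable destination; the file name `toy.jdb` and the variable names are hypothetical.

```julia
using JuliaDB

toy = table([false, true, false, true], ['B', 'B', 'A', 'A'], [0.1, 0.3, 0.2, 0.4],
            names = [:x, :y, :z], pkey = (:x, :y))

# Save to a throwaway location, then read the table back.
path = joinpath(mktempdir(), "toy.jdb")
save(toy, path)
reloaded_toy = load(path)

# The reloaded table should compare equal to the original.
same = (toy == reloaded_toy)
```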

---

# Using Selectors

### Selectors are powerful ways to select and manipulate data

1. `Integer`: column at position
2. `Symbol`: column by name
3. `Array`: itself
4. `Pair{Selection => Function}`: function mapped to selection
5. `Tuple` of selections: table of each selection
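As a rough illustration of these selector kinds, here is a minimal sketch on a four-row toy table. The table and the variable names are assumptions made for the example, and only the `Integer`, `Symbol`, `Pair`, and `Tuple` forms are shown.

```julia
using JuliaDB

x = [false, true, false, true]
z = [0.1, 0.3, 0.2, 0.4]
t = table(x, z, names = [:x, :z])

by_position = select(t, 2)               # Integer: the second column
by_name     = select(t, :z)              # Symbol: the column named :z
doubled     = select(t, :z => v -> 2v)   # Pair: function mapped over the selection
sub         = select(t, (:x, :z))        # Tuple: a table of the selected columns
```

The first three calls return vectors, while the tuple selector returns a table.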

### Selectors show up in many places (everything in green)

<code>select(t, <span style="color: green">which</span>)
map(f, t; <span style="color: green">select</span>)
reduce(f, t; <span style="color: green">select</span>)
filter(f, t; <span style="color: green">select</span>)
groupby(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
groupreduce(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
join(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
groupjoin(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
</code>
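A few of these functions can be sketched with a `select` or `by` argument as follows. The toy table uses integer values so the sums are exact; the data and the variable names are hypothetical.

```julia
using JuliaDB

t = table([false, true, false, true], ['B', 'B', 'A', 'A'], [1, 3, 2, 4],
          names = [:x, :y, :z])

big     = filter(v -> v > 2, t; select = :z)       # keep rows where z > 2
total   = reduce(+, t; select = :z)                # sum of column z
sums    = groupreduce(+, t, :x; select = :z)       # sum of z within each value of x
doubled = map(v -> 2v, t; select = :z)             # function applied to each z
```

Here `by` chooses the grouping column and `select` chooses what the function receives, so the function itself never needs to know the column names.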

## Selector Examples

- A single selection returns a Vector:

In [29]:
select(stocks, 1)[1:5]

5-element Array{String,1}:
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"
 "aapl.us.txt"

In [30]:
select(stocks, :Date)[1:5]

5-element Array{Date,1}:
 1984-09-07
 1984-09-10
 1984-09-11
 1984-09-12
 1984-09-13

- Multiple selections return a table:

In [31]:
select(stocks, (1, :Date))[1:5]

Table with 5 rows, 2 columns:
Ticker         Date
─────────────────────────
"aapl.us.txt"  1984-09-07
"aapl.us.txt"  1984-09-10
"aapl.us.txt"  1984-09-11
"aapl.us.txt"  1984-09-12
"aapl.us.txt"  1984-09-13