![](https://juliacomputing.com/assets/img/new/JuliaDB_logo2.svg)

# Intro to JuliaDB

JuliaDB is an analytical database. Here are some highlights:

- **Load multi-file datasets.**
- **Index the data and perform filter, aggregate, sort, and join operations.**
- **Compile queries.**
- **Store any data type.**
- **Save results and load them efficiently later.**
- **Use Julia's built-in parallelism to fully utilize any machine or cluster.**
- **Integrate with OnlineStats for analytics.**

# Helpful links

- [JuliaDB API Reference](http://juliadb.org/latest/api/index.html)
- [OnlineStats Docs](http://joshday.github.io/OnlineStats.jl/stable/)

# Let's get started

1. Create tables from Julia Vectors
1. Some differences between the two main data structures in JuliaDB
1. Load tables from CSVs
1. Save tables into binary format
1. Load a saved table

In [14]:
using JuliaDB

# Some data 
x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = randn(4);

# Data Structures

- `Table`
- `NDSparse`

#### Performance is similar between `Table` and `NDSparse`

Choose whichever makes sense for the data. For example, stock data, which is sparse over the domains of ticker symbol and timestamp, is represented well by `NDSparse`.

## `Table`: sorted by primary key(s)

In [15]:
t = table(@NT(x=x, y=y, z=z); pkey = [:x, :y])

Table with 4 rows, 3 columns:
x      y    z
─────────────────────
false  'A'  -0.446344
false  'B'  0.422846
true   'A'  -1.22027
true   'B'  -1.01038
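The sorting by primary key can be checked directly. The sketch below is a minimal illustration, not the notebook's own cell: it rebuilds the same `x` and `y` columns, swaps the random `z` for fixed placeholder values so the reordering is visible, and passes column names through the `names` keyword instead of `@NT` (an assumption made so the snippet does not depend on a particular named-tuple syntax).

```julia
using JuliaDB

# Same key columns as above; z holds placeholder values (not the randn data)
x = [false, true, false, true]
y = ['B', 'B', 'A', 'A']
z = [1.0, 2.0, 3.0, 4.0]

# Rows are given in arbitrary order; the table stores them sorted by (x, y)
t_sorted = table(x, y, z; names = [:x, :y, :z], pkey = [:x, :y])

# select returns a single column as a vector, in the table's stored order
println(select(t_sorted, :y))
println(select(t_sorted, :z))
```

The input row `(false, 'A', 3.0)` was third in the vectors but comes first in the table, because rows are ordered by `x` and then by `y`.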

## `NDSparse`: N-dimensional sparse array

In [21]:
nd = ndsparse(@NT(x=x, y=y), @NT(z=z))

2-d NDSparse with 4 values (1 field named tuples):
x      y   │ z
───────────┼──────────
false  'A' │ -0.446344
false  'B' │ 0.422846
true   'A' │ -1.22027
true   'B' │ -1.01038

## Accessing data: `Table` vs. `NDSparse`

In [18]:
t[1]

(x = false, y = 'A', z = -0.4463440310173379)

In [17]:
nd[false, 'A']

-0.4463440310173379

In [22]:
# Table: iterate over NamedTuples of rows
for row in t
    println(row)
end

(x = false, y = 'A', z = -0.4463440310173379)
(x = false, y = 'B', z = 0.42284612557404294)
(x = true, y = 'A', z = -1.220268057158368)
(x = true, y = 'B', z = -1.0103832553454009)


In [23]:
# NDSparse: iterate over NamedTuples of values
for item in nd
    println(item)
end

(z = -0.4463440310173379)
(z = 0.42284612557404294)
(z = -1.220268057158368)
(z = -1.0103832553454009)


# Loading Data

- Example data: daily OHLC (open, high, low, close) data for 8 different stocks, up to November 2017.

In [25]:
;ls stocksample

aapl.us.txt
amzn.us.txt
dis.us.txt
googl.us.txt
ibm.us.txt
msft.us.txt
nflx.us.txt
tsla.us.txt


In [27]:
;head stocksample/aapl.us.txt

Date,Open,High,Low,Close,Volume,OpenInt
1984-09-07,0.42388,0.42902,0.41874,0.42388,23220030,0
1984-09-10,0.42388,0.42516,0.41366,0.42134,18022532,0
1984-09-11,0.42516,0.43668,0.42516,0.42902,42498199,0
1984-09-12,0.42902,0.43157,0.41618,0.41618,37125801,0
1984-09-13,0.43927,0.44052,0.43927,0.43927,57822062,0
1984-09-14,0.44052,0.45589,0.44052,0.44566,68847968,0
1984-09-17,0.45718,0.46357,0.45718,0.45718,53755262,0
1984-09-18,0.45718,0.46103,0.44052,0.44052,27136886,0
1984-09-19,0.44052,0.44566,0.43157,0.43157,29641922,0


In [64]:
stocks = loadtable("stocksample"; filenamecol = :Ticker, indexcols = [:Ticker, :Date])

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

In [44]:
stocks[1]

(Ticker = "aapl.us.txt", Date = 1984-09-07, Open = 0.42388, High = 0.42902, Low = 0.41874, Close = 0.42388, Volume = 23220030, OpenInt = 0)

### What was Apple's closing price on 1986-02-10?

- With `Table`, this requires a query
- With `NDSparse`, this is just `getindex`
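For comparison, the `Table` query might look like the sketch below. It is a hedged illustration rather than the notebook's own code: `mini` is a hypothetical three-row stand-in for `stocks` with only a `Close` column, the two prices other than Apple's 1986-02-10 close are made-up placeholders, and `using Dates` assumes a Julia version where `Date` lives in the `Dates` standard library.

```julia
using JuliaDB
using Dates  # assumption: Date is provided by the Dates standard library

# Hypothetical miniature of `stocks`; only 0.38164 is taken from the real data,
# the other two Close values are placeholders.
mini = table(
    ["aapl.us.txt", "aapl.us.txt", "ibm.us.txt"],
    [Date(1986, 2, 7), Date(1986, 2, 10), Date(1986, 2, 10)],
    [0.38, 0.38164, 11.5];
    names = [:Ticker, :Date, :Close],
    pkey = [:Ticker, :Date],
)

# Keep only the rows whose key columns match the ticker and date we want
hit = filter(r -> r.Ticker == "aapl.us.txt" && r.Date == Date(1986, 2, 10), mini)

# Extract the Close column of the (one-row) result as a vector
close_price = select(hit, :Close)
println(close_price)
```

The same predicate should apply to the full `stocks` table; the `NDSparse` lookup avoids the explicit filter because the index columns act as array dimensions.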

In [57]:
stocksnd = loadndsparse("stocksample"; filenamecol = :Ticker, indexcols = [:Ticker, :Date])

stocksnd["aapl.us.txt", Date(1986, 2, 10)]

(Open = 0.38289, High = 0.39186, Low = 0.37906, Close = 0.38164, Volume = 31191161, OpenInt = 0)

# Saving Data

In [58]:
save(stocks, "stocks")

Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

# Reloading Data

In [65]:
# Loading the saved binary format is fast; the difference is more dramatic with larger datasets
@time stocks2 = load("stocks")

  0.021961 seconds (279.75 k allocations: 6.424 MiB, 28.32% gc time)


Table with 56023 rows, 8 columns:
Columns:
#  colname  type
───────────────────
1  Ticker   String
2  Date     Date
3  Open     Float64
4  High     Float64
5  Low      Float64
6  Close    Float64
7  Volume   Int64
8  OpenInt  Int64

In [61]:
stocks == stocks2

true