okay edited this page Jun 12, 2016 · 9 revisions

Summary

Sybil is a command-line program that reads JSON records from stdin (one per line) and saves them on disk in a column-based format for fast querying. Column storage lets the query engine read less data off disk when running aggregations. For table scans that do not touch every field in a dataset, this improves query time over traditional row- or document-based DBs. Somewhat unlike other DBs, Sybil also runs its full table scans in parallel (using as many CPUs as GOMAXPROCS allows) to speed up queries.

Installation

# check go path
echo $GOPATH

# if not set, you can set it with 'export'
# (also put the below in your .bashrc to set it when you log in again)
export GOPATH=~/go

# create the GOPATH directory if it doesn't exist
mkdir -p "$GOPATH"

# install sybil
go get github.com/logv/sybil

Data Format

Sybil uses ordinary JSON notation to ingest records. A sample record looks like the following (the // comments are annotations for this page only; real input must be valid JSON, one record per line):

{
  // time is in seconds since the epoch and is the only required field
  "time": 1461765374,
  // sybil supports ints (up to int64)
  "age": 28,
  // sybil supports string columns
  "country": "USA",
  "state": "NY",
  "favorite_food": "ice cream",
  "gym_membership": "no",
  // and sybil supports sets
  "favorite_bands": [ "the doors", "talking heads" ]
}

Sybil doesn't require table schemas to be defined beforehand, but once a column has been defined, its type should not change. Note that "0" and 0 are not the same in JSON: the first is a string, the second is an integer.
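One way to guard against accidental type changes is to check records before piping them to sybil. The sketch below is a client-side pre-check, not part of sybil itself: `to_ingest_lines` is a hypothetical helper that writes one JSON object per line, insists on the required `time` field, and rejects a column whose type differs from the first value seen for it.

```python
import json

def to_ingest_lines(records):
    """Serialize records as one JSON object per line, checking column types."""
    seen_types = {}  # column name -> Python type first seen for that column
    lines = []
    for record in records:
        if "time" not in record:
            raise ValueError("record is missing the required 'time' field")
        for column, value in record.items():
            expected = seen_types.setdefault(column, type(value))
            if type(value) is not expected:
                raise TypeError(
                    f"column {column!r} changed type from "
                    f"{expected.__name__} to {type(value).__name__}"
                )
        lines.append(json.dumps(record))
    return lines

# Two sample records; the field names mirror the example above.
records = [
    {"time": 1461765374, "age": 28, "country": "USA"},
    {"time": 1461765380, "age": 31, "country": "CA"},
]
lines = to_ingest_lines(records)
print("\n".join(lines))
```

The printed lines can be saved to a file and fed to the ingest command on stdin. The check only covers the records passed in one call; it knows nothing about types already stored in an existing table.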

Importing Data

The sybil binary is split into multiple subcommands. To import data, use the 'ingest' command and supply JSON samples on stdin, one per line. The command below creates a new directory, 'db/my_first_table':

sybil ingest -table my_first_table < json_samples.json

Import from a MongoDB collection:

mongoexport --collection my_collection | sybil ingest -table my_table

Import from a CSV file. The first line must contain comma-separated headers:

sybil ingest -csv -table my_csv_table < some_csv.csv

Examine the db file structure:

ls -R db/
# look at disk space usage
du -ch db/

Querying Data

Sybil supports several query types: rollup (aka Table), time series, distributions and raw samples.

Rollup queries are the default query in sybil. The fields to group by, the fields to aggregate, and the filters are supplied on the command line, and sybil prints either a formatted table or JSON output to stdout. A simple query that outputs a formatted table would be:

sybil query -table my_table -group col1,col2,col3 -int col4,col5,col6

Use the -json flag to have sybil print the output as JSON. By default, sybil is quite verbose on stderr. Redirect stderr (for example, by appending 2>/dev/null to the command) to quiet sybil down and see just the results.
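When calling sybil from a script, the same two ideas (JSON output, silenced stderr) can be combined. The following is a rough sketch, assuming sybil is on the PATH and is run from the directory that holds db/. `rollup_args` and `run_rollup` are hypothetical helper names, and the shape of the JSON that sybil returns is not assumed here.

```python
import json
import subprocess

def rollup_args(table, group_by, int_cols):
    """Build the argv for a rollup query that prints JSON."""
    return [
        "sybil", "query",
        "-table", table,
        "-group", ",".join(group_by),
        "-int", ",".join(int_cols),
        "-json",
    ]

def run_rollup(table, group_by, int_cols):
    """Run the query, discard stderr, and parse stdout as JSON."""
    result = subprocess.run(
        rollup_args(table, group_by, int_cols),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # sybil's verbose logging goes to stderr
        check=True,                 # raise if sybil exits with an error
    )
    return json.loads(result.stdout)

# Only build and show the command here; run_rollup needs sybil installed.
args = rollup_args("my_table", ["col1", "col2"], ["col4"])
print(" ".join(args))
```

Passing the arguments as a list avoids shell quoting problems with column names or filter values.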

Listing tables and viewing table info are straightforward:

sybil query -tables
sybil query -table my_table -info

Retrieving samples is similar:

sybil query -table my_table -samples -limit 5

Filters

Most queries support filters. Each filter is tested against every record before aggregation to decide whether that record should be included. Filters are supplied as command line arguments to sybil. The format for a filter string is: -*-filter col:op:val,col:op:val where the flag is one of:

  • -str-filter - supports string regexes using re and nre
  • -int-filter - supports eq, neq, gt and lt
  • -set-filter - supports in and nin

An easy and common filter trick is to use the date command:

 # specify a filter on time greater than 1 hour ago
 -int-filter time_col:gt:`date --date="-1 hour" +%s`
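For scripted queries, the col:op:val format can be assembled programmatically. This is a minimal sketch: `filter_args` is a hypothetical helper, the operator table simply mirrors the list above, and values containing ':' or ',' are not escaped because the escaping rules are not described here.

```python
import time

# Operators listed above for each filter flag.
FILTER_OPS = {
    "str": {"re", "nre"},
    "int": {"eq", "neq", "gt", "lt"},
    "set": {"in", "nin"},
}

def filter_args(kind, conditions):
    """Return ['-<kind>-filter', 'col:op:val,...'] for (col, op, val) tuples."""
    for col, op, _ in conditions:
        if op not in FILTER_OPS[kind]:
            raise ValueError(f"{op!r} is not a {kind} filter operator")
    spec = ",".join(f"{col}:{op}:{val}" for col, op, val in conditions)
    return [f"-{kind}-filter", spec]

# Python equivalent of the date trick: epoch seconds one hour ago.
one_hour_ago = int(time.time()) - 3600
print(filter_args("int", [("time_col", "gt", one_hour_ago), ("age", "lt", 40)]))
print(filter_args("str", [("country", "re", "^US")]))
```

The returned list can be appended to the argument list of a sybil query command.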

Time Queries

To run a time series query in sybil, specify the -time, -time-col <FIELD> and optionally -time-bucket <SECONDS> flags to sybil. Adding an int filter on the time range is useful, because it lets sybil only look at blocks relevant to your query.
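As an illustration, the flags named above can be combined into one argument list. This sketch uses only flags mentioned on this page; `time_series_args` is a hypothetical helper, and whether a real query also needs grouping or aggregation flags is left open.

```python
import time

def time_series_args(table, time_col, bucket_seconds=None, since=None):
    """Build the argv for a time series query, optionally bucketed and filtered."""
    args = ["sybil", "query", "-table", table, "-time", "-time-col", time_col]
    if bucket_seconds is not None:
        args += ["-time-bucket", str(bucket_seconds)]
    if since is not None:
        # An int filter on the time column lets sybil skip irrelevant blocks.
        args += ["-int-filter", f"{time_col}:gt:{since}"]
    return args

# Hourly buckets over the last 24 hours.
day_ago = int(time.time()) - 24 * 3600
print(" ".join(time_series_args("my_table", "time", 3600, day_ago)))
```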

Distributions

Sybil also supports histogram queries via the '-op hist' flag. Supplying this flag tells sybil to create a histogram for each row in the group by result.

More Info

There are more examples and information throughout the rest of this wiki. Please get in touch if you have any questions, comments, or feedback, or if you want more information. Thanks!