# <center>RumbleDB sandbox</center>


This is a RumbleDB sandbox that allows you to play with simple JSONiq queries.

It is a Jupyter notebook that you can also download and execute on your own machine, but if you arrived here from the RumbleDB website, it is likely to be shown within Google's Colab environment.

To get started, you first need to execute the cell below to activate the RumbleDB magic (you do not need to understand what it does, this is just initialization Python code).

In [None]:
!pip install rumbledb
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://public.rumbledb.org:9090/jsoniq

By default, this notebook uses a small public backend provided by us. Each query runs on a single machine with very limited resources (one core and 1 GB of memory) and with only the http scheme activated. This is sufficient to discover RumbleDB and play a bit, but it is not intended for production use. If you need RumbleDB in production, you can use it with an installation of Spark, either on your own machine or on a cluster.

This sandbox backend may occasionally break, especially if too many users use it at the same time, so please bear with us! The system is automatically restarted every day so, if it stops working, you can either try again in 24 hours or notify us.


It is straightforward to execute your own RumbleDB server on your own Spark cluster (and then you can make full use of all the input file systems and file formats). In this case, just replace the above server with your own hostname and port. Note that if you run RumbleDB as a server locally, you will also need to download and use this notebook locally rather than in this Google Colab environment as, obviously, your personal computer cannot be accessed from the Web.

Now we are all set! You can now start reading and executing the JSONiq queries as you go, and you can even edit them!

## JSON

As explained on the [official JSON Web site](http://www.json.org/), JSON is a lightweight data-interchange format designed for humans as well as for computers. It supports as values:
- objects (string-to-value maps)
- arrays (ordered sequences of values)
- strings
- numbers
- booleans (true, false)
- null

JSONiq provides declarative querying and updating capabilities on JSON data.

## Elevator Pitch

JSONiq is based on XQuery, which is a W3C standard (like XML and HTML). XQuery is a very powerful declarative language that was originally designed to manipulate XML data, but it turns out that it is also a very good fit for manipulating JSON natively.
JSONiq, since it extends XQuery, is a very powerful general-purpose declarative programming language. Our experience is that, for the same task, you will probably write about 80% less code compared to imperative languages like JavaScript, Python or Ruby. Additionally, you get the benefits of strong type checking without actually having to write type declarations.
Here is an appetizer before we start the tutorial from scratch.


In [15]:
%%jsoniq

let $stores :=
[
  { "store number" : 1, "state" : "MA" },
  { "store number" : 2, "state" : "MA" },
  { "store number" : 3, "state" : "CA" },
  { "store number" : 4, "state" : "CA" }
]
let $sales := [
   { "product" : "broiler", "store number" : 1, "quantity" : 20  },
   { "product" : "toaster", "store number" : 2, "quantity" : 100 },
   { "product" : "toaster", "store number" : 2, "quantity" : 50 },
   { "product" : "toaster", "store number" : 3, "quantity" : 50 },
   { "product" : "blender", "store number" : 3, "quantity" : 100 },
   { "product" : "blender", "store number" : 3, "quantity" : 150 },
   { "product" : "socks", "store number" : 1, "quantity" : 500 },
   { "product" : "socks", "store number" : 2, "quantity" : 10 },
   { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
  for $store in $stores[], $sale in $sales[]
  where $store."store number" = $sale."store number"
  return {
    "nb" : $store."store number",
    "state" : $store.state,
    "sold" : $sale.product
  }
return [$join]



Took: 0.4020528793334961 ms
[{"nb": 1, "state": "MA", "sold": "broiler"}, {"nb": 1, "state": "MA", "sold": "socks"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "socks"}, {"nb": 3, "state": "CA", "sold": "toaster"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "shirt"}]


## All JSON values are JSONiq, too

The first thing you need to know is that a well-formed JSON document is a JSONiq expression as well.
This means that you can copy-and-paste any JSON document into a query. The following are JSONiq queries that are "idempotent" (they just output themselves):

In [3]:
%%jsoniq
{ "pi" : 3.14, "sq2" : 1.4 }

Took: 0.05497384071350098 ms
{"pi": 3.14, "sq2": 1.4}


In [4]:
%%jsoniq
[ 2, 3, 5, 7, 11, 13 ]

Took: 0.07255315780639648 ms
[2, 3, 5, 7, 11, 13]


In [5]:
%%jsoniq
{
      "operations" : [
        { "binary" : [ "and", "or"] },
        { "unary" : ["not"] }
      ],
      "bits" : [
        0, 1
      ]
    }

Took: 0.06504130363464355 ms
{"operations": [{"binary": ["and", "or"]}, {"unary": ["not"]}], "bits": [0, 1]}


In [6]:
%%jsoniq
[ { "Question" : "Ultimate" }, ["Life", "the universe", "and everything"] ]

Took: 0.08156394958496094 ms
[{"Question": "Ultimate"}, ["Life", "the universe", "and everything"]]


This works with objects, arrays (even nested), strings, numbers, booleans, null.

It also works the other way round: if your query outputs an object, you can use it as a JSON document.
JSONiq is a declarative language. This means that you only need to say what you want - the compiler will take care of the how. 

In the above queries, you are basically saying: I want to output this JSON content, and here it is.

## Navigating an existing JSON dataset

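Next, let us look at an existing dataset on the Web. We picked a [GitHub archive file](https://gharchive.org)
that we stored for convenience at this location: https://www.rumbledb.org/samples/git-archive.json. The first queries below use a smaller sample file with 500 objects, git-archive-small.json, stored in the same folder.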

Accessing a JSON dataset can be done in two ways depending on the exact format:

- If this is a file that contains a single JSON object spread over multiple lines, use json-doc(URL).
- If this is a file that contains one JSON object per line (JSON Lines), use json-file(URL).

The GitHub archive dataset is in the JSON Lines format, so we open it with json-file.
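For readers coming from Python, the difference between the two file formats can be sketched with the standard json module. This only illustrates the formats, not how RumbleDB reads them; the helper names parse_json_doc and parse_json_lines and the sample strings are made up for this example.

```python
import json

def parse_json_doc(text):
    """Parse a text holding a single JSON value, possibly spread over many lines."""
    return json.loads(text)

def parse_json_lines(text):
    """Parse a JSON Lines text: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical sample inputs, for illustration only.
single_doc = '{\n  "type": "PushEvent",\n  "public": true\n}'
json_lines = '{"type": "PushEvent"}\n{"type": "WatchEvent"}\n'

doc = parse_json_doc(single_doc)        # one object
events = parse_json_lines(json_lines)   # a list with one object per line
print(doc["type"], len(events))
```

Feeding the JSON Lines text to parse_json_doc would fail, because the text as a whole is not a single JSON value. This is why it matters to pick the function that matches the file format.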

In [103]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")

Took: 0.4589049816131592 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}
{"id": "7045

This is a large file and the previous query output 500 JSON objects. To look closer, let us start looking at just the first object with a number predicate.

In [92]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1]

Took: 0.19023895263671875 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}


We can see that there are nested objects and arrays. This is perfect for JSONiq. Let us now figure out all the keys used in this dataset with the keys() function.

In [93]:
%%jsoniq
keys(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Took: 0.8561930656433105 ms
"repo"
"type"
"created_at"
"payload"
"org"
"id"
"public"
"actor"


Let us look closer at the key called "type". What values does it take? We can use dot-based navigation to navigate down to these values. This will work nicely on the entire dataset.

In [94]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").type

Took: 0.29942774772644043 ms
"PushEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"GollumEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"PushEvent"
"IssuesEvent"
"PushEvent"
"PullRequestEvent"
"WatchEvent"
"PushEvent"
"WatchEvent"
"PullRequestEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PushEvent"
"IssueCommentEvent"
"CreateEvent"
"IssuesEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"ForkEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"WatchEvent"
"PushEvent"
"DeleteEvent"
"PushEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"CreateEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PullRequestEvent"
"IssuesEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"PushEvent"
"PushEvent"
"WatchEvent"
"CreateEvent"
"PushEvent"
"PushEv

It looks like there are a lot of duplicates in there. Let us use distinct-values() to figure out all unique values.

In [95]:
%%jsoniq
distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type)

Took: 0.7523598670959473 ms
"CommitCommentEvent"
"IssueCommentEvent"
"PullRequestEvent"
"ReleaseEvent"
"MemberEvent"
"PushEvent"
"IssuesEvent"
"GollumEvent"
"ForkEvent"
"PullRequestReviewCommentEvent"
"DeleteEvent"
"CreateEvent"
"WatchEvent"


So we see that for the key "type", all values are strings and there are only... how many, by the way? Let us use count().

In [96]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type))

Took: 0.3749690055847168 ms
13


So there are 13. Note that count() works just as well on the entire dataset, to know how many objects there are.

In [97]:
%%jsoniq
count(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Took: 0.22742390632629395 ms
500


Let us now look at nested objects. It seems the key "actor" has these, so let us use the dot object lookup to find all these values.

In [98]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor

Took: 0.3087191581726074 ms
{"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}
{"id": 17426563, "login": "tumhopaasmere", "display_login": "tumhopaasmere", "gravatar_id": "", "url": "https://api.github.com/users/tumhopaasmere", "avatar_url": "https://avatars.githubusercontent.com/u/17426563?"}
{"id": 1449578, "login": "daa84", "display_login": "daa84", "gravatar_id": "", "url": "https://api.github.com/users/daa84", "avatar_url": "https://avatars.githubusercontent.com/u/1449578?"}
{"id": 22536460, "login": "thautwarm", "display_login": "thautwarm", "gravatar_id": "", "url": "https://api.github.com/users/thautwarm", "avatar_url": "https://avatars.githubusercontent.com/u/22536460?"}
{"id": 18603467, "login": "markstachowski", "display_login": "markstachowski", "gravatar_id": "", "url": "https://api.github.com/users/markstachowski", "avatar_u

We can chain dot object lookups to navigate further down, for example to logins. Let us figure out how many distinct logins there are.

In [99]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.login))

Took: 0.5600190162658691 ms
374


The id field inside the actor object seems to be an integer. What is the highest value? The max() function also works at large scales, just like count() and also min(), avg() and sum().

In [100]:
%%jsoniq
max(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.id)

Took: 0.3411102294921875 ms
35003609


Alright, let us now look for nested arrays. There do not seem to be any inside the actor object, so let us try the key "payload". Let us just look at the first one.

In [101]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload

Took: 0.17327165603637695 ms
{"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}


Here we see that there is a nested array associated with key "commits".

In [102]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload.commits

Took: 0.17044281959533691 ms
[{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]


In this case, there is only one object in this array. Is there, by any chance, any one of these arrays that has more than one commit? For this, we can use a Boolean predicate. Let us evaluate the predicate

size($$) gt 1

which uses the size function and the gt (greater than) comparison and where $$ is the current array being tested.

In [50]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[size($$) gt 1]

Took: 1.6675620079040527 ms
[{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}, {"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}, {"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}, {

Let us just take the first one to have more visibility.

In [47]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[size($$) gt 1][1]

Took: 0.9169921875 ms
[{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}, {"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}, {"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}, {"sha":

We can expand it to a sequence of objects using the [] array unboxing syntax.

In [51]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[size($$) gt 1][1][]

Took: 0.9378659725189209 ms
{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}
{"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}
{"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}
{"sha

We can also look up a specific position, say, the second object, with the [[ ]] array lookup syntax.

In [58]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[size($$) gt 1][1][[2]]

Took: 0.9309651851654053 ms
{"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}


And now, please hold for something awesome. We can unbox all arrays of the entire collection at the same time by just using the [] syntax on the entire dataset.

In [52]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[]

Took: 1.629582166671753 ms
{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}
{"sha": "45b2f857540d7d4286d1abef204aef167190be0f", "author": {"name": "tumhopaasmere", "email": "bcc6c59276ad7bbcd0b972dd58baaef7cccc22d4@mailinator.com"}, "message": "GIT CloneShare Commit", "distinct": true, "url": "https://api.github.com/repos/tumhopaasmere/tumhopaasmere/commits/45b2f857540d7d4286d1abef204aef167190be0f"}
{"sha": "ea291a9baea441ea815e822bba5e8c9f330542f7", "author": {"name": "thautwarm", "email": "820a7b45b87f3c40f5e1c273015816c9c19a8401@outlook.com"}, "message": "API overview and example", "distinct": true, "url": "https://api.github.com/repos/thautwarm/EBNFParser/commits/ea291a9baea441ea815e822bba5e8c9f330542f7"}
{"sha": "95e600df

These are objects. It is all too tempting to navigate further down with more dot object-lookup syntax. All at the same time, obviously. Let us figure out how many unique emails there are in all commits of all events.

In [57]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[].author.email))

Took: 1.5451831817626953 ms
10275


Now, how many unique emails are there if we only consider the first commit of each event?

In [59]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive.json").payload.commits[[1]].author.email))

Took: 1.5219521522521973 ms
9363


You have now learned how to navigate large JSON datasets with the dot object lookup syntax, the [] array unboxing syntax, the [[ ]] array lookup syntax, number predicates, and Boolean predicates.

All of these work nicely on very large sequences, and you can chain them arbitrarily. In fact, this will all happen in parallel on the cores of your machine or even on a large cluster.

You also saw how to aggregate large sequences of values with min, max, count, avg and sum.

Finally, you saw how to eliminate duplicates with distinct-values.
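As a mental model only, these navigation operators can be imitated in plain Python by representing a JSONiq sequence as a Python list. The function names below (lookup, unbox, array_lookup) and the sample events are invented for this sketch and are not part of RumbleDB or JSONiq. The sketch assumes that items of the wrong kind, missing keys and out-of-range positions simply contribute nothing to the result.

```python
def lookup(seq, key):
    """Model of the dot operator: keep the value of `key` for every object that has it."""
    return [item[key] for item in seq if isinstance(item, dict) and key in item]

def unbox(seq):
    """Model of []: replace every array by its members, in order."""
    return [member for item in seq if isinstance(item, list) for member in item]

def array_lookup(seq, position):
    """Model of [[position]]: 1-based lookup in every array that is long enough."""
    return [item[position - 1] for item in seq
            if isinstance(item, list) and 1 <= position <= len(item)]

# Made-up events, loosely shaped like the GitHub archive objects.
events = [
    {"type": "PushEvent",
     "payload": {"commits": [{"author": {"email": "a@example.com"}},
                             {"author": {"email": "b@example.com"}}]}},
    {"type": "WatchEvent", "payload": {}},
    {"type": "PushEvent",
     "payload": {"commits": [{"author": {"email": "a@example.com"}}]}},
]

# .payload.commits
commits = lookup(lookup(events, "payload"), "commits")
# .payload.commits[].author.email
all_emails = lookup(lookup(unbox(commits), "author"), "email")
# .payload.commits[[1]].author.email
first_emails = lookup(lookup(array_lookup(commits, 1), "author"), "email")
# .payload.commits[size($$) gt 1]
big_arrays = [array for array in commits if len(array) > 1]

# count(distinct-values(...)) corresponds to len(set(...)) here.
print(len(set(all_emails)), len(set(first_emails)), len(big_arrays))
```

Note how the WatchEvent, which has no commits, silently drops out of every result instead of raising an error. This is what makes chaining lookups over a heterogeneous dataset convenient.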

## JSONiq basics

### The real JSONiq Hello, World!

Wondering what a hello world program looks like in JSONiq? Here it is:

In [7]:
%%jsoniq
"Hello, World!"

Took: 0.05169677734375 ms
"Hello, World!"


Not surprisingly, it outputs the string "Hello, World!".

### Numbers and arithmetic operations

Okay, so, now, you might be thinking: "What is the use of this language if it just outputs what I put in?" Of course, JSONiq can do more than that, and still in a declarative way. Here is how it works with numbers:

In [8]:
%%jsoniq
2 + 2

Took: 0.06433320045471191 ms
4


In [9]:
%%jsoniq
 (38 + 2) div 2 + 11 * 2


Took: 0.12616300582885742 ms
42


Mind the division operator, which is the "div" keyword. The slash operator has different semantics.

Like JSON, JSONiq works with decimals and doubles:

In [10]:
%%jsoniq
 6.022e23 * 42

Took: 0.06836986541748047 ms
2.52924e+25


### Logical operations

JSONiq supports boolean operations.

In [57]:
%%jsoniq
true and false

Took: 0.006527900695800781 ms
false


In [58]:
%%jsoniq
(true or false) and (false or true)

Took: 0.007046222686767578 ms
true


The unary not is also available:

In [59]:
%%jsoniq
not true

Took: 0.006941080093383789 ms
false


### Strings

JSONiq is capable of manipulating strings as well, using functions:


In [60]:
%%jsoniq
concat("Hello ", "Captain ", "Kirk")

Took: 0.005676984786987305 ms
"Hello Captain Kirk"


In [61]:
%%jsoniq
substring("Mister Spock", 8, 5)

Took: 0.00574493408203125 ms
"Spock"


JSONiq comes with a rich string function library out of the box, inherited from its base language. These functions are listed [here](https://www.w3.org/TR/xpath-functions-30/) (you will also find many more there for numbers, dates, etc.).
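One detail worth keeping in mind with these string functions is that positions start at 1, not 0, as the substring call above shows. A rough Python equivalent, assuming integer arguments and a start position of at least 1 (the rules for other argument values are not modelled here), could look like this:

```python
def substring(text, start, length):
    """Simplified 1-based substring for integer arguments with start >= 1."""
    begin = start - 1              # convert the 1-based position to a 0-based index
    return text[begin:begin + length]

print(substring("Mister Spock", 8, 5))   # prints Spock
```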



### Sequences

Until now, we have only been working with single values (an object, an array, a number, a string, a boolean). JSONiq supports sequences of values. You can build a sequence using commas:


In [62]:
%%jsoniq
 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Took: 0.0066449642181396484 ms
1
2
3
4
5
6
7
8
9
10


In [63]:
%%jsoniq
1, true, 4.2e1, "Life"

Took: 0.00654292106628418 ms
1
true
42
"Life"


The "to" operator is very convenient, too:

In [64]:
%%jsoniq
 (1 to 100)

Took: 0.006345033645629883 ms
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


Some functions even work on sequences:

In [65]:
%%jsoniq
sum(1 to 100)

Took: 0.005728006362915039 ms
5050


In [66]:
%%jsoniq
string-join(("These", "are", "some", "words"), "-")

Took: 0.0058438777923583984 ms
"These-are-some-words"


In [67]:
%%jsoniq
count(10 to 20)

Took: 0.0066111087799072266 ms
11


In [68]:
%%jsoniq
avg(1 to 100)

Took: 0.005938053131103516 ms
50.5


Unlike arrays, sequences are flat. The sequence (3) is identical to the integer 3, and (1, (2, 3)) is identical to (1, 2, 3).
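The flattening rule can be pictured with a small Python model, in which a tuple plays the role of a sequence and a list the role of an array. The seq helper is hypothetical and only meant to illustrate the rule; it does not capture the fact that a singleton sequence and its single item are the same thing in JSONiq.

```python
def seq(*items):
    """Build a flat sequence: nested sequences (tuples) are spliced in, arrays (lists) are kept."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(seq(*item))   # splice the nested sequence into the outer one
        else:
            flat.append(item)         # any other item, arrays included, stays as it is
    return tuple(flat)

print(seq(1, seq(2, 3)))   # prints (1, 2, 3)
print(seq(1, [2, 3]))      # prints (1, [2, 3])
```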

## A bit more in depth

### Variables

You can bind a sequence of values to a (dollar-prefixed) variable, like so:

In [69]:
%%jsoniq
let $x := "Bearing 3 1 4 Mark 5. "
return concat($x, "Engage!")

Took: 0.007143735885620117 ms
"Bearing 3 1 4 Mark 5. Engage!"


In [70]:
%%jsoniq
let $x := ("Kirk", "Picard", "Sisko")
return string-join($x, " and ")

Took: 0.006165742874145508 ms
"Kirk and Picard and Sisko"


You can bind as many variables as you want:

In [71]:
%%jsoniq
let $x := 1
let $y := $x * 2
let $z := $y + $x
return ($x, $y, $z)

Took: 0.006880044937133789 ms
1
2
3


and even reuse the same name to hide formerly declared variables:

In [72]:
%%jsoniq
let $x := 1
let $x := $x + 2
let $x := $x + 3
return $x

Took: 0.006127119064331055 ms
6


### Iteration

In a way very similar to let, you can iterate over a sequence of values with the "for" keyword. Instead of binding the entire sequence to the variable, it binds each value of the sequence in turn to this variable.

In [73]:
%%jsoniq
for $i in 1 to 10
return $i * 2

Took: 0.006555080413818359 ms
2
4
6
8
10
12
14
16
18
20


More interestingly, you can combine fors and lets like so:

In [74]:
%%jsoniq
let $sequence := 1 to 10
for $value in $sequence
let $double := $value * 2
return $double

Took: 0.006516933441162109 ms
2
4
6
8
10
12
14
16
18
20


and even filter out some values:

In [75]:
%%jsoniq
let $sequence := 1 to 10
for $value in $sequence
let $double := $value * 2
where $double < 10
return $double

Took: 0.0077419281005859375 ms
2
4
6
8
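For comparison, the for, let, where and return clauses of the last query map fairly directly onto a Python generator. This is only an analogy to help read the query; the function and variable names are chosen for this sketch.

```python
def doubled_below_ten(sequence):
    for value in sequence:       # for clause: bind each value in turn
        doubled = value * 2      # let clause: bind a derived value
        if doubled < 10:         # where clause: keep only matching bindings
            yield doubled        # return clause: emit one result per binding

result = list(doubled_below_ten(range(1, 11)))   # range(1, 11) stands for 1 to 10
print(result)   # prints [2, 4, 6, 8]
```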


Note that you can only iterate over sequences, not arrays. To iterate over an array, you can obtain the sequence of its values with the [] operator, like so:


In [76]:
%%jsoniq
[1, 2, 3][]

Took: 0.006000041961669922 ms
1
2
3


### Conditions

You can make the output depend on a condition with an if-then-else construct:

In [77]:
%%jsoniq
for $x in 1 to 10
return if ($x < 5) then $x
                   else -$x

Took: 0.0064771175384521484 ms
1
2
3
4
-5
-6
-7
-8
-9
-10


Note that the else clause is required. However, it can be the empty sequence (), which is often what you need if only the then clause is relevant to you.

### Composability of Expressions

Now that you know of a couple of elementary JSONiq expressions, you can combine them in more elaborate expressions. For example, you can put any sequence of values in an array:

In [78]:
%%jsoniq
[ 1 to 10 ]

Took: 0.007096052169799805 ms
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Or you can dynamically compute the value of object pairs (or their key):

In [79]:
%%jsoniq
{
      "Greeting" : (let $d := "Mister Spock"
                    return concat("Hello, ", $d)),
      "Farewell" : string-join(("Live", "long", "and", "prosper"),
                               " ")
}

Took: 0.007810831069946289 ms
{"Greeting": "Hello, Mister Spock", "Farewell": "Live long and prosper"}


You can dynamically generate object singletons (with a single pair):


In [80]:
%%jsoniq
{ concat("Integer ", 2) : 2 * 2 }

Took: 0.006745100021362305 ms
{"Integer 2": 4}


and then merge lots of them into a new object with the {| |} notation:

In [81]:
%%jsoniq
{|
    for $i in 1 to 10
    return { concat("Square of ", $i) : $i * $i }
|}

Took: 0.006300926208496094 ms
{"Square of 1": 1, "Square of 2": 4, "Square of 3": 9, "Square of 4": 16, "Square of 5": 25, "Square of 6": 36, "Square of 7": 49, "Square of 8": 64, "Square of 9": 81, "Square of 10": 100}
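In Python terms, this is close to building many one-pair dictionaries and merging them into a single one. The sketch below is an analogy only. In particular, dict.update silently overwrites duplicate keys, which may differ from what the {| |} constructor does; the example has no duplicates.

```python
# One single-pair object per value of i, as in the for clause of the query.
singletons = [{f"Square of {i}": i * i} for i in range(1, 11)]

# Merge all pairs into one object.
merged = {}
for obj in singletons:
    merged.update(obj)

print(merged)
```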


## JSON Navigation

Up to now, you have learnt how to compose expressions so as to do some computations and to build objects and arrays. It also works the other way round: if you have some JSON data, you can access it and navigate.
All you need to know is that JSONiq views:
- an array as an ordered list of values,
- an object as a set of name/value pairs.


### Objects

You can use the dot operator to retrieve the value associated with a key. Quotes around the key are optional, unless the key contains special characters such as spaces:

In [82]:
%%jsoniq
let $person := {
    "first name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person."first name"

Took: 0.009386062622070312 ms
"Sarah"


You can also ask for all keys in an object:

In [83]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "keys" : [ keys($person)] }

Took: 0.00790095329284668 ms
{"keys": ["name", "age", "gender", "friends"]}


### Arrays

The [[]] operator retrieves the entry at the given position:

In [84]:
%%jsoniq
let $friends := [ "Jim", "Mary", "Jennifer"]
return $friends[[1+1]]

Took: 0.00620579719543457 ms
"Mary"


It is also possible to get the size of an array:

In [85]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "how many friends" : size($person.friends) }

Took: 0.006299018859863281 ms
{"how many friends": 3}


Finally, the [] operator returns all elements in an array, as a sequence:

In [86]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person.friends[]

Took: 0.0063228607177734375 ms
"Jim"
"Mary"
"Jennifer"


### Relational Algebra

Do you remember SQL's SELECT FROM WHERE statements? JSONiq inherits selection, projection and join capability from XQuery, too.

In [None]:
%%jsoniq
let $stores :=
[
    { "store number" : 1, "state" : "MA" },
    { "store number" : 2, "state" : "MA" },
    { "store number" : 3, "state" : "CA" },
    { "store number" : 4, "state" : "CA" }
]
let $sales := [
    { "product" : "broiler", "store number" : 1, "quantity" : 20  },
    { "product" : "toaster", "store number" : 2, "quantity" : 100 },
    { "product" : "toaster", "store number" : 2, "quantity" : 50 },
    { "product" : "toaster", "store number" : 3, "quantity" : 50 },
    { "product" : "blender", "store number" : 3, "quantity" : 100 },
    { "product" : "blender", "store number" : 3, "quantity" : 150 },
    { "product" : "socks", "store number" : 1, "quantity" : 500 },
    { "product" : "socks", "store number" : 2, "quantity" : 10 },
    { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
    for $store in $stores[], $sale in $sales[]
    where $store."store number" = $sale."store number"
    return {
        "nb" : $store."store number",
        "state" : $store.state,
        "sold" : $sale.product
    }
return [$join]
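Conceptually, this query pairs every store with every sale and keeps the pairs whose store numbers match. The same logic, written as a Python list comprehension, is shown below as a sketch of the semantics only; it says nothing about how RumbleDB actually evaluates the join. The sales data is shortened for brevity.

```python
stores = [
    {"store number": 1, "state": "MA"},
    {"store number": 2, "state": "MA"},
    {"store number": 3, "state": "CA"},
    {"store number": 4, "state": "CA"},
]
# Shortened version of the sales array used in the query.
sales = [
    {"product": "broiler", "store number": 1, "quantity": 20},
    {"product": "toaster", "store number": 2, "quantity": 100},
    {"product": "blender", "store number": 3, "quantity": 100},
    {"product": "socks", "store number": 1, "quantity": 500},
]

join = [
    {"nb": store["store number"], "state": store["state"], "sold": sale["product"]}
    for store in stores                                   # for $store in $stores[]
    for sale in sales                                     # $sale in $sales[]
    if store["store number"] == sale["store number"]      # where clause
]
print(join)
```

Store number 4 has no sales, so it does not appear in the result, exactly as in an inner join in SQL.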

### Access datasets

RumbleDB can read input from many file systems and many file formats. If you are using our backend, however, only the http scheme is activated, so you can only use json-doc() or json-file() with a URL pointing to a JSON or JSON Lines file, and then navigate it as you see fit.

You can read data from your local disk, from S3, from HDFS, and also from the Web. For this tutorial, we'll read from the Web because, well, we are already on the Web.

We have put a sample at http://rumbledb.org/samples/products-small.json that contains 100,000 small objects like:



In [8]:
%%jsoniq
json-file("http://rumbledb.org/samples/products-small.json", 10)[1]

Took: 5.183954954147339 ms
{"product": "blender", "store-number": 20, "quantity": 920}


The second parameter to json-file, 10, indicates to RumbleDB that it should organize the data in ten partitions after downloading it, and process it in parallel. If you were reading from HDFS or S3, the parallelization of these partitions would be pushed down to the distributed file system.

JSONiq supports the relational algebra. For example, you can do a selection with a where clause, like so:

In [12]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
return $product

Took: 5.105026006698608 ms
{"product": "toaster", "store-number": 97, "quantity": 997}
{"product": "phone", "store-number": 100, "quantity": 1000}
{"product": "tv", "store-number": 96, "quantity": 996}
{"product": "socks", "store-number": 99, "quantity": 999}
{"product": "shirt", "store-number": 95, "quantity": 995}
{"product": "toaster", "store-number": 98, "quantity": 998}
{"product": "tv", "store-number": 97, "quantity": 997}
{"product": "socks", "store-number": 100, "quantity": 1000}
{"product": "shirt", "store-number": 96, "quantity": 996}
{"product": "toaster", "store-number": 99, "quantity": 999}
{"product": "blender", "store-number": 95, "quantity": 995}
{"product": "tv", "store-number": 98, "quantity": 998}
{"product": "shirt", "store-number": 97, "quantity": 997}
{"product": "toaster", "store-number": 100, "quantity": 1000}
{"product": "blender", "store-number": 96, "quantity": 996}
{"product": "tv", "store-number": 99, "quantity": 999}
{"product": "broiler", "store-number": 

Note that, by default, only the first 200 items are shown. In a typical setup, the result of a query can be written to a distributed system, so it is also possible to output all the results if needed. Here, however, the results are printed on your screen, so it is more convenient not to materialize the entire sequence.

For a projection, there is project():

In [14]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
return project($product, ("store-number", "product"))

Took: 8.84467601776123 ms
{"store-number": 97, "product": "toaster"}
{"store-number": 100, "product": "phone"}
{"store-number": 96, "product": "tv"}
{"store-number": 99, "product": "socks"}
{"store-number": 95, "product": "shirt"}
{"store-number": 98, "product": "toaster"}
{"store-number": 97, "product": "tv"}
{"store-number": 100, "product": "socks"}
{"store-number": 96, "product": "shirt"}
{"store-number": 99, "product": "toaster"}
{"store-number": 95, "product": "blender"}
{"store-number": 98, "product": "tv"}
{"store-number": 97, "product": "shirt"}
{"store-number": 100, "product": "toaster"}
{"store-number": 96, "product": "blender"}
{"store-number": 99, "product": "tv"}
{"store-number": 95, "product": "broiler"}
{"store-number": 98, "product": "shirt"}
{"store-number": 97, "product": "blender"}
{"store-number": 100, "product": "tv"}
{"store-number": 96, "product": "broiler"}
{"store-number": 99, "product": "shirt"}
{"store-number": 95, "product": "phone"}
{"store-number": 98, "pr
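
For readers who think in Python, the selection and projection queries above can be read as a filter followed by a key-restricting copy. The following is only a rough sketch of the semantics, not how RumbleDB evaluates the query: the three rows are copied by hand from the outputs above, and the `project` helper is a hypothetical stand-in for JSONiq's project().

```python
# A tiny hand-picked sample; the real dataset has 100,000 objects.
products = [
    {"product": "toaster", "store-number": 97, "quantity": 997},
    {"product": "phone", "store-number": 100, "quantity": 1000},
    {"product": "blender", "store-number": 20, "quantity": 920},
]


def project(obj, keys):
    """Hypothetical helper: keep only the listed keys of one object."""
    return {k: obj[k] for k in obj if k in keys}


# Selection: the where clause becomes a filter condition.
selected = [p for p in products if p["quantity"] >= 995]

# Projection: keep two fields of each selected object.
projected = [project(p, ("store-number", "product")) for p in selected]

print(projected)
```

The blender row is dropped by the filter because its quantity is below 995, and the remaining objects lose their quantity field.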

You can also page the results (like OFFSET and LIMIT in SQL) with a count clause followed by a where clause:

In [15]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
count $c
where $c gt 10 and $c le 20
return project($product, ("store-number", "product"))

Took: 11.857532024383545 ms
{"store-number": 95, "product": "blender"}
{"store-number": 98, "product": "tv"}
{"store-number": 97, "product": "shirt"}
{"store-number": 100, "product": "toaster"}
{"store-number": 96, "product": "blender"}
{"store-number": 99, "product": "tv"}
{"store-number": 95, "product": "broiler"}
{"store-number": 98, "product": "shirt"}
{"store-number": 97, "product": "blender"}
{"store-number": 100, "product": "tv"}
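
The count clause numbers the items that reach it starting from 1, and the second where clause keeps a window of those numbers. As a rough Python analogy (a sketch only: the `items` list and the `page` function are made up for illustration), this corresponds to `enumerate` plus a range check:

```python
# Hypothetical stand-in for the filtered stream of products: 25 numbered items.
items = [{"id": i} for i in range(1, 26)]


def page(sequence, offset, limit):
    """Number the items from 1, as the count clause does, and keep one window."""
    return [
        item
        for c, item in enumerate(sequence, start=1)  # count $c
        if offset < c <= offset + limit  # where $c gt offset and $c le offset + limit
    ]


# offset=10, limit=10 mirrors "where $c gt 10 and $c le 20" in the query above.
second_page = page(items, offset=10, limit=10)
print([item["id"] for item in second_page])
```

With an offset of 10 and a limit of 10, the sketch keeps items 11 through 20, which is the same window the query above selects.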


JSONiq also supports grouping with a group by clause:

In [17]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
return {
    "store" : $store-number,
    "count" : count($product)
}

Took: 7.4556567668914795 ms
{"store": 64, "count": 1000}
{"store": 68, "count": 1000}
{"store": 42, "count": 1000}
{"store": 83, "count": 1000}
{"store": 54, "count": 1000}
{"store": 82, "count": 1000}
{"store": 96, "count": 1000}
{"store": 78, "count": 1000}
{"store": 41, "count": 1000}
{"store": 89, "count": 1000}
{"store": 62, "count": 1000}
{"store": 86, "count": 1000}
{"store": 58, "count": 1000}
{"store": 66, "count": 1000}
{"store": 70, "count": 1000}
{"store": 91, "count": 1000}
{"store": 100, "count": 1000}
{"store": 49, "count": 1000}
{"store": 14, "count": 1000}
{"store": 88, "count": 1000}
{"store": 97, "count": 1000}
{"store": 67, "count": 1000}
{"store": 15, "count": 1000}
{"store": 12, "count": 1000}
{"store": 4, "count": 1000}
{"store": 11, "count": 1000}
{"store": 74, "count": 1000}
{"store": 92, "count": 1000}
{"store": 5, "count": 1000}
{"store": 63, "count": 1000}
{"store": 19, "count": 1000}
{"store": 2, "count": 1000}
{"store": 10, "count": 1000}
{"store": 37, "co

JSONiq also supports ordering with an order by clause:

In [18]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "count" : count($product)
}

Took: 9.933311939239502 ms
{"store": 1, "count": 1000}
{"store": 2, "count": 1000}
{"store": 3, "count": 1000}
{"store": 4, "count": 1000}
{"store": 5, "count": 1000}
{"store": 6, "count": 1000}
{"store": 7, "count": 1000}
{"store": 8, "count": 1000}
{"store": 9, "count": 1000}
{"store": 10, "count": 1000}
{"store": 11, "count": 1000}
{"store": 12, "count": 1000}
{"store": 13, "count": 1000}
{"store": 14, "count": 1000}
{"store": 15, "count": 1000}
{"store": 16, "count": 1000}
{"store": 17, "count": 1000}
{"store": 18, "count": 1000}
{"store": 19, "count": 1000}
{"store": 20, "count": 1000}
{"store": 21, "count": 1000}
{"store": 22, "count": 1000}
{"store": 23, "count": 1000}
{"store": 24, "count": 1000}
{"store": 25, "count": 1000}
{"store": 26, "count": 1000}
{"store": 27, "count": 1000}
{"store": 28, "count": 1000}
{"store": 29, "count": 1000}
{"store": 30, "count": 1000}
{"store": 31, "count": 1000}
{"store": 32, "count": 1000}
{"store": 33, "count": 1000}
{"store": 34, "count": 10
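
To make the effect of the group by and order by clauses concrete, here is a rough Python sketch of the same logic. It is an analogy under simplifying assumptions, not a description of RumbleDB's execution: the four rows are invented sample data in the shape of the dataset, and everything is held in memory.

```python
from collections import defaultdict

# Invented sample rows; the real dataset has 1,000 objects per store.
products = [
    {"product": "toaster", "store-number": 2, "quantity": 100},
    {"product": "socks", "store-number": 1, "quantity": 500},
    {"product": "blender", "store-number": 2, "quantity": 150},
    {"product": "shirt", "store-number": 3, "quantity": 10},
]

# group by: collect the objects that share a store number.
groups = defaultdict(list)
for p in products:
    groups[p["store-number"]].append(p)

# order by ... ascending, then build one result object per group.
result = [
    {"store": store, "count": len(groups[store])}
    for store in sorted(groups)
]
print(result)
```

After grouping, the grouping key is a single value per group, while the name bound by the for clause refers to all the objects in that group. This is why count($product) in the query counts the objects of one store.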

JSONiq supports denormalized data, so you are not forced to aggregate after grouping. You can also nest data, like so:

In [19]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ distinct-values($product.product) ]
}

Took: 11.702539920806885 ms
{"store": 1, "products": ["shirt", "toaster", "phone", "blender", "tv", "socks", "broiler"]}
{"store": 2, "products": ["shirt", "toaster", "phone", "blender", "tv", "socks", "broiler"]}
{"store": 3, "products": ["shirt", "toaster", "phone", "blender", "tv", "socks", "broiler"]}
{"store": 4, "products": ["shirt", "toaster", "phone", "blender", "tv", "socks", "broiler"]}
{"store": 5, "products": ["shirt", "toaster", "phone", "blender", "tv", "socks", "broiler"]}
{"store": 6, "products": ["toaster", "phone", "blender", "tv", "socks", "broiler", "shirt"]}
{"store": 7, "products": ["toaster", "phone", "blender", "tv", "socks", "broiler", "shirt"]}
{"store": 8, "products": ["toaster", "phone", "blender", "tv", "socks", "broiler", "shirt"]}
{"store": 9, "products": ["toaster", "phone", "blender", "tv", "socks", "broiler", "shirt"]}
{"store": 10, "products": ["toaster", "phone", "blender", "tv", "socks", "broiler", "shirt"]}
{"store": 11, "products": ["phone", "blen

Or you can nest the first ten products of each store and aggregate the quantities at the same time:

In [25]:
%%jsoniq
for $product in json-file("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ project($product[position() le 10], ("product", "quantity")) ],
    "inventory" : sum($product.quantity)
}

Took: 13.3197660446167 ms
{"store": 1, "products": [{"product": "shirt", "quantity": 901}, {"product": "toaster", "quantity": 801}, {"product": "phone", "quantity": 701}, {"product": "blender", "quantity": 601}, {"product": "tv", "quantity": 501}, {"product": "socks", "quantity": 401}, {"product": "broiler", "quantity": 301}, {"product": "shirt", "quantity": 201}, {"product": "toaster", "quantity": 101}, {"product": "phone", "quantity": 1}], "inventory": 451000}
{"store": 2, "products": [{"product": "shirt", "quantity": 602}, {"product": "toaster", "quantity": 502}, {"product": "phone", "quantity": 402}, {"product": "blender", "quantity": 302}, {"product": "tv", "quantity": 202}, {"product": "socks", "quantity": 102}, {"product": "broiler", "quantity": 2}, {"product": "shirt", "quantity": 902}, {"product": "toaster", "quantity": 802}, {"product": "phone", "quantity": 702}], "inventory": 452000}
{"store": 3, "products": [{"product": "shirt", "quantity": 303}, {"product": "toaster", "qua
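
The two nesting queries above can be pictured in Python as building one object per group that holds both a nested list and an aggregate. This is a sketch with invented sample rows. Keeping distinct product names in first-seen order is an arbitrary choice made for this sketch, not a statement about the order distinct-values returns.

```python
from collections import defaultdict

# Invented sample rows in the shape of the dataset.
products = [
    {"product": "toaster", "store-number": 2, "quantity": 100},
    {"product": "broiler", "store-number": 1, "quantity": 20},
    {"product": "toaster", "store-number": 2, "quantity": 50},
    {"product": "socks", "store-number": 1, "quantity": 500},
    {"product": "socks", "store-number": 2, "quantity": 10},
]

# group by store number
groups = defaultdict(list)
for p in products:
    groups[p["store-number"]].append(p)

result = [
    {
        "store": store,
        # Nested array of distinct product names (first-seen order).
        "products": list(dict.fromkeys(p["product"] for p in groups[store])),
        # Aggregate computed over the same group.
        "inventory": sum(p["quantity"] for p in groups[store]),
    }
    for store in sorted(groups)
]
print(result)
```

Each result object carries the nested list next to the aggregated value, so no second pass or separate table is needed.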

That's it! You now know the basics of JSONiq. You can also download the RumbleDB jar and run it on your own laptop, or [on a Spark cluster, reading data from and writing data to HDFS](https://rumble.readthedocs.io/en/latest/Run%20on%20a%20cluster/), etc.
