# <center>RumbleDB sandbox</center>


This is a RumbleDB notebook that allows you to play with simple JSONiq queries.

It is a Jupyter notebook that you can also download and execute on your own machine, but if you arrived here from the RumbleDB website, it is likely to be shown within Google's Colab environment.

To get started, we first need to make sure Java 17 (or 21) is installed.

The following cell installs Java on Google Colab (or on Ubuntu generally), which is the setup that jsoniq.org and rumbledb.org link to. If you downloaded the notebook and run it on an operating system other than Linux, use the Java installation command for your operating system instead.

The warnings can be ignored.

In [1]:
# We make sure Java 17 is installed.
!apt update
!apt install openjdk-17-jdk


zsh:1: command not found: apt
zsh:1: command not found: apt


Now we check the Java version. It should report 17 or 21. If it does not, this needs to be fixed for the rest of the notebook to work.

In [2]:
!java -version

openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment Temurin-17.0.13+11 (build 17.0.13+11)
OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (build 17.0.13+11, mixed mode, sharing)


Now we install the jsoniq Python library.

In [5]:
!pip install jsoniq

Collecting jsoniq
  Downloading jsoniq-0.2.0a5-py3-none-any.whl.metadata (33 kB)
Downloading jsoniq-0.2.0a5-py3-none-any.whl (27.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.8/27.8 MB 8.7 MB/s eta 0:00:00
Installing collected packages: jsoniq
Successfully installed jsoniq-0.2.0a5

[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: pip install --upgrade pip


Here, the JSONiq queries are executed locally, on the machine that runs the notebook. Advanced users can also run a large Spark cluster and execute JSONiq queries on petabytes of data.

JSONiq queries can generally be evaluated in Python as follows. It is possible to provide as input Python dicts, lists, pandas DataFrames, and it is possible to retrieve the results as Python values, pandas DataFrames, etc.

In [1]:
from jsoniq import RumbleSession

rumble = RumbleSession.builder.withDelta().getOrCreate()

print(rumble.jsoniq('{ "foo": [ 6*7 ] }').json())

:: loading settings :: url = jar:file:/Users/ghislain/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/ghislain/.ivy2.5.2/cache
The jars for the packages stored in: /Users/ghislain/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-092980cc-9585-49dc-b763-c7d685f1a7f2;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.0 in local-m2-cache
	found io.delta#delta-storage;4.0.0 in local-m2-cache
	found org.antlr#antlr4-runtime;4.13.1 in local-m2-cache
:: resolution report :: resolve 65ms :: artifacts dl 2ms
	:: modules in use:
	io.delta#delta-spark_2.13;4.0.0 from local-m2-cache in [default]
	io.delta#delta-storage;4.0.0 from local-m2-cache in [default]
	org.antlr#antlr4-runtime;4.13.1 from local-m2-cache in [default]
	---------------------------------------------------------------------
	|         

({'foo': [42]},)


In [9]:
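import os
# Remove SPARK_HOME from the environment if it is set.
# pop() with a default avoids the KeyError that del raises when the variable is absent.
os.environ.pop('SPARK_HOME', None)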

However, for convenience, we activate the jsoniq magic. This means you can directly type the JSONiq queries in a notebook cell, without needing to wrap them inside rumble.jsoniq() calls. This makes it easier to read.

In [4]:
%load_ext jsoniqmagic
# We will display the first 5 items. This parameter can be changed.
rumble.getRumbleConf().setResultSizeCap(5)

JavaObject id=o56

In [6]:
%%jsoniq
{"foobar":1}

{
  "foobar": 1
}


## JSON

As explained on the [official JSON Web site](http://www.json.org/), JSON is a lightweight data-interchange format designed for humans as well as for computers. It supports as values:
- objects (string-to-value maps)
- arrays (ordered sequences of values)
- strings
- numbers
- booleans (true, false)
- null

JSONiq provides declarative querying and updating capabilities on JSON data.

## Elevator Pitch

JSONiq is based on XQuery, which is a W3C standard (like XML and HTML). XQuery is a very powerful declarative language that was originally designed to manipulate XML data, but it turns out that it is also a very good fit for manipulating JSON natively.
JSONiq, since it extends XQuery, is a very powerful general-purpose declarative programming language. Our experience is that, for the same task, you will probably write about 80% less code compared to imperative languages like JavaScript, Python or Ruby. Additionally, you get the benefits of strong type checking without actually having to write type declarations.
Here is an appetizer before we start the tutorial from scratch.


In [7]:
%%jsoniq

let $stores :=
[
  { "store number" : 1, "state" : "MA" },
  { "store number" : 2, "state" : "MA" },
  { "store number" : 3, "state" : "CA" },
  { "store number" : 4, "state" : "CA" }
]
let $sales := [
   { "product" : "broiler", "store number" : 1, "quantity" : 20  },
   { "product" : "toaster", "store number" : 2, "quantity" : 100 },
   { "product" : "toaster", "store number" : 2, "quantity" : 50 },
   { "product" : "toaster", "store number" : 3, "quantity" : 50 },
   { "product" : "blender", "store number" : 3, "quantity" : 100 },
   { "product" : "blender", "store number" : 3, "quantity" : 150 },
   { "product" : "socks", "store number" : 1, "quantity" : 500 },
   { "product" : "socks", "store number" : 2, "quantity" : 10 },
   { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
  for $store in $stores[], $sale in $sales[]
  where $store."store number" = $sale."store number"
  return {
    "nb" : $store."store number",
    "state" : $store.state,
    "sold" : $sale.product
  }
return [$join]



[
  {
    "nb": 1,
    "state": "MA",
    "sold": "broiler"
  },
  {
    "nb": 1,
    "state": "MA",
    "sold": "socks"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "toaster"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "toaster"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "socks"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "toaster"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "blender"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "blender"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "shirt"
  }
]
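
For readers coming from Python, the join above can be compared with a nested list comprehension. This is only a rough pure-Python analogue of what the query computes, not how RumbleDB evaluates it; the variable names simply mirror those of the query.

```python
# Pure-Python analogue of the JSONiq join above (illustrative only).
stores = [
    {"store number": 1, "state": "MA"},
    {"store number": 2, "state": "MA"},
    {"store number": 3, "state": "CA"},
    {"store number": 4, "state": "CA"},
]
sales = [
    {"product": "broiler", "store number": 1, "quantity": 20},
    {"product": "toaster", "store number": 2, "quantity": 100},
    {"product": "toaster", "store number": 2, "quantity": 50},
    {"product": "toaster", "store number": 3, "quantity": 50},
    {"product": "blender", "store number": 3, "quantity": 100},
    {"product": "blender", "store number": 3, "quantity": 150},
    {"product": "socks", "store number": 1, "quantity": 500},
    {"product": "socks", "store number": 2, "quantity": 10},
    {"product": "shirt", "store number": 3, "quantity": 10},
]

# The two "for" parts iterate over every (store, sale) pair,
# the "if" plays the role of the where clause,
# and the dict literal plays the role of the return clause.
join = [
    {"nb": store["store number"], "state": store["state"], "sold": sale["product"]}
    for store in stores
    for sale in sales
    if store["store number"] == sale["store number"]
]

print(len(join))
print(join[0])
```

Store number 4 has no sales, so it does not appear in the result, exactly as in the query output.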


## All JSON values are JSONiq, too

The first thing you need to know is that a well-formed JSON document is a JSONiq expression as well.
This means that you can copy-and-paste any JSON document into a query. The following are JSONiq queries that are "idempotent" (they just output themselves):

In [8]:
%%jsoniq
{ "pi" : 3.14, "sq2" : 1.4 }

{
  "pi": 3.14,
  "sq2": 1.4
}


In [9]:
%%jsoniq
[ 2, 3, 5, 7, 11, 13 ]

[
  2,
  3,
  5,
  7,
  11,
  13
]


In [10]:
%%jsoniq
{
      "operations" : [
        { "binary" : [ "and", "or"] },
        { "unary" : ["not"] }
      ],
      "bits" : [
        0, 1
      ]
    }

{
  "operations": [
    {
      "binary": [
        "and",
        "or"
      ]
    },
    {
      "unary": [
        "not"
      ]
    }
  ],
  "bits": [
    0,
    1
  ]
}


In [11]:
%%jsoniq
[ { "Question" : "Ultimate" }, ["Life", "the universe", "and everything"] ]

[
  {
    "Question": "Ultimate"
  },
  [
    "Life",
    "the universe",
    "and everything"
  ]
]


This works with objects, arrays (even nested), strings, numbers, booleans, null.

It also works the other way round: if your query outputs an object, you can use it as a JSON document.
JSONiq is a declarative language. This means that you only need to say what you want - the compiler will take care of the how. 

In the above queries, you are basically saying: I want to output this JSON content, and here it is.

In fact, JSONiq makes JSON "dynamic": try replacing numbers with arithmetic formulas, keys with concatenations of strings, etc., and see how the resulting JSON object is automatically created.

In [12]:
%%jsoniq
{
    "foo" : 2 + 2,
    "foo" || "bar" : if(2 gt 1) then true else false
}

{
  "foo": 4,
  "foobar": true
}


## Navigating an existing JSON dataset

Next, let us look at an existing dataset on the Web. We picked a [GitHub archive file](https://gharchive.org)
that we stored for convenience at this location: https://www.rumbledb.org/samples/git-archive.json.

Accessing a JSON dataset can be done in two ways depending on the exact format:

- If this is a file that contains a single JSON object spread over multiple lines, use json-doc(URL).
- If this is a file that contains one JSON object per line (JSON Lines), use json-lines(URL).
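
The difference between the two formats can be sketched with Python's standard json module. The two inline strings below are made-up samples standing in for files; they are not taken from the dataset.

```python
import json

# Hypothetical sample: a single JSON object spread over multiple lines.
single_document = """{
  "name": "example",
  "values": [1, 2, 3]
}"""

# Hypothetical sample: one JSON object per line (JSON Lines).
json_lines_text = '{"id": 1, "type": "PushEvent"}\n{"id": 2, "type": "WatchEvent"}\n'

# A single document is parsed in one go.
document = json.loads(single_document)

# JSON Lines is parsed line by line, yielding a sequence of objects.
records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]

print(document["values"])
print(len(records))
```

Because each line of a JSON Lines file is an independent object, such files can be split and processed in parallel, which is why this format suits large datasets.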

The GitHub archive dataset is in the JSON Lines format, so we open it with json-lines.

In [13]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json")

The query output 500 items, which is too many to display. Displaying the first 5 items:
{
  "id": "7045118886",
  "type": "PushEvent",
  "actor": {
    "id": 20090775,
    "login": "lainrose",
    "display_login": "lainrose",
    "gravatar_id": "",
    "url": "https://api.github.com/users/lainrose",
    "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"
  },
  "repo": {
    "id": 115387592,
    "name": "lainrose/Python-Grammar",
    "url": "https://api.github.com/repos/lainrose/Python-Grammar"
  },
  "payload": {
    "push_id": 2226161589,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
    "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208",
    "commits": [
      {
        "sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
        "author": {
          "name": "lainrose",
          "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"
        },
        "message": "Update Study Cont

This is a large file and the previous query output 500 JSON objects. To look closer, let us start looking at just the first object with a number predicate.

In [14]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json")[1]

{
  "id": "7045118886",
  "type": "PushEvent",
  "actor": {
    "id": 20090775,
    "login": "lainrose",
    "display_login": "lainrose",
    "gravatar_id": "",
    "url": "https://api.github.com/users/lainrose",
    "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"
  },
  "repo": {
    "id": 115387592,
    "name": "lainrose/Python-Grammar",
    "url": "https://api.github.com/repos/lainrose/Python-Grammar"
  },
  "payload": {
    "push_id": 2226161589,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
    "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208",
    "commits": [
      {
        "sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
        "author": {
          "name": "lainrose",
          "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"
        },
        "message": "Update Study Contents",
        "distinct": true,
        "url": "https://api.github.com/repos/lainrose/P

We can see that there are nested objects and arrays. This is perfect for JSONiq. Let us now figure out all the keys used in this dataset with the keys() function.

In [15]:
%%jsoniq
keys(json-lines("http://www.rumbledb.org/samples/git-archive-small.json"))

The query output 8 items, which is too many to display. Displaying the first 5 items:
"payload"
"org"
"public"
"repo"
"type"


Let us look closer at the key called "type". What values does it take? We can use dot-based navigation to navigate down to these values. This will work nicely on the entire dataset.

In [16]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").type

The query output 500 items, which is too many to display. Displaying the first 5 items:
"PushEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"WatchEvent"


It looks like there are a lot of duplicates in there. Let us use distinct-values() to figure out all unique values.

In [17]:
%%jsoniq
distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").type)

The query output 13 items, which is too many to display. Displaying the first 5 items:
"CommitCommentEvent"
"GollumEvent"
"CreateEvent"
"WatchEvent"
"IssuesEvent"


So we see that for the key "type", all values are strings and there are only... how many, by the way? Let us use count().

In [18]:
%%jsoniq
count(distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").type))

13


So there are 13. Note that count() works just as well on the entire dataset, to know how many objects there are.

In [19]:
%%jsoniq
count(json-lines("http://www.rumbledb.org/samples/git-archive-small.json"))

500
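
To make the semantics of these steps concrete, here is a small pure-Python sketch of the dot lookup applied to a whole sequence, followed by distinct-values() and count(). The miniature events list and the helper names lookup and distinct_values are invented for this illustration; the order in which distinct-values() returns its results in JSONiq may differ from the first-seen order used here.

```python
# Hypothetical miniature dataset; the real one contains 500 events.
events = [
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"type": "WatchEvent", "actor": {"login": "bob"}},
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"id": "an object without a type"},
]

def lookup(sequence, key):
    """Mimic the dot lookup on a sequence: collect the value of `key`
    from every object that has it; other items contribute nothing."""
    return [item[key] for item in sequence if isinstance(item, dict) and key in item]

def distinct_values(sequence):
    """Mimic distinct-values(), keeping the first occurrence of each value."""
    return list(dict.fromkeys(sequence))

types = lookup(events, "type")
print(types)
print(distinct_values(types))
print(len(distinct_values(types)))

# Lookups can be chained, like .actor.login in JSONiq.
print(lookup(lookup(events, "actor"), "login"))
```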


Let us now look at nested objects. It seems the key "actor" has these, so let us now use the dot object lookup to find all these values.

In [20]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").actor

The query output 500 items, which is too many to display. Displaying the first 5 items:
{
  "id": 20090775,
  "login": "lainrose",
  "display_login": "lainrose",
  "gravatar_id": "",
  "url": "https://api.github.com/users/lainrose",
  "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"
}
{
  "id": 17426563,
  "login": "tumhopaasmere",
  "display_login": "tumhopaasmere",
  "gravatar_id": "",
  "url": "https://api.github.com/users/tumhopaasmere",
  "avatar_url": "https://avatars.githubusercontent.com/u/17426563?"
}
{
  "id": 1449578,
  "login": "daa84",
  "display_login": "daa84",
  "gravatar_id": "",
  "url": "https://api.github.com/users/daa84",
  "avatar_url": "https://avatars.githubusercontent.com/u/1449578?"
}
{
  "id": 22536460,
  "login": "thautwarm",
  "display_login": "thautwarm",
  "gravatar_id": "",
  "url": "https://api.github.com/users/thautwarm",
  "avatar_url": "https://avatars.githubusercontent.com/u/22536460?"
}
{
  "id": 18603467,
  "login": "markstachowsk

We can chain dot object lookups to navigate further down, for example to logins. Let us figure out how many distinct logins there are.

In [21]:
%%jsoniq
count(distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").actor.login))

374


The id field inside the actor object seems to be an integer. What is the highest value? The max() function also works at large scales, just like count() and also min(), avg() and sum().

In [22]:
%%jsoniq
max(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").actor.id)

35003609


Alright, let us now look for nested arrays. There do not seem to be any inside the actor object, so let us try the key "payload". Let us just look at the first one.

In [23]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload

{
  "push_id": 2226161589,
  "size": 1,
  "distinct_size": 1,
  "ref": "refs/heads/master",
  "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
  "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208",
  "commits": [
    {
      "sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
      "author": {
        "name": "lainrose",
        "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"
      },
      "message": "Update Study Contents",
      "distinct": true,
      "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"
    }
  ]
}


Here we see that there is a nested array associated with key "commits".

In [24]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload.commits

[
  {
    "sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
    "author": {
      "name": "lainrose",
      "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"
    },
    "message": "Update Study Contents",
    "distinct": true,
    "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"
  }
]


In this case, there is only one object in this array. Is there, by any chance, any one of these arrays that has more than one commit? For this, we can use a Boolean predicate. Let us evaluate the predicate

size($$) gt 1

which uses the size function and the gt (greater than) comparison and where $$ is the current array being tested.

In [25]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1]

The query output 30 items, which is too many to display. Displaying the first 5 items:
[
  {
    "sha": "95e600df9a5a669f53dc7de28147814678d12e81",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },
    "message": "Get days/tasks via JSONAPI",
    "distinct": true,
    "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"
  },
  {
    "sha": "d348f84df64c5473ba6a95a108e7c0263a434add",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },
    "message": "Update tests",
    "distinct": true,
    "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"
  },
  {
    "sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },


Let us just take the first one to have more visibility.

In [26]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1]

[
  {
    "sha": "95e600df9a5a669f53dc7de28147814678d12e81",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },
    "message": "Get days/tasks via JSONAPI",
    "distinct": true,
    "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"
  },
  {
    "sha": "d348f84df64c5473ba6a95a108e7c0263a434add",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },
    "message": "Update tests",
    "distinct": true,
    "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"
  },
  {
    "sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd",
    "author": {
      "name": "Phil Gengler",
      "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
    },
    "message": "Migrate to unified List model",
    "distinct": true,
    "url": "https

We can expand it to a sequence of objects using the [] array unboxing syntax.

In [27]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][]

{
  "sha": "95e600df9a5a669f53dc7de28147814678d12e81",
  "author": {
    "name": "Phil Gengler",
    "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
  },
  "message": "Get days/tasks via JSONAPI",
  "distinct": true,
  "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"
}
{
  "sha": "d348f84df64c5473ba6a95a108e7c0263a434add",
  "author": {
    "name": "Phil Gengler",
    "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
  },
  "message": "Update tests",
  "distinct": true,
  "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"
}
{
  "sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd",
  "author": {
    "name": "Phil Gengler",
    "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
  },
  "message": "Migrate to unified List model",
  "distinct": true,
  "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c

We can also look up a specific position, say, the second object, with the [[ ]] array lookup syntax.

In [28]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][[2]]

{
  "sha": "d348f84df64c5473ba6a95a108e7c0263a434add",
  "author": {
    "name": "Phil Gengler",
    "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"
  },
  "message": "Update tests",
  "distinct": true,
  "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"
}


And now, please hold for something awesome. We can unbox all arrays of the entire collection at the same time by just using the [] syntax on the entire dataset.

In [29]:
%%jsoniq
json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[]

The query output 422 items, which is too many to display. Displaying the first 5 items:
{
  "sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a",
  "author": {
    "name": "lainrose",
    "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"
  },
  "message": "Update Study Contents",
  "distinct": true,
  "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"
}
{
  "sha": "45b2f857540d7d4286d1abef204aef167190be0f",
  "author": {
    "name": "tumhopaasmere",
    "email": "bcc6c59276ad7bbcd0b972dd58baaef7cccc22d4@mailinator.com"
  },
  "message": "GIT CloneShare Commit",
  "distinct": true,
  "url": "https://api.github.com/repos/tumhopaasmere/tumhopaasmere/commits/45b2f857540d7d4286d1abef204aef167190be0f"
}
{
  "sha": "ea291a9baea441ea815e822bba5e8c9f330542f7",
  "author": {
    "name": "thautwarm",
    "email": "820a7b45b87f3c40f5e1c273015816c9c19a8401@outlook.com"
  },
  "message": "API overview and example",
  "distinct": tr

These are objects. It is all too tempting to navigate further down with more dot object-lookup syntax. All at the same time, obviously. Let us figure out how many unique emails there are in all commits of all events.

In [30]:
%%jsoniq
count(distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[].author.email))

256


Now, how many unique emails are there if we only consider the first commit of each event?

In [31]:
%%jsoniq
count(distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[[1]].author.email))

233
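
The difference between array unboxing, array lookup and a Boolean predicate can be sketched in plain Python. The commit_arrays data and the helper names unbox and member_at are made up for this illustration and only approximate the JSONiq semantics.

```python
# Hypothetical commit arrays, one per event.
commit_arrays = [
    [{"author": {"email": "a@example.com"}}],
    [{"author": {"email": "b@example.com"}}, {"author": {"email": "a@example.com"}}],
    [],
]

def unbox(arrays):
    """Mimic []: splice the members of every array into one flat sequence."""
    return [member for array in arrays for member in array]

def member_at(arrays, position):
    """Mimic [[n]]: take the n-th member (1-based) of each array,
    skipping arrays that are too short."""
    return [array[position - 1] for array in arrays if len(array) >= position]

# Mimic the Boolean predicate [size($$) gt 1]: keep arrays with more than one member.
larger = [array for array in commit_arrays if len(array) > 1]

all_emails = {commit["author"]["email"] for commit in unbox(commit_arrays)}
first_emails = {commit["author"]["email"] for commit in member_at(commit_arrays, 1)}

print(len(unbox(commit_arrays)), len(member_at(commit_arrays, 1)), len(larger))
print(len(all_emails), len(first_emails))
```

Unboxing visits every commit of every event, whereas the lookup only visits one commit per event, which is why the second count in the notebook is smaller than the first.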


You have now learned how to navigate large JSON datasets with the dot object lookup syntax, the [] array unboxing syntax, the [[ ]] array lookup syntax, number predicates, and Boolean predicates.

All of these work nicely on very large sequences, and you can chain them arbitrarily. In fact, this will all happen in parallel on the cores of your machine or even on a large cluster.

You also saw how to aggregate large sequences of values with min, max, count, avg and sum.

Finally, you saw how to eliminate duplicates with distinct-values.

## Variables

Some of the queries seen previously involve several chained lookups and function calls. This can become complex:

In [32]:
%%jsoniq
count(distinct-values(json-lines("http://www.rumbledb.org/samples/git-archive-small.json").actor.login))

374


It is then a natural thing to use variables to store intermediate results. This can be achieved with a series of let clauses. The final result is then put in a return clause.

In [33]:
%%jsoniq
let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
let $events := json-lines($path)
let $actors := $events.actor
let $logins := $actors.login
let $distinct-logins := distinct-values($logins)
return count($distinct-logins)

374


Note that type annotations are not required, but they do exist! It is possible to add a static type to each variable.
Since values can be sequences, you can add suffixes for cardinality: * for a sequence of arbitrary length, ? for zero or one item, + for one or more items.

In [34]:
%%jsoniq
let $path as string := "http://www.rumbledb.org/samples/git-archive-small.json"
let $events as object* := json-lines($path)
let $actors as object* := $events.actor
let $logins as string* := $actors.login
let $distinct-logins as string* := distinct-values($logins)
let $count as integer := count($distinct-logins)
return $count

374


As you can see, variables can be used to store single items, as well as enormous sequences. RumbleDB will automatically select the best way to evaluate your query.

Note that it is possible to reuse variable names. However, these are not assignments: these are bindings. Reusing a variable name hides the previous binding.

In [35]:
%%jsoniq
let $v as string := "http://www.rumbledb.org/samples/git-archive-small.json"
let $v as object* := json-lines($v)
let $v as object* := $v.actor
let $v as string* := $v.login
let $v as string* := distinct-values($v)
let $v as integer := count($v)
return $v

374


## Iteration

It is possible to iterate on the elements in a sequence, like so:

In [36]:
%%jsoniq
for $i in 1 to 10
return $i * 2

The query output 10 items, which is too many to display. Displaying the first 5 items:
2
4
6
8
10


The sequence to iterate over can itself come from a dataset, such as the one we were using previously:

In [37]:
%%jsoniq
for $event in json-lines("http://www.rumbledb.org/samples/git-archive-small.json")
return size($event.payload.commits)

The query output 300 items, which is too many to display. Displaying the first 5 items:
1
1
1
0
4


For clauses can be mixed with let clauses:

In [38]:
%%jsoniq
let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-lines($path)
let $commits := $event.payload.commits
return size($commits)

The query output 300 items, which is too many to display. Displaying the first 5 items:
1
1
1
0
4


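And the results can also be nested in a more complex query: for example, an aggregation function such as max() can be wrapped around the whole expression. To compute the max of all these array sizes, the for clause would iterate over json-lines($path) and return size($event.payload.commits). The cell below is a reduced version that only visits the first event (selected with subsequence) and returns the constant 1 for it, so the result is 1.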

In [39]:
%%jsoniq
max(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in subsequence(json-lines($path),1,1)
  return 1
)

1


A third kind of clause is the where clause: it allows you to filter events. Let us only keep those with more than 10 commits, and count them.

In [40]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-lines($path)
  let $commits := $event.payload.commits
  where size($commits) gt 10
  return $event
)

6
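
The for, let, where and return clauses map quite directly onto a Python loop. The sketch below uses a made-up events list and is only an analogue of the query above. It assumes that an event without a commits array contributes nothing when its size is requested, which matches the empty-sequence behaviour described for lookups.

```python
# Hypothetical miniature dataset.
events = [
    {"payload": {"commits": [{"sha": "c1"}]}},
    {"payload": {"commits": []}},
    {"payload": {"commits": [{"sha": f"c{i}"} for i in range(12)]}},
    {"payload": {}},  # an event without a commits array
]

matching = []
for event in events:                                    # for $event in ...
    commits = event.get("payload", {}).get("commits")   # let $commits := ...
    if commits is not None and len(commits) > 10:       # where size($commits) gt 10
        matching.append(event)                          # return $event

print(len(matching))                                    # count(...)

# The sizes of all commit arrays, and their maximum.
sizes = [
    len(event["payload"]["commits"])
    for event in events
    if "commits" in event.get("payload", {})
]
print(sizes, max(sizes))
```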


## Simple calculations

Let us now look more closely at arithmetic, comparison and logic expressions. They are particularly useful in a where clause or in a Boolean predicate, but they can be used just about anywhere, as this is a functional language.

### Arithmetic

JSONiq works like a calculator and can do arithmetic with the four basic operations.

In [41]:
%%jsoniq
 (38 + 2) div 2 + 11 * 2


42


Mind the division operator, which is the div keyword. The slash operator has different semantics.

Like JSON, JSONiq works with decimals and doubles:

In [42]:
%%jsoniq
 6.022e23 * 42

2.52924e+25


JSONiq also supports modulo and integer division, and has a rich function library (trigonometry, logarithms, exponentials, powers, etc.).

### Comparison

Values (numbers, strings, dates, etc) can be compared with the binary operators eq, ne, gt, ge, lt and le.
Let us replace the comparison used in the where clause with other kinds.

In [43]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-lines($path)
  let $commits := $event.payload.commits
  where size($commits) gt 10
  return $event
)

6


In [44]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-lines($path)
  let $commits := $event.payload.commits
  where size($commits) eq 10
  return $event
)

0


In [45]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-lines($path)
  let $commits := $event.payload.commits
  where size($commits) ne 10
  return $event
)

300


In [46]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-lines($path)
  let $commits := $event.payload.commits
  where size($commits) le 10
  return $event
)

294


Why not =, < or >=? Because these operators are more powerful: they implicitly perform an existential quantification over their operands. The comparison is true if at least one pair of items, one taken from each side, satisfies it.

In [47]:
%%jsoniq
1 to 10 = 5

true


In [48]:
%%jsoniq
1 to 10 > 11 to 20

false
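
The two results above can be reproduced with a short Python sketch of existential quantification. The function names general_eq and general_gt are invented for this illustration.

```python
def general_eq(left, right):
    """True if at least one item on the left equals at least one item on the right."""
    return any(a == b for a in left for b in right)

def general_gt(left, right):
    """True if at least one item on the left is greater than at least one item on the right."""
    return any(a > b for a in left for b in right)

print(general_eq(range(1, 11), [5]))             # analogue of: 1 to 10 = 5
print(general_gt(range(1, 11), range(11, 21)))   # analogue of: 1 to 10 > 11 to 20
```

The second comparison is false because no number between 1 and 10 is greater than any number between 11 and 20.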


### Logical operations

JSONiq supports Boolean logic.

In [49]:
%%jsoniq
true and false

false


In [50]:
%%jsoniq
(true or false) and (false or true)

true


The unary not is also available:

In [51]:
%%jsoniq
not true

false


Note that JSONiq, unlike SQL, uses two-valued logic. Nulls are automatically converted to false.

In [52]:
%%jsoniq
null and true

false


Some non-Booleans can also get converted. For example, non-empty strings are converted to true and empty strings to false.

In [53]:
%%jsoniq
not ""

true


In [54]:
%%jsoniq
not "non empty"

false


Zero is converted to false, non-zero numbers to true.

In [55]:
%%jsoniq
not 0

true


In [56]:
%%jsoniq
not 1e10

false
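
The conversions shown in this section can be summarized in a small Python function. This is a simplified sketch: the name effective_boolean_value is chosen here for illustration, and it handles single values only (null, Booleans, strings and numbers), not sequences or other types.

```python
def effective_boolean_value(value):
    """Simplified sketch of how a single value is converted to a Boolean."""
    if value is None:                      # JSON null is converted to false
        return False
    if isinstance(value, bool):            # Booleans are kept as they are
        return value
    if isinstance(value, str):             # only the empty string is false
        return len(value) > 0
    if isinstance(value, (int, float)):    # zero (and NaN) are false
        return value == value and value != 0
    raise TypeError("this sketch does not cover values of type " + type(value).__name__)

print(not effective_boolean_value(""))           # analogue of: not ""
print(not effective_boolean_value("non empty"))  # analogue of: not "non empty"
print(not effective_boolean_value(0))            # analogue of: not 0
print(not effective_boolean_value(1e10))         # analogue of: not 1e10
print(effective_boolean_value(None) and True)    # analogue of: null and true
```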


### Strings

JSONiq is capable of manipulating strings as well, using functions:


In [57]:
%%jsoniq
concat("Hello ", "Captain ", "Kirk")

"Hello Captain Kirk"


In [58]:
%%jsoniq
substring("Mister Spock", 8, 5)

"Spock"


JSONiq comes with a rich string function library out of the box, inherited from its base language. These functions are listed [here](https://www.w3.org/TR/xpath-functions-30/) (actually, you will find many more for numbers, dates, etc.).
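
One detail worth noticing in the substring() call above is that positions start at 1, not at 0 as in Python. The sketch below mimics this with Python slicing; it assumes an integer start of at least 1 and a non-negative length, and ignores the other cases that the real function handles.

```python
def substring(text, start, length):
    """Simplified analogue of substring(): `start` is 1-based."""
    begin = start - 1
    return text[begin:begin + length]

def concat(*parts):
    """Simplified analogue of concat() for string arguments."""
    return "".join(parts)

print(substring("Mister Spock", 8, 5))
print(concat("Hello ", "Captain ", "Kirk"))
```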



### Sequences

Until now, we have only been working with single values (an object, an array, a number, a string, a boolean). JSONiq supports sequences of values. You can build a sequence using commas:


In [59]:
%%jsoniq
 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The query output 10 items, which is too many to display. Displaying the first 5 items:
1
2
3
4
5


In [60]:
%%jsoniq
1, true, 4.2e1, "Life"

1
true
42
"Life"


The "to" operator is very convenient, too:

In [61]:
%%jsoniq -df -pdf -j -t
 (1 to 100000000)[position() le 100]

                                                                                

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   16|
|   17|
|   18|
|   19|
|   20|
+-----+
only showing top 20 rows


                                                                                

    value
0       1
1       2
2       3
3       4
4       5
..    ...
95     96
96     97
97     98
98     99
99    100

[100 rows x 1 columns]




The query output 100 items, which is too many to display. Displaying the first 5 items:
1
2
3
4
5
Response time: 24.965473890304565 ms


                                                                                

Some functions even work on sequences:

In [62]:
%%jsoniq
sum(1 to 100)

5050


In [63]:
%%jsoniq
string-join(("These", "are", "some", "words"), "-")

"These-are-some-words"


In [64]:
%%jsoniq
count(10 to 20)

11


In [65]:
%%jsoniq
avg(1 to 100)

50.5


Unlike arrays, sequences are flat. The sequence (3) is identical to the integer 3, and (1, (2, 3)) is identical to (1, 2, 3).
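
Flatness can be illustrated with a Python sketch in which tuples stand for sequences and lists stand for arrays. The helper name sequence is invented for this illustration. Python cannot express that a one-item sequence is the same thing as the item itself, so that part of the analogy is left out.

```python
def sequence(*items):
    """Build a flat sequence: nested tuples (sequences) are spliced in,
    while lists (arrays) are kept as single members."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(sequence(*item))
        else:
            flat.append(item)
    return tuple(flat)

print(sequence(1, (2, 3)))        # analogue of (1, (2, 3))
print(sequence(1, [2, 3]))        # analogue of (1, [2, 3]): the array stays nested
print(sequence((), 1, ((2,), 3)))
```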

You can iterate over a sequence and even filter out some values:

In [66]:
%%jsoniq
let $sequence := 1 to 10
for $value in $sequence
let $double := $value * 2
where $double < 10
return $double

2
4
6
8


Note that you can only iterate over sequences, not arrays. To iterate over an array, you can obtain the sequence of its values with the [] operator, like so:


In [67]:
%%jsoniq
[1, 2, 3][]

1
2
3


### Conditions

You can make the output depend on a condition with an if-then-else construct:

In [68]:
%%jsoniq
for $x in 1 to 10
return if ($x < 5) then $x
                   else -$x

The query output 10 items, which is too many to display. Displaying the first 5 items:
1
2
3
4
-5


Note that the else clause is required. However, it can be the empty sequence (), which is often what you need when only the then clause is relevant to you.
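
The Python sketch below illustrates why an empty else branch is useful: because sequences are flat, an iteration that returns () simply contributes nothing to the result. Tuples stand in for sequences here, which is a modelling assumption made for illustration.

```python
def branch(x):
    # Mirrors: if ($x < 5) then $x else ()
    return (x,) if x < 5 else ()


# Concatenating the per-iteration results drops the empty ones.
result = tuple(item for x in range(1, 11) for item in branch(x))

print(result)  # (1, 2, 3, 4)
```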

### Composability of Expressions

Now that you know of a couple of elementary JSONiq expressions, you can combine them in more elaborate expressions. For example, you can put any sequence of values in an array:

In [69]:
%%jsoniq
[ 1 to 10 ]

[
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10
]


Or you can dynamically compute the value of object pairs (or their key):

In [70]:
%%jsoniq
{
      "Greeting" : (let $d := "Mister Spock"
                    return concat("Hello, ", $d)),
      "Farewell" : string-join(("Live", "long", "and", "prosper"),
                               " ")
}

{
  "Greeting": "Hello, Mister Spock",
  "Farewell": "Live long and prosper"
}


You can dynamically generate object singletons (with a single pair):


In [71]:
%%jsoniq
{ concat("Integer ", 2) : 2 * 2 }

{
  "Integer 2": 4
}


and then merge lots of them into a new object with the {| |} notation:

In [72]:
%%jsoniq
{|
    for $i in 1 to 10
    return { concat("Square of ", $i) : $i * $i }
|}

{
  "Square of 1": 1,
  "Square of 2": 4,
  "Square of 3": 9,
  "Square of 4": 16,
  "Square of 5": 25,
  "Square of 6": 36,
  "Square of 7": 49,
  "Square of 8": 64,
  "Square of 9": 81,
  "Square of 10": 100
}
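
As a point of comparison, the same construction can be sketched in Python by building one-pair dictionaries and merging them. This is only an analogy: the sketch does not model how JSONiq treats duplicate keys during a merge, whereas `dict.update` silently overwrites them.

```python
# One singleton object per integer, as in the return clause above.
singletons = [{f"Square of {i}": i * i} for i in range(1, 11)]

# Merge them into a single object, as {| |} does.
merged = {}
for obj in singletons:
    merged.update(obj)

print(merged["Square of 7"])  # 49
print(len(merged))            # 10
```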


## JSON Navigation

Up to now, you have learnt how to compose expressions so as to do some computations and to build objects and arrays. It also works the other way round: if you have some JSON data, you can access it and navigate it.

All you need to know is that JSONiq views:

- an array as an ordered list of values,
- an object as a set of name/value pairs.


### Objects

You can use the dot operator to retrieve the value associated with a key. Quotes around the key are optional, unless the key contains special characters such as spaces:

In [73]:
%%jsoniq
let $person := {
    "first name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person."first name"

"Sarah"


You can also ask for all keys in an object:

In [74]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "keys" : [ keys($person)] }

{
  "keys": [
    "name",
    "age",
    "gender",
    "friends"
  ]
}


### Arrays

The [[]] operator retrieves the entry at the given position (positions start at 1, as the example shows):

In [75]:
%%jsoniq
let $friends := [ "Jim", "Mary", "Jennifer"]
return $friends[[1+1]]

"Mary"


It is also possible to get the size of an array:

In [76]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "how many friends" : size($person.friends) }

{
  "how many friends": 3
}
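
The navigation operations seen so far map onto familiar Python operations on dicts and lists. The sketch below is an informal correspondence for orientation only. The main trap is the index base: the [[]] operator counts from 1, while Python lists count from 0.

```python
person = {
    "name": "Sarah",
    "age": 13,
    "gender": "female",
    "friends": ["Jim", "Mary", "Jennifer"],
}

name = person["name"]           # $person.name
key_names = list(person)        # keys($person)
friends = person["friends"]     # $person.friends
second = friends[2 - 1]         # $person.friends[[2]], shifted to a 0-based index
how_many = len(friends)         # size($person.friends)

print(name, key_names, second, how_many)
```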


Finally, the [] operator returns all elements in an array, as a sequence:

In [77]:
%%jsoniq
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person.friends[]

"Jim"
"Mary"
"Jennifer"


### Relational Algebra

Do you remember SQL's SELECT FROM WHERE statements? JSONiq inherits selection, projection and join capability from XQuery, too.

In [78]:
%%jsoniq
let $stores :=
[
    { "store number" : 1, "state" : "MA" },
    { "store number" : 2, "state" : "MA" },
    { "store number" : 3, "state" : "CA" },
    { "store number" : 4, "state" : "CA" }
]
let $sales := [
    { "product" : "broiler", "store number" : 1, "quantity" : 20  },
    { "product" : "toaster", "store number" : 2, "quantity" : 100 },
    { "product" : "toaster", "store number" : 2, "quantity" : 50 },
    { "product" : "toaster", "store number" : 3, "quantity" : 50 },
    { "product" : "blender", "store number" : 3, "quantity" : 100 },
    { "product" : "blender", "store number" : 3, "quantity" : 150 },
    { "product" : "socks", "store number" : 1, "quantity" : 500 },
    { "product" : "socks", "store number" : 2, "quantity" : 10 },
    { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
    for $store in $stores[], $sale in $sales[]
    where $store."store number" = $sale."store number"
    return {
        "nb" : $store."store number",
        "state" : $store.state,
        "sold" : $sale.product
    }
return [$join]

[
  {
    "nb": 1,
    "state": "MA",
    "sold": "broiler"
  },
  {
    "nb": 1,
    "state": "MA",
    "sold": "socks"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "toaster"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "toaster"
  },
  {
    "nb": 2,
    "state": "MA",
    "sold": "socks"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "toaster"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "blender"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "blender"
  },
  {
    "nb": 3,
    "state": "CA",
    "sold": "shirt"
  }
]
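
To see what the join computes, here is the same logic as a plain nested loop in Python over the same data. It is a sketch of the semantics only; an engine such as RumbleDB is free to pick a different join strategy.

```python
stores = [
    {"store number": 1, "state": "MA"},
    {"store number": 2, "state": "MA"},
    {"store number": 3, "state": "CA"},
    {"store number": 4, "state": "CA"},
]
sales = [
    {"product": "broiler", "store number": 1, "quantity": 20},
    {"product": "toaster", "store number": 2, "quantity": 100},
    {"product": "toaster", "store number": 2, "quantity": 50},
    {"product": "toaster", "store number": 3, "quantity": 50},
    {"product": "blender", "store number": 3, "quantity": 100},
    {"product": "blender", "store number": 3, "quantity": 150},
    {"product": "socks", "store number": 1, "quantity": 500},
    {"product": "socks", "store number": 2, "quantity": 10},
    {"product": "shirt", "store number": 3, "quantity": 10},
]

# for $store in $stores[], $sale in $sales[] where ... return { ... }
join = [
    {"nb": store["store number"], "state": store["state"], "sold": sale["product"]}
    for store in stores
    for sale in sales
    if store["store number"] == sale["store number"]
]

print(len(join))  # 9
print(join[0])    # {'nb': 1, 'state': 'MA', 'sold': 'broiler'}
```

Store 4 has no sales, so it does not appear in the result: this is an inner join.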


### Access datasets

RumbleDB can read input from many file systems and many file formats. If you are using our backend, you can only use json-doc() with any URI pointing to a JSON file and navigate it as you see fit. 

You can read data from your local disk, from S3, from HDFS, and also from the Web. For this tutorial, we'll read from the Web because, well, we are already on the Web.

We have put a sample at http://rumbledb.org/samples/products-small.json that contains 100,000 small objects like:



In [79]:
%%jsoniq
json-lines("http://rumbledb.org/samples/products-small.json", 10)[1]



The second parameter to json-lines, 10, indicates to RumbleDB that it should organize the data in ten partitions after downloading it, and process it in parallel. If you were reading from HDFS or S3, the parallelization of these partitions would be pushed down to the distributed file system.
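
To make the input format and the partition count more concrete, here is a small Python sketch. It assumes that the file is in JSON Lines format (one JSON object per line), as the function name suggests, and it uses a made-up three-line sample rather than the real file. The helpers `read_json_lines` and `partition` are hypothetical, and the round-robin split is only a stand-in: how RumbleDB and Spark actually assign items to partitions is not described here.

```python
import json

# Made-up sample in JSON Lines format: one object per line.
sample = """\
{"product": "toaster", "store-number": 1, "quantity": 10}
{"product": "phone", "store-number": 2, "quantity": 20}
{"product": "tv", "store-number": 3, "quantity": 30}
"""


def read_json_lines(text):
    """Parse each non-empty line as an independent JSON value."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def partition(items, n):
    """Split items into n chunks (round-robin) that could be processed independently."""
    return [items[i::n] for i in range(n)]


items = read_json_lines(sample)
parts = partition(items, 2)

print(len(items))               # 3
print([len(p) for p in parts])  # [2, 1]
```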

JSONiq supports the relational algebra. For example, you can do a selection with a where clause, like so:

In [80]:
%%jsoniq -pdf
declare type mytype as {
    "product" : "string",
    "store-number" : "int",
    "quantity" : "decimal"
};
validate type mytype* { 
    for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
    where $product.quantity ge 995
    return $product
}

     product  store-number                  quantity
0    toaster            97   997.0000000000000000000
1      phone           100  1000.0000000000000000000
2         tv            96   996.0000000000000000000
3      socks            99   999.0000000000000000000
4      shirt            95   995.0000000000000000000
..       ...           ...                       ...
595    socks           100  1000.0000000000000000000
596    shirt            96   996.0000000000000000000
597  toaster            99   999.0000000000000000000
598  blender            95   995.0000000000000000000
599       tv            98   998.0000000000000000000

[600 rows x 3 columns]


Notice that by default only the first 200 items are shown. In a typical setup, it is possible to output the result of a query to a distributed system, so it is also possible to output all the results if needed. In this case, however, as this is printed on your screen, it is more convenient not to materialize the entire sequence.

For a projection, there is project():

In [81]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
return project($product, ("store-number", "product"))

The query output 600 items, which is too many to display. Displaying the first 5 items:
{
  "store-number": 97,
  "product": "toaster"
}
{
  "store-number": 100,
  "product": "phone"
}
{
  "store-number": 96,
  "product": "tv"
}
{
  "store-number": 99,
  "product": "socks"
}
{
  "store-number": 95,
  "product": "shirt"
}


You can also page the results (like OFFSET and LIMIT in SQL) with a count clause followed by a where clause:

In [82]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
where $product.quantity ge 995
count $c
where $c gt 10 and $c le 20
return project($product, ("store-number", "product"))

The query output 10 items, which is too many to display. Displaying the first 5 items:
{
  "store-number": 95,
  "product": "blender"
}
{
  "store-number": 98,
  "product": "tv"
}
{
  "store-number": 97,
  "product": "shirt"
}
{
  "store-number": 100,
  "product": "toaster"
}
{
  "store-number": 96,
  "product": "blender"
}
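
The count clause numbers the items that survived the first where clause, starting at 1, and the second where clause keeps a window of those numbers. A Python sketch of the same idea follows. The `page` helper and its `offset` and `limit` parameters are hypothetical names chosen to match the SQL analogy.

```python
def page(items, offset, limit):
    """Keep items whose 1-based position c satisfies offset < c <= offset + limit."""
    return [
        item
        for c, item in enumerate(items, start=1)   # count $c
        if offset < c <= offset + limit            # where $c gt 10 and $c le 20
    ]


items = list(range(1, 31))   # stand-in for the filtered products
window = page(items, offset=10, limit=10)

print(window)  # [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```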


JSONiq also supports grouping with a group by clause:

In [83]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
return {
    "store" : $store-number,
    "count" : count($product)
}

                                                                                

The query output 100 items, which is too many to display. Displaying the first 5 items:
{
  "store": 1,
  "count": 1000
}
{
  "store": 2,
  "count": 1000
}
{
  "store": 3,
  "count": 1000
}
{
  "store": 4,
  "count": 1000
}
{
  "store": 5,
  "count": 1000
}


You can also order the results with an order by clause:

In [84]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "count" : count($product)
}

The query output 100 items, which is too many to display. Displaying the first 5 items:
{
  "store": 1,
  "count": 1000
}
{
  "store": 2,
  "count": 1000
}
{
  "store": 3,
  "count": 1000
}
{
  "store": 4,
  "count": 1000
}
{
  "store": 5,
  "count": 1000
}
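
After a group by clause, the variable $product no longer refers to a single object but to the sequence of all objects in the group, which is why count($product) gives the group size. The Python sketch below mimics grouping, counting and ordering on a small made-up dataset (the five records are hypothetical, not taken from the sample file).

```python
from collections import Counter

# Hypothetical records with the same fields as the sample dataset.
products = [
    {"product": "shirt", "store-number": 2, "quantity": 5},
    {"product": "toaster", "store-number": 1, "quantity": 3},
    {"product": "shirt", "store-number": 1, "quantity": 7},
    {"product": "toaster", "store-number": 1, "quantity": 1},
    {"product": "phone", "store-number": 2, "quantity": 2},
]

# group by $store-number ... count($product)
counts = Counter(p["store-number"] for p in products)

# order by $store-number ascending
result = [{"store": store, "count": counts[store]} for store in sorted(counts)]

print(result)  # [{'store': 1, 'count': 3}, {'store': 2, 'count': 2}]
```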


JSONiq supports denormalized data, so you are not forced to aggregate after grouping. You can also nest data, like so:

In [85]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ distinct-values($product.product) ]
}

The query output 100 items, which is too many to display. Displaying the first 5 items:
{
  "store": 1,
  "products": [
    "shirt",
    "toaster",
    "phone",
    "blender",
    "tv",
    "socks",
    "broiler"
  ]
}
{
  "store": 2,
  "products": [
    "shirt",
    "toaster",
    "phone",
    "blender",
    "tv",
    "socks",
    "broiler"
  ]
}
{
  "store": 3,
  "products": [
    "shirt",
    "toaster",
    "phone",
    "blender",
    "tv",
    "socks",
    "broiler"
  ]
}
{
  "store": 4,
  "products": [
    "shirt",
    "toaster",
    "phone",
    "blender",
    "tv",
    "socks",
    "broiler"
  ]
}
{
  "store": 5,
  "products": [
    "shirt",
    "toaster",
    "phone",
    "blender",
    "tv",
    "socks",
    "broiler"
  ]
}


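Or you can combine nesting and aggregation, here keeping the first ten products of each store and summing all quantities: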

In [86]:
%%jsoniq
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ project($product[position() le 10], ("product", "quantity")) ],
    "inventory" : sum($product.quantity)
}

The query output 100 items, which is too many to display. Displaying the first 5 items:
{
  "store": 1,
  "products": [
    {
      "product": "shirt",
      "quantity": 901
    },
    {
      "product": "toaster",
      "quantity": 801
    },
    {
      "product": "phone",
      "quantity": 701
    },
    {
      "product": "blender",
      "quantity": 601
    },
    {
      "product": "tv",
      "quantity": 501
    },
    {
      "product": "socks",
      "quantity": 401
    },
    {
      "product": "broiler",
      "quantity": 301
    },
    {
      "product": "shirt",
      "quantity": 201
    },
    {
      "product": "toaster",
      "quantity": 101
    },
    {
      "product": "phone",
      "quantity": 1
    }
  ],
  "inventory": 451000
}
{
  "store": 2,
  "products": [
    {
      "product": "shirt",
      "quantity": 602
    },
    {
      "product": "toaster",
      "quantity": 502
    },
    {
      "product": "phone",
      "quantity": 402
    },
    {
      "product":

That's it! You now know the basics of JSONiq. You can also download the RumbleDB jar and run it on your own laptop, or [on a Spark cluster, reading data from and writing it to HDFS](https://rumble.readthedocs.io/en/latest/Run%20on%20a%20cluster/), etc.


In [87]:
%%jsoniq -df
1+1


No DataFrame available as no schema was automatically detected. If you still believe the output is structured enough, you could add a schema and validate expression explicitly to your query.

This is an example of how you can simply define a schema and wrap your query in a validate expression:

declare type local:mytype as {
    "product" : "string",
    "store-number" : "int",
    "quantity" : "decimal"
};
validate type local:mytype* { 
    for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
    where $product.quantity ge 995
    return $product
}

RumbleDB keeps getting improved and automatic schema detection will improve as new versions get released. But even when RumbleDB fails to detect a schema, you can always declare your own schema as shown above.

For more information, see the documentation at https://docs.rumbledb.org/rumbledb-reference/types


In [88]:
%%jsoniq
1+1

2


In [89]:
rumble.lastResult.json()

(2,)

In [94]:
from jsoniq import RumbleSession
import pandas as pd

# The syntax to start a session is similar to that of Spark.
# A RumbleSession is a SparkSession that additionally knows about RumbleDB.
# All attributes and methods of SparkSession are also available on RumbleSession. 

rumble = RumbleSession.builder.getOrCreate();

# Just to improve readability when invoking Spark methods
# (such as spark.sql() or spark.createDataFrame()).
spark = rumble

##############################
###### Your first query ######
##############################

# Even though RumbleDB uses Spark internally, it can be used without any knowledge of Spark.

# Executing a query is done with rumble.jsoniq() like so. A query returns a sequence
# of items, here the sequence with just the integer item 2.
items = rumble.jsoniq('1+1')

# A sequence of items can simply be converted to a list of Python/JSON values with json().
# Since there is only one value in the sequence output by this query,
# we get a singleton list with the integer 2.
# Generally though, the results may contain zero, one, two, or more items.
python_list = items.json()
print(python_list)

############################################
##### More complex, standalone queries #####
############################################

# JSONiq is very powerful and expressive. You will find tutorials as well as a reference on JSONiq.org.

seq = rumble.jsoniq("""

let $stores :=
[
  { "store number" : 1, "state" : "MA" },
  { "store number" : 2, "state" : "MA" },
  { "store number" : 3, "state" : "CA" },
  { "store number" : 4, "state" : "CA" }
]
let $sales := [
   { "product" : "broiler", "store number" : 1, "quantity" : 20  },
   { "product" : "toaster", "store number" : 2, "quantity" : 100 },
   { "product" : "toaster", "store number" : 2, "quantity" : 50 },
   { "product" : "toaster", "store number" : 3, "quantity" : 50 },
   { "product" : "blender", "store number" : 3, "quantity" : 100 },
   { "product" : "blender", "store number" : 3, "quantity" : 150 },
   { "product" : "socks", "store number" : 1, "quantity" : 500 },
   { "product" : "socks", "store number" : 2, "quantity" : 10 },
   { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
  for $store in $stores[], $sale in $sales[]
  where $store."store number" = $sale."store number"
  return {
    "nb" : $store."store number",
    "state" : $store.state,
    "sold" : $sale.product
  }
return [$join]
""");

print(seq.json());

seq = rumble.jsoniq("""
for $product in json-lines("http://rumbledb.org/samples/products-small.json", 10)
group by $store-number := $product.store-number
order by $store-number ascending
return {
    "store" : $store-number,
    "products" : [ distinct-values($product.product) ]
}
""");
print(seq.json());

############################################################
###### Binding JSONiq variables to Python values ###########
############################################################

# It is possible to bind a JSONiq variable to a tuple of native Python values
# and then use it in a query.
# In JSONiq, variables are bound to sequences of items, just like the results of JSONiq
# queries are sequences of items.
# A Python tuple will be seamlessly converted to a sequence of items by the library.
# Currently we only support strs, ints, floats, booleans, None, lists, and dicts.
# But if you need more (like date, bytes, etc) we will add them without any problem.
# JSONiq has a rich type system.
 
rumble.bind('$c', (1,2,3,4, 5, 6))
print(rumble.jsoniq("""
for $v in $c
let $parity := $v mod 2
group by $parity
return { switch($parity)
         case 0 return "even"
         case 1 return "odd"
         default return "?" : $v
}
""").json())

rumble.bind('$c', ([1,2,3],[4,5,6]))
print(rumble.jsoniq("""
for $i in $c
return [
  for $j in $i
  return { "foo" : $j }
]
""").json())

rumble.bind('$c', ({"foo":[1,2,3]},{"foo":[4,{"bar":[1,False, None]},6]}))
print(rumble.jsoniq('{ "results" : $c.foo[[2]] }').json())

# It is possible to bind only one value. Then it must be provided as a singleton tuple.
# This is because in JSONiq, an item is the same as a sequence of one item.
rumble.bind('$c', (42,))
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

# For convenience and code readability, you can also use bindOne().
rumble.bindOne('$c', 42)
print(rumble.jsoniq('for $i in 1 to $c return $i*$i').json())

##########################################################
##### Binding JSONiq variables to pandas DataFrames ######
##### Getting the output as a Pandas DataFrame      ######
##########################################################

# Creating a dummy pandas dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [30,25,35]};
pdf = pd.DataFrame(data);

# Binding a pandas dataframe
rumble.bind('$a',pdf);
seq = rumble.jsoniq('$a.Name')
# Getting the output as a pandas dataframe
print(seq.pdf())


################################################
##### Using Pyspark DataFrames with JSONiq #####
################################################

# The power users can also interface our library with pyspark DataFrames.
# JSONiq sequences of items can have billions of items, and our library supports this
# out of the box: it can also run on clusters on AWS Elastic MapReduce for example.
# But your laptop is just fine, too: it will spread the computations on your cores.
# You can bind a DataFrame to a JSONiq variable. JSONiq will recognize this
# DataFrame as a sequence of object items.

# Create a data frame also similar to Spark (but using the rumble object).
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)];
columns = ["Name", "Age"];
df = spark.createDataFrame(data, columns);

# This is how to bind a JSONiq variable to a dataframe. You can bind as many variables as you want.
rumble.bind('$a', df);

# This is how to run a query. This is similar to spark.sql().
# Since variable $a was bound to a DataFrame, it is automatically declared as an external variable
# and can be used in the query. In JSONiq, it is logically a sequence of objects.
res = rumble.jsoniq('$a.Name');

# There are several ways to collect the outputs, depending on the user needs but also
# on the query supplied.
# This returns a list containing one or several of "DataFrame", "RDD", "PUL", "Local"
# If DataFrame is in the list, df() can be invoked.
# If RDD is in the list, rdd() can be invoked.
# If Local is in the list, items() or json() can be invoked, as well as the local iterator API.
modes = res.availableOutputs();
for mode in modes:
    print(mode)

#########################################################
###### Manipulating DataFrames with SQL and JSONiq ######
#########################################################

# If the output of the JSONiq query is structured (i.e., RumbleDB was able to detect a schema),
# then we can extract a regular data frame that can be further processed with spark.sql() or rumble.jsoniq().
df = res.df();
df.show();

# We are continuously working on the detection of schemas and RumbleDB will get better at it with time.
# JSONiq is a very powerful language and can also produce heterogeneous output "by design". Then you need
# to use rdd() instead of df(), or to collect the list of JSON values (see further down). Remember
# that availableOutputs() tells you what is at your disposal.

# A DataFrame output by JSONiq can be reused as input to a Spark SQL query.
# (Remember that rumble is a wrapper around a SparkSession object, so you can use rumble.sql() just like spark.sql())
df.createTempView("myview")
df2 = spark.sql("SELECT * FROM myview").toDF("name");
df2.show();

# A DataFrame output by Spark SQL can be reused as input to a JSONiq query.
rumble.bind('$b', df2);
seq2 = rumble.jsoniq("for $i in 1 to 5 return $b");
df3 = seq2.df();
df3.show();

# And a DataFrame output by JSONiq can be reused as input to another JSONiq query.
rumble.bind('$b', df3);
seq3 = rumble.jsoniq("$b[position() lt 3]");
df4 = seq3.df();
df4.show();

#########################
##### Local access ######
#########################

# This materializes the rows as items.
# The items are accessed with the RumbleDB Item API.
# (The name item_list avoids shadowing Python's built-in list.)
item_list = res.items()
for result in item_list:
    print(result.getStringValue())

# This streams through the items one by one
res.open();
while (res.hasNext()):
    print(res.next().getStringValue());
res.close();

################################################################################################################
###### Native Python/JSON Access for bypassing the Item API (but losing on the richer JSONiq type system) ######
################################################################################################################

# This method directly gets the result as JSON (dict, list, strings, ints, etc).
jlist = res.json();
for value in jlist:
    print(value)

# This streams through the JSON values one by one.
res.open();
while(res.hasNext()):
    print(res.nextJSON());
res.close();

# This gets an RDD of JSON values that can be processed by Python
rdd = res.rdd();
print(rdd.count());
for value in rdd.take(10):
    print(value)

###################################################
###### Write back to the disk (or data lake) ######
###################################################

# It is also possible to write the output to a file locally or on a cluster. The API is similar to that of Spark dataframes.
# Note that it creates a directory and stores the (potentially very large) output in a sharded directory.
# RumbleDB was already tested with up to 64 AWS machines and 100s of TBs of data.
# Of course the examples below are so small that it makes more sense to process the results locally with Python,
# but this shows how GBs or TBs of data obtained from JSONiq can be written back to disk.
seq = rumble.jsoniq("$a.Name");
seq.write().mode("overwrite").json("outputjson");
seq.write().mode("overwrite").parquet("outputparquet");

seq = rumble.jsoniq("1+1");
seq.write().mode("overwrite").text("outputtext");

(2,)
([{'nb': 1, 'state': 'MA', 'sold': 'broiler'}, {'nb': 1, 'state': 'MA', 'sold': 'socks'}, {'nb': 2, 'state': 'MA', 'sold': 'toaster'}, {'nb': 2, 'state': 'MA', 'sold': 'toaster'}, {'nb': 2, 'state': 'MA', 'sold': 'socks'}, {'nb': 3, 'state': 'CA', 'sold': 'toaster'}, {'nb': 3, 'state': 'CA', 'sold': 'blender'}, {'nb': 3, 'state': 'CA', 'sold': 'blender'}, {'nb': 3, 'state': 'CA', 'sold': 'shirt'}],)
({'store': 1, 'products': ['shirt', 'toaster', 'phone', 'blender', 'tv', 'socks', 'broiler']}, {'store': 2, 'products': ['shirt', 'toaster', 'phone', 'blender', 'tv', 'socks', 'broiler']}, {'store': 3, 'products': ['shirt', 'toaster', 'phone', 'blender', 'tv', 'socks', 'broiler']}, {'store': 4, 'products': ['shirt', 'toaster', 'phone', 'blender', 'tv', 'socks', 'broiler']}, {'store': 5, 'products': ['shirt', 'toaster', 'phone', 'blender', 'tv', 'socks', 'broiler']}, {'store': 6, 'products': ['toaster', 'phone', 'blender', 'tv', 'socks', 'broiler', 'shirt']}, {'store': 7, 'products': ['

                                                                                

+-----+
| name|
+-----+
|Alice|
|  Bob|
+-----+

Alice
Bob
Charlie
Alice
Bob
Charlie
Alice
Bob
Charlie
"Alice"
"Bob"
"Charlie"
3
Alice
Bob
Charlie


In [95]:
rumble.getRumbleConf().setPrintIteratorTree(False)
rumble.getRumbleConf().setResultSizeCap(200)

JavaObject id=o1512

In [96]:
%%jsoniq -pdf
for $i in 1 to 5 return $b



       name
0     Alice
1       Bob
2   Charlie
3     Alice
4       Bob
..      ...
70      Bob
71  Charlie
72    Alice
73      Bob
74  Charlie

[75 rows x 1 columns]


                                                                                

In [4]:
df.createOrReplaceTempView("myinput")

NameError: name 'df' is not defined

In [5]:
df.show()

NameError: name 'df' is not defined

In [6]:
rumble.jsoniq('table("myinput")').json()

Py4JJavaError: An error occurred while calling o34.runQuery.
: org.rumbledb.exceptions.CannotRetrieveResourceException: There was an error on line 1 in file:/Users/ghislain/Code/rumble/:

table("myinput")
^

Code: [FODC0002]
Message: Table myinput not found in hive catalogue.
Metadata: file:/Users/ghislain/Code/rumble/:LINE:1:COLUMN:0:
This code can also be looked up in the documentation and specifications for more information.

	at org.rumbledb.compiler.InferTypeVisitor.tryAnnotateSpecificFunctions(InferTypeVisitor.java:658)
	at org.rumbledb.compiler.InferTypeVisitor.visitFunctionCall(InferTypeVisitor.java:770)
	at org.rumbledb.compiler.InferTypeVisitor.visitFunctionCall(InferTypeVisitor.java:1)
	at org.rumbledb.expressions.primary.FunctionCallExpression.accept(FunctionCallExpression.java:78)
	at org.rumbledb.expressions.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:120)
	at org.rumbledb.expressions.AbstractNodeVisitor.visitDescendants(AbstractNodeVisitor.java:126)
	at org.rumbledb.compiler.InferTypeVisitor.visitStatementsAndOptionalExpr(InferTypeVisitor.java:2928)
	at org.rumbledb.compiler.InferTypeVisitor.visitStatementsAndOptionalExpr(InferTypeVisitor.java:1)
	at org.rumbledb.expressions.scripting.statement.StatementsAndOptionalExpr.accept(StatementsAndOptionalExpr.java:46)
	at org.rumbledb.expressions.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:120)
	at org.rumbledb.expressions.AbstractNodeVisitor.visitDescendants(AbstractNodeVisitor.java:126)
	at org.rumbledb.expressions.AbstractNodeVisitor.defaultAction(AbstractNodeVisitor.java:132)
	at org.rumbledb.expressions.AbstractNodeVisitor.visitProgram(AbstractNodeVisitor.java:468)
	at org.rumbledb.expressions.scripting.Program.accept(Program.java:33)
	at org.rumbledb.expressions.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:120)
	at org.rumbledb.expressions.AbstractNodeVisitor.visitDescendants(AbstractNodeVisitor.java:126)
	at org.rumbledb.compiler.InferTypeVisitor.visitMainModule(InferTypeVisitor.java:2529)
	at org.rumbledb.compiler.InferTypeVisitor.visitMainModule(InferTypeVisitor.java:1)
	at org.rumbledb.expressions.module.MainModule.accept(MainModule.java:93)
	at org.rumbledb.expressions.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:120)
	at org.rumbledb.compiler.VisitorHelpers.inferTypes(VisitorHelpers.java:60)
	at org.rumbledb.compiler.VisitorHelpers.parseJSONiqMainModule(VisitorHelpers.java:251)
	at org.rumbledb.compiler.VisitorHelpers.parseMainModule(VisitorHelpers.java:174)
	at org.rumbledb.compiler.VisitorHelpers.parseMainModuleFromQuery(VisitorHelpers.java:161)
	at org.rumbledb.api.Rumble.runQuery(Rumble.java:66)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:840)


In [108]:
rumble.sql("SELECT * FROM myinput").show()

+-------+
|  value|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+



In [32]:
%%jsoniq -u
declare type local:mytype as { "foo" : "string", "bar" : "int" };
create collection table("test3") with (validate type local:mytype* { {"foo" : "foo", "bar":1},{"foo":"bar", "bar":2 } })

Updates applied successfully.


In [33]:
rumble.sql("SELECT * FROM test3").show()

+---+---+-----+--------+---------------+------+-------------+
|foo|bar|rowID|rowOrder|mutabilityLevel|pathIn|tableLocation|
+---+---+-----+--------+---------------+------+-------------+
|foo|  1|    0|     0.0|              0|      |        test3|
|bar|  2|    1|     1.0|              0|      |        test3|
+---+---+-----+--------+---------------+------+-------------+



In [40]:
%%jsoniq -j
table("test3")

{
  "foo": "foo",
  "bar": 3
}
{
  "foo": "bar",
  "bar": 4
}
{
  "foo": "foo",
  "bar": 1
}
{
  "foo": "bar",
  "bar": 2
}


In [48]:
%%jsoniq -pdf
table("test3")

   foo  bar  rowID      rowOrder  mutabilityLevel pathIn tableLocation
0  foo    3      6 -96296.296296                0                test3
1  bar    4      7 -92592.592592                0                test3
2  foo    3      4 -88888.888889                0                test3
3  bar    4      5 -77777.777778                0                test3
4  foo    3      2 -66666.666667                0                test3
5  bar    4      3 -33333.333334                0                test3
6  foo    1      0      0.000000                0                test3
7  bar    2      1      1.000000                0                test3


In [49]:
%%jsoniq -df
table("test3")

+---+---+-----+-------------+---------------+------+-------------+
|foo|bar|rowID|     rowOrder|mutabilityLevel|pathIn|tableLocation|
+---+---+-----+-------------+---------------+------+-------------+
|foo|  3|    6|-96296.296296|              0|      |        test3|
|bar|  4|    7|-92592.592592|              0|      |        test3|
|foo|  3|    4|-88888.888889|              0|      |        test3|
|bar|  4|    5|-77777.777778|              0|      |        test3|
|foo|  3|    2|-66666.666667|              0|      |        test3|
|bar|  4|    3|-33333.333334|              0|      |        test3|
|foo|  1|    0|          0.0|              0|      |        test3|
|bar|  2|    1|          1.0|              0|      |        test3|
+---+---+-----+-------------+---------------+------+-------------+



In [47]:
%%jsoniq -u
declare type local:mytype as { "foo" : "string", "bar" : "int" };
insert (validate type local:mytype* { {"foo" : "foo", "bar":3},{"foo":"bar", "bar":4 } }) first into collection table("test3")

Updates applied successfully.


In [44]:
rumble.sql("SELECT * FROM test3").show()

+---+---+-----+-------------+---------------+------+-------------+
|foo|bar|rowID|     rowOrder|mutabilityLevel|pathIn|tableLocation|
+---+---+-----+-------------+---------------+------+-------------+
|foo|  3|    4|-88888.888889|              0|      |        test3|
|bar|  4|    5|-77777.777778|              0|      |        test3|
|foo|  3|    2|-66666.666667|              0|      |        test3|
|bar|  4|    3|-33333.333334|              0|      |        test3|
|foo|  1|    0|          0.0|              0|      |        test3|
|bar|  2|    1|          1.0|              0|      |        test3|
+---+---+-----+-------------+---------------+------+-------------+
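
The `rowOrder` column appears to be what keeps the collection in sequence order. Comparing this six-row result with the eight-row listing of `table("test3")` shown earlier, the two objects added by `insert ... first` (rowID 6 and 7) carry `rowOrder` values below the previous minimum. The sketch below checks that reading in plain Python, with the values copied from the two outputs. It is an interpretation of the displayed numbers, not a description of RumbleDB's internals, and the variable names are chosen for illustration only.

```python
# rowOrder values copied from the two outputs of test3 (rowID -> rowOrder).
before_insert = {4: -88888.888889, 5: -77777.777778, 2: -66666.666667,
                 3: -33333.333334, 0: 0.0, 1: 1.0}
after_insert = dict(before_insert)
after_insert.update({6: -96296.296296, 7: -92592.592592})

# Row IDs that only exist after the "insert ... first" update.
new_ids = sorted(set(after_insert) - set(before_insert))

# Sorting by rowOrder should place the new rows ahead of all older rows.
ordered_ids = sorted(after_insert, key=after_insert.get)
new_rows_first = ordered_ids[:len(new_ids)] == new_ids

print(ordered_ids)
print(new_rows_first)
```

The resulting rowID order can be compared against the order in which `table("test3")` lists its rows.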



In [30]:
%%jsoniq -u
table("test3")

No Pending Update List (PUL) available to apply.
The query output 6 items, which is too many to display. Displaying the first 5 items:
{
  "foo": "foo",
  "bar": 3
}
{
  "foo": "bar",
  "bar": 4
}
{
  "foo": "foo",
  "bar": 3
}
{
  "foo": "bar",
  "bar": 4
}
{
  "foo": "foo",
  "bar": 1
}


In [31]:
%%jsoniq -u
delete collection table("test3")

Updates applied successfully.


In [50]:
%%jsoniq -u
declare type local:mytype as { "foo" : "string", "bar" : "int" };
create collection delta-file("test3") with (validate type local:mytype* { {"foo" : "foo", "bar":1},{"foo":"bar", "bar":2 } })

Updates applied successfully.


In [54]:
%%jsoniq -pdf
delta-file("spark-warehouse/test3")

   foo  bar        rowID      rowOrder  mutabilityLevel pathIn  \
0  foo    3            0 -88888.888889                0          
1  bar    4            1 -77777.777778                0          
2  foo    3   8589934592 -96296.296296                0          
3  bar    4   8589934593 -92592.592592                0          
4  foo    3  17179869184 -66666.666667                0          
5  bar    4  17179869185 -33333.333334                0          
6  bar    2  25769803776      1.000000                0          
7  foo    1  34359738368      0.000000                0          

                                       tableLocation  
0  file:/Users/ghislain/Code/rumble/spark-warehou...  
1  file:/Users/ghislain/Code/rumble/spark-warehou...  
2  file:/Users/ghislain/Code/rumble/spark-warehou...  
3  file:/Users/ghislain/Code/rumble/spark-warehou...  
4  file:/Users/ghislain/Code/rumble/spark-warehou...  
5  file:/Users/ghislain/Code/rumble/spark-warehou...  
6  file:/Users/ghisl

In [67]:
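# Drop any previous copy of the table so the cell can be re-run.
rumble.sql("""
DROP TABLE IF EXISTS sample_table
""")

# The table has a rowOrder column in addition to the data columns name and age.
rumble.sql("""
CREATE TABLE sample_table (
    rowOrder DOUBLE,
    name STRING,
    age INT
)
""")

# Insert three rows with rowOrder values 1, 2 and 3.
rumble.sql("""
INSERT INTO sample_table VALUES
    (1, 'Alice', 30),
    (2, 'Bob', 25),
    (3, 'Charlie', 35)
""")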

DataFrame[]

In [68]:
rumble.sql("SELECT * FROM sample_table").show()

+--------+-------+---+
|rowOrder|   name|age|
+--------+-------+---+
|     3.0|Charlie| 35|
|     1.0|  Alice| 30|
|     2.0|    Bob| 25|
+--------+-------+---+



In [71]:
%%jsoniq -j
table("sample_table")


{
  "name": "Alice",
  "age": 30
}
{
  "name": "Bob",
  "age": 25
}
{
  "name": "Charlie",
  "age": 35
}
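
Comparing the two outputs for `sample_table` suggests that `table()` treats `rowOrder` as bookkeeping: the SQL query prints the column and returns the rows in an arbitrary order (Charlie, Alice, Bob), while the JSONiq result omits the column and lists the objects in ascending `rowOrder`. A minimal sketch of that behaviour in plain Python follows. The helper `as_sequence` is hypothetical and only mimics what the outputs show; it is not part of the `jsoniq` library.

```python
# Rows copied from the SELECT * output above, in the order Spark printed them.
sql_rows = [
    {"rowOrder": 3.0, "name": "Charlie", "age": 35},
    {"rowOrder": 1.0, "name": "Alice", "age": 30},
    {"rowOrder": 2.0, "name": "Bob", "age": 25},
]

def as_sequence(rows, order_key="rowOrder"):
    """Sort rows by the order column, then drop that column from each object."""
    ordered = sorted(rows, key=lambda row: row[order_key])
    return [{k: v for k, v in row.items() if k != order_key} for row in ordered]

# Hypothetical helper applied to the copied rows.
items = as_sequence(sql_rows)
for item in items:
    print(item)
```

If this reading is right, a table created directly through SQL can be given a sequence order for JSONiq simply by including a `rowOrder` column, as the `CREATE TABLE` cell above does.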
