# Motivation

## Where we are

In several previous chapters, we studied the storage and processing of
denormalized data: the syntax, the models, the validation, data frames.

We also looked at document stores, which are database systems
that manage denormalized data not as a data lake, but as ETL-based
collections with an optimized storage format hidden from the user and
additional managed features such as indices that make queries faster
and document-level atomicity.

If we now look at the stack we have built for denormalized data,
both as data lakes and as managed database systems, it is not fully
satisfactory: an important component is still missing, namely
the query language. With what we have covered so far, users are left
with two options to handle denormalized datasets:

- They can use an API within an imperative host language (e.g.,
Pandas in Python, or the MongoDB API in JavaScript, or the
Spark RDD API in Java or Scala).

- Or they can push SQL, including ad-hoc extensions to support
nestedness, to its limits.


APIs are unsatisfactory for complex analytics use cases. They are
very convenient and suitable for data engineers who implement more
data management layers on top of these APIs, but they are not suitable
for end users who want to run queries to analyse data.

There is agreement in the database community that SQL is the more
satisfactory option when the data is flat and homogeneous (relational
tables). Take the following query for example:

```sql
SELECT foo
FROM input
WHERE bar = 'foobar';
```

This is much simpler to write than the following lower-level equivalents written against APIs.

With Spark RDDs:

```python
import json

# Assumes an existing SparkContext named sc
rdd1 = sc.textFile("hdfs:///input.json")
rdd2 = rdd1.map(lambda line: json.loads(line))
rdd3 = rdd2.filter(lambda obj: obj["bar"] == "foobar")
rdd4 = rdd3.map(lambda obj: obj["foo"])
rdd4.saveAsTextFile("hdfs:///output.json")
```

With the Spark DataFrame API:

```python
df1 = spark.read.json("hdfs:///input.json")
df2 = df1.filter(df1["bar"] == "foobar")
df3 = df2.select(df2["foo"])
df3.show()
```

Even when SQL is nested in a host language, additional logic is still needed to access the collection:

```python
df1 = spark.read.json("hdfs:///input.json")
df1.createOrReplaceTempView("input")
df2 = spark.sql("SELECT foo FROM input WHERE bar = 'foobar'")
df2.show()
```

SQL, possibly extended with a few dots, lateral view syntax and explode-like functions, works nicely for the simplest use cases. But as soon as more complex functionality is needed (e.g., the dataset is nested up to a depth of 10, or the user would like to denormalize a dataset from relational tables to a single, nested collection, or the user would like to explore and discover a heterogeneous dataset), this approach becomes intractable. At best, it leads to gigantic and hard-to-read SQL queries. At worst, there is no way to express the use case in SQL. In both cases, the user ends up writing most of the code in an imperative language, invoking the lower-level API or nesting and chaining simple blocks of SQL. A concrete real-world example is the high-energy-physics community, which works with DataFrame APIs rather than SQL even though its (nested) data is homogeneous.

Here are a few examples of use cases that are simple enough to be
manageable in Spark SQL, although they require some effort to be read
and understood:

```sql
SELECT *
FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
```

and 

```sql
SELECT key, values, collect_list(value + 1) AS values_plus_one
FROM nested_data
LATERAL VIEW explode(values) T AS value
GROUP BY key, values
```

But let us look at another use case that is simple to express: in the GitHub dataset, for each event, what are all the commits by the top-committer within this event?

In SQL (here, in the BigQuery dialect), this is what the query looks like:

```sql
WITH Commits AS (
    SELECT
        FORMAT('%s-%i', created_at, ROW_NUMBER() OVER(PARTITION BY created_at)) AS event_id,
        payload.shas
    FROM `bigquery-public-data.samples.github_nested`
    WHERE ARRAY_LENGTH(payload.shas) > 0
),
CommitterFrequency AS (
    SELECT
        Commits.event_id AS event_id,
        actor_email,
        COUNT(*) AS commit_count
    FROM Commits, UNNEST(shas)
    GROUP BY event_id, actor_email
),
MaxCommitterFrequency AS (
    SELECT
        event_id,
        MAX(commit_count) AS commit_count
    FROM CommitterFrequency
    GROUP BY event_id
),
TopCommitters AS (
    SELECT
        c.event_id,
        ANY_VALUE(c.actor_email) AS actor_email
    FROM CommitterFrequency c,
         MaxCommitterFrequency m
    WHERE c.event_id = m.event_id AND
          c.commit_count = m.commit_count
    GROUP BY c.event_id
),
TopCommitterCommits AS (
    SELECT c.event_id, commits
    FROM Commits c,
         UNNEST(shas) AS commits,
         TopCommitters tc
    WHERE c.event_id = tc.event_id AND
          commits.actor_email = tc.actor_email
)
SELECT ARRAY_AGG(commits) AS shas
FROM TopCommitterCommits
GROUP BY event_id;
```

In the language we will study in this chapter for denormalized data, this is what the query looks like. As you can see, it is much more compact and easier to read:

```python
for $e in $events
let $top-committer := (
  for $c in $e.commits[]
  group by $c.author
  stable order by count($c) descending
  return $c.author)[1]
return [
  $e.commits[][$$.author eq $top-committer]
]
```

This language is called JSONiq and it is tailor-made for denormalized data. It offers a data-independent layer on top of both data lakes
and ETL-based, database management systems, similar to what SQL offers for (flat and homogeneous) relational tables.

98% of JSONiq is directly the same as a W3C standard, XQuery, which is a language offering this functionality for XML datasets. This functionality is the fruit of more than 20 years of work by a two-digit-sized working group from many different companies, many of them with extensive SQL experience (or themselves SQL editors), who carefully discussed every single corner case, leading to a long and precise, publicly available specification. JSONiq is basically XQuery with JSON instead of XML, similar to how one could bake a blueberry cake by using a strawberry cake recipe and simply replacing the strawberries with blueberries. This is a reminder that JSON and XML are very similar when it comes to querying, because both models are based on tree structures. JSONiq was born during the working group discussions on how to add support for maps and arrays to the language and became a standalone language optimized specifically for JSON. XQuery in its latest version supports maps and arrays, and is best suited to an environment where both XML and JSON co-exist, which is out of scope in this course.

## Denormalized data

What do we mean by denormalized data? Let us simply recall that
it is characterized by two features: nestedness and heterogeneity.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/280.png" alt="image" style="width: 60%; height: auto;">
</div>

Consider the following example of a JSON Lines dataset (note that
the objects are displayed on multiple lines only so it fits on the printed
page, in reality they would each be on a single line):

<div style="text-align: left;">

```json
{
    "Name" : { "First" : "Albert", "Last" : "Einstein" },
    "Countries" : [ "D", "I", "CH", "A", "BE", "US" ]
}
{
    "Name" : { "First" : "Srinivasa", "Last" : "Ramanujan" },
    "Countries" : [ "IN", "UK" ]
}
{
    "Name" : { "First" : "Kurt", "Last" : "Gödel" },
    "Countries" : [ "CZ", "A", "US" ]
}
{
    "Name" : { "First" : "John", "Last" : "Nash" },
    "Countries" : "US"
}
{
    "Name" : { "First" : "Alan", "Last" : "Turing" },
    "Countries" : "UK"
}
{
    "Name" : { "First" : "Maryam", "Last" : "Mirzakhani" },
    "Countries" : [ "IR", "US" ]
}
{
    "Name" : "Pythagoras",
    "Countries" : [ "GR" ]
}
{
    "Name" : { "First" : "Nicolas", "Last" : "Bourbaki" },
    "Number" : 9,
    "Countries" : null
}
```
</div> <br>

If one wants to put it in a DataFrame in order to use Spark SQL, this is what one will get:

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/281.png" alt="image" style="width: 60%; height: auto;">
</div> 

As can be seen, the columns “Name” and “Countries” are typed as string, because the system could not deal with the fact that they contain a mix of atomic and structured types. In effect, the burden of dealing with heterogeneity is pushed up to the end user, who is forced to write a lot of additional code in the host language (Python, Java...) to parse these strings back one by one and decide what to do. This, in turn, is likely to force the user to make heavy use of UDFs (User-Defined Functions), which are black boxes containing user code that can be called from SQL. But UDFs are very inefficient compared to native SQL execution: first, they need to be registered and shipped to all nodes in the cluster (which not all distributed processing technologies can do efficiently), and second, the SQL optimizer has no idea of what is inside them, which prevents many otherwise possible optimizations from kicking in.

In fact, denormalized datasets should not be seen as “broken tables pushed to their limits”, but rather as collections of trees.

The GitHub archive dataset is a good illustration of this: it contains 2,900,000,000 events, each as a JSON document, taking 7.6 TB of space uncompressed. 10% of all the paths (you can think of them as “data frame columns” although, for a heterogeneous dataset, viewing it as a
data frame is not very suitable) have mixed types. Furthermore, there are 1,300 such paths in total, although each event only uses 100 of them. One could think of fitting this into relational tables or dataframes with 1,300 attributes, but 1,300 is already beyond what many relational database systems can handle reasonably well.
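
To make the notion of a path with mixed types concrete, here is a minimal Python sketch (not part of any of the systems discussed) that scans the eight example objects above and reports, for each top-level key, which JSON types occur. The helper names `json_type` and `types_per_key` are made up for this illustration.

```python
import json
from collections import defaultdict

# The eight objects of the example above, one per line (JSON Lines).
JSON_LINES = """\
{"Name": {"First": "Albert", "Last": "Einstein"}, "Countries": ["D", "I", "CH", "A", "BE", "US"]}
{"Name": {"First": "Srinivasa", "Last": "Ramanujan"}, "Countries": ["IN", "UK"]}
{"Name": {"First": "Kurt", "Last": "Gödel"}, "Countries": ["CZ", "A", "US"]}
{"Name": {"First": "John", "Last": "Nash"}, "Countries": "US"}
{"Name": {"First": "Alan", "Last": "Turing"}, "Countries": "UK"}
{"Name": {"First": "Maryam", "Last": "Mirzakhani"}, "Countries": ["IR", "US"]}
{"Name": "Pythagoras", "Countries": ["GR"]}
{"Name": {"First": "Nicolas", "Last": "Bourbaki"}, "Number": 9, "Countries": null}
"""

def json_type(value):
    """Return the JSON type name of a parsed value (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def types_per_key(lines):
    """Map each top-level key to the set of JSON types it takes."""
    seen = defaultdict(set)
    for line in lines.splitlines():
        if line.strip():
            for key, value in json.loads(line).items():
                seen[key].add(json_type(value))
    return dict(seen)

types = types_per_key(JSON_LINES)
# Keys whose values do not all share one type.
mixed = sorted(key for key, kinds in types.items() if len(kinds) > 1)
print(mixed)
```

On this small dataset, both “Name” and “Countries” come out as mixed, which are exactly the two columns that a data frame would fall back to typing as strings.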

## Features of a query language

A query language for datasets has three main features.

### Declarative

First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Thus, the database engine enjoys the flexibility to figure out the most efficient and fastest plan of execution to return the results.

### Functional

Second, it is functional. This means that the query language is made of composable expressions that nest with each other, like Lego bricks. Many, but not all, expressions can be seen as functions that take as input the output of their child expressions, and send their output to their parent expression. However, the syntax of a good functional language should look nothing like a simple chain of function calls with parentheses and lambdas everywhere (this would then be an API, not a query language; examples of APIs are the Spark transformation APIs or Pandas): rather, expression syntax is carefully and naturally designed for ease of writing and reading. Complementing the expressions (typically 20 or 30 different kinds), a rich function library (this time, with actual function call syntax) completes them into a fully functional language.

### Set-based

Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row). In spite of the set-based terminology, set-based languages can still have bag or list semantics, in that they can allow for duplicates and sequences might be ordered on the logical level.

## Query languages for denormalized data

The landscape for denormalized data querying is very different from that of structured, relational data: indeed, for structured data, SQL is undisputed.

For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones are XQuery and JSONiq, but there are now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), and RQL. One day, we expect the market to consolidate.

But the good news is that these languages share common features. In this course, we focus on JSONiq for several reasons:

- It is fully documented;
 
- Most of its syntax, semantics, function library and type system rely on a W3C standard (XPath/XQuery), meaning that a group of 30+ very smart people with expertise and experience in SQL went through every corner case to define the language;
 
- It has several independent implementations.

Having learned JSONiq, it will be very easy for the reader to learn any one of the other languages in the future.

## JSONiq as a data calculator

The smoothest start with JSONiq is to understand it as a data calculator.

Run the cell below to connect to your Rumble server.

In [1]:
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://rumble:9090/jsoniq

env: RUMBLEDB_SERVER=http://rumble:9090/jsoniq


In particular, it can perform arithmetic:

In [2]:
%%jsoniq

1 + 1 

Took: 0.0337522029876709 ms
2


In [3]:
%%jsoniq

3+2*4

Took: 0.034287214279174805 ms
11


but also comparison and logic:

In [4]:
%%jsoniq

2 < 5 

Took: 0.03225231170654297 ms
true


It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding:

In [5]:
%%jsoniq

let $i := 2
return $i + 1

Took: 0.04697990417480469 ms
3


It also supports all JSON values. Any copy-pasted JSON value literally returns itself:

In [6]:
%%jsoniq

[ 1, 2, 3 ]

Took: 0.042824506759643555 ms
[1, 2, 3]


In [7]:
%%jsoniq

{ "foo" : 1 }

Took: 0.0674588680267334 ms
{"foo": 1}


Things start to become interesting with object and array navigation, with dots and square brackets:

In [8]:
%%jsoniq

{ "foo" : 1 }.foo

Took: 0.06833863258361816 ms
1


In [9]:
%%jsoniq

[3, 4, 5][[1]]

Took: 0.060555458068847656 ms
3


In [10]:
%%jsoniq

{ "foo" : [ 3, 4, 5 ] }.foo[[1]] + 3

Took: 0.043822526931762695 ms
6


Another difference with a calculator is that a query can return multiple items, as a sequence:

In [11]:
%%jsoniq

{ "foo" : [ 3, 4, 5 ] }.foo[]

Took: 0.04198503494262695 ms
3
4
5


In [12]:
%%jsoniq

1 to 4

Took: 0.038079023361206055 ms
1
2
3
4


In [13]:
%%jsoniq

for $i in 3 to 5
return { string($i) : $i * $i }

Took: 0.05246424674987793 ms
{"3": 9}
{"4": 16}
{"5": 25}


In [14]:
%%jsoniq

for $i in { "foo" : [ 3, 4, 5 ] }.foo[]
return { string($i) : $i * $i }

Took: 0.04096269607543945 ms
{"3": 9}
{"4": 16}
{"5": 25}
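
For readers who come from Python, the last two queries can be loosely compared to list comprehensions. The sketch below is only an analogy for building intuition (JSONiq sequences are not Python lists); the variable names are made up for this illustration.

```python
# Rough Python counterpart of:
#   for $i in 3 to 5 return { string($i) : $i * $i }
# JSONiq's "3 to 5" includes both bounds, hence range(3, 6).
squares = [{str(i): i * i} for i in range(3, 6)]

# The second query iterates over the members of the array under key "foo".
doc = {"foo": [3, 4, 5]}
squares_from_doc = [{str(i): i * i} for i in doc["foo"]]

for item in squares:
    print(item)
```

The main difference is that the JSONiq version is an expression that composes with any other expression, whereas the Python version is tied to the host language's data structures.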


And, unlike a calculator, it can access storage (data lakes, the Web, etc):

```python
keys(
  for $i in json-file("s3://bucket/myfiles/json/*")
  return $i
)
```

```python
keys(
  for $i in parquet-file(
    "s3://bucket/myfiles/parquet"
  )
  return $i
)
```

# The JSONiq Data Model

Every expression of the JSONiq “data calculator” returns a sequence of items. Always.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/282.png" alt="image" style="width: 60%; height: auto;">
</div> 


An item can be either an object, an array, an atomic item, or a function item. For the purpose of this course, we ignore function items, but it might interest the reader to know that function items can be used to represent Machine Learning models, as JSONiq can be used for training and testing models as well.

Atomic items can be any of the “core” JSON types: strings, numbers (integers, decimals, doubles...), booleans and nulls, but JSONiq has a much richer type system, which we covered in Chapter 7 in the context of both JSound and XML Schema.

Sequences of items are flat, in that sequences cannot be nested. But they scale massively and can contain billions or trillions of items. The only way to nest lists is to use arrays (which can be recursively nested). If you read the previous chapters, you can think of sequences as being similar to RDDs in Spark, or to JSON lines documents with billions of JSON objects.

Sequences can be homogeneous (e.g., a million integers, or a million JSON objects valid against a flat schema, logically equivalent
to a relational table) or heterogeneous (a messy list of various items: objects, arrays, integers, etc).

One item is logically the same as a sequence of one item. Thus, when we say that 1+1 returns 2, it in fact means that it returns a singleton sequence with one item, which is the integer 2.

A sequence can also be empty. Caution, the empty sequence is not the same logically as a null item.
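
The rules above can be mimicked with a small Python mental model. This is a sketch under our own assumptions, not how a JSONiq engine represents sequences: sequences are modelled as tuples, arrays as lists, objects as dicts and null as `None`, and the helper `seq` is made up for this illustration. One limitation of the model is that Python distinguishes the value `2` from the tuple `(2,)`, whereas JSONiq does not.

```python
def seq(*parts):
    """Build a flat sequence (a tuple) from items and other sequences."""
    flat = []
    for part in parts:
        if isinstance(part, tuple):   # a sequence: splice its items in
            flat.extend(seq(*part))
        else:                         # an item: atomic, array (list), object (dict) or null (None)
            flat.append(part)
    return tuple(flat)

# Sequences never nest: inner sequences are flattened away.
nested = seq(1, seq(2, 3), seq(), seq(seq(4)))
print(nested)

# Arrays, on the other hand, are items and keep their nesting.
with_array = seq([1, [2, 3]])
print(with_array)

# The empty sequence has no items; a sequence holding null has one.
print(len(seq()), len(seq(None)))
```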

# Navigation

A good angle to start with JSONiq is to take a large dataset and discover its structure. The online sandbox has examples of this for the GitHub archive dataset, which is continuously growing. Each hour of log can be downloaded from URIs with the pattern

https://data.gharchive.org/2022-11-01-0.json.gz

where you can pick the year, month, date and hour of the day.

For the purpose of this textbook, we will pick made-up data patterns to illustrate our point. Let us consider the following JSON document, consisting of a single, large JSON object (possibly on multiple lines, as is common for single JSON documents). Let us assume it is named file.json.

In [15]:
import json

# Define the JSON object 
json_obj = {
  "o": [
    {
      "a": {
        "b": [
          { "c": 1, "d": "a" }
        ]
      }
    },
    {
      "a": {
        "b": [
          { "c": 1, "d": "f" },
          { "c": 2, "d": "b" }
        ]
      }
    },
    {
      "a": {
        "b": [
          { "c": 4, "d": "e" },
          { "c": 8, "d": "d" },
          { "c": 3, "d": "c" }
        ]
      }
    },
    {
      "a": {
        "b": []
      }
    },
    {
      "a": {
        "b": [
          { "c": 3, "d": "h" },
          { "c": 9, "d": "z" }
        ]
      }
    },
    {
      "a": {
        "b": [
          { "c": 4, "d": "g" }
        ]
      }
    },
    {
      "a": {
        "b": [
          { "c": 3, "d": "l" },
          { "c": 1, "d": "m" },
          { "c": 0, "d": "k" }
        ]
      }
    }
  ]
}

# Create the JSON file 
with open("file.json", "w") as file:
    json.dump(json_obj, file, indent=3)

We can open this document and return its contents.

In [16]:
%%jsoniq 

json-doc("file.json")

Took: 0.10423088073730469 ms
{"o": [{"a": {"b": [{"c": 1, "d": "a"}]}}, {"a": {"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}}, {"a": {"b": [{"c": 4, "d": "e"}, {"c": 8, "d": "d"}, {"c": 3, "d": "c"}]}}, {"a": {"b": []}}, {"a": {"b": [{"c": 3, "d": "h"}, {"c": 9, "d": "z"}]}}, {"a": {"b": [{"c": 4, "d": "g"}]}}, {"a": {"b": [{"c": 3, "d": "l"}, {"c": 1, "d": "m"}, {"c": 0, "d": "k"}]}}]}


To display it in a more visual format, we run the following cell.

In [17]:
# Read the JSON file
with open("file.json", "r") as file:
    data = json.load(file)

# Print the contents of the JSON file
print(json.dumps(data, indent=3))

{
   "o": [
      {
         "a": {
            "b": [
               {
                  "c": 1,
                  "d": "a"
               }
            ]
         }
      },
      {
         "a": {
            "b": [
               {
                  "c": 1,
                  "d": "f"
               },
               {
                  "c": 2,
                  "d": "b"
               }
            ]
         }
      },
      {
         "a": {
            "b": [
               {
                  "c": 4,
                  "d": "e"
               },
               {
                  "c": 8,
                  "d": "d"
               },
               {
                  "c": 3,
                  "d": "c"
               }
            ]
         }
      },
      {
         "a": {
            "b": []
         }
      },
      {
         "a": {
            "b": [
               {
                  "c": 3,
                  "d": "h"
               },
               {
                  "c

We are going to start our dataset exploration with JSON navigation. Navigating semi-structured data is several decades old and was pioneered on XML with XPath. JSON navigation uses similar ideas, but is considerably simpler than XML navigation. The general idea of navigation is that it is possible to “dive” into the nested data with dots and square brackets (originally, these were slashes with XPath) – all in parallel: starting with an original collection of objects (or, possibly, a single document), each step (i.e., for each dot and each pair of square brackets) goes down the nested structure and returns another sequence of nested items.

The experience feels like scanning the entire collection, moving down the nested structure in parallel<sup>1</sup>. Some steps might massively increase the sequence size (i.e., when unboxing arrays); some other steps might on the contrary contract the sequence to a smaller one (i.e., when filtering, or in the presence of heterogeneity when parts of the collection go deeper than others).

---

1. In XPath, it is possible to also move up the nested structure (e.g., with the
.. syntax, similar to what is used in file system paths to go to parent directories),
because there are backpointers to parent nodes, or to move aside to siblings. This
is achieved with so-called “axes.” This feature does not typically exist in JSON
models, in which one only goes down the trees to the children and descendants.

## Object lookups (dot syntax)

It is possible to navigate into objects with dots, similar to object-oriented programming. For example, this is how we can get the value associated with key o in the document (which is a sequence of one object):

In [18]:
%%jsoniq 

json-doc("file.json").o

Took: 0.08969354629516602 ms
[{"a": {"b": [{"c": 1, "d": "a"}]}}, {"a": {"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}}, {"a": {"b": [{"c": 4, "d": "e"}, {"c": 8, "d": "d"}, {"c": 3, "d": "c"}]}}, {"a": {"b": []}}, {"a": {"b": [{"c": 3, "d": "h"}, {"c": 9, "d": "z"}]}}, {"a": {"b": [{"c": 4, "d": "g"}]}}, {"a": {"b": [{"c": 3, "d": "l"}, {"c": 1, "d": "m"}, {"c": 0, "d": "k"}]}}]


The above result can be displayed in a more readable way as follows:

```json
[
  { "a": { "b": [ { "c": 1, "d": "a" } ] } },
  { "a": { "b": [ { "c": 1, "d": "f" }, { "c": 2, "d": "b" } ] } },
  { "a": { "b": [ { "c": 4, "d": "e" }, { "c": 8, "d": "d" }, { "c": 3, "d": "c" } ] } },
  { "a": { "b": [ ] } },
  { "a": { "b": [ { "c": 3, "d": "h" }, { "c": 9, "d": "z" } ] } },
  { "a": { "b": [ { "c": 4, "d": "g" } ] } },
  { "a": { "b": [ { "c": 3, "d": "l" }, { "c": 1, "d": "m" }, { "c": 0, "d": "k" } ] } }
]
```

This returned an array, more precisely, a sequence of *one* array item.

## Array unboxing (empty square bracket syntax)

We can unbox the array, meaning, extract its members as a sequence of seven object items, with empty square brackets, like so:

In [19]:
%%jsoniq 

json-doc("file.json").o[]

Took: 0.10631704330444336 ms
{"a": {"b": [{"c": 1, "d": "a"}]}}
{"a": {"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}}
{"a": {"b": [{"c": 4, "d": "e"}, {"c": 8, "d": "d"}, {"c": 3, "d": "c"}]}}
{"a": {"b": []}}
{"a": {"b": [{"c": 3, "d": "h"}, {"c": 9, "d": "z"}]}}
{"a": {"b": [{"c": 4, "d": "g"}]}}
{"a": {"b": [{"c": 3, "d": "l"}, {"c": 1, "d": "m"}, {"c": 0, "d": "k"}]}}


## Parallel navigation

The dot syntax, in fact, works on sequences, too. It will extract the value associated with a key in every object of the sequence (anything other than an object is ignored and thrown away):

In [20]:
%%jsoniq 

json-doc("file.json").o[].a

Took: 0.08327984809875488 ms
{"b": [{"c": 1, "d": "a"}]}
{"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}
{"b": [{"c": 4, "d": "e"}, {"c": 8, "d": "d"}, {"c": 3, "d": "c"}]}
{"b": []}
{"b": [{"c": 3, "d": "h"}, {"c": 9, "d": "z"}]}
{"b": [{"c": 4, "d": "g"}]}
{"b": [{"c": 3, "d": "l"}, {"c": 1, "d": "m"}, {"c": 0, "d": "k"}]}


Array unboxing works on sequences, too. Note how all the members are concatenated to a single, merged sequence, similar to a flatMap in Apache Spark.

In [21]:
%%jsoniq 

json-doc("file.json").o[].a.b[]

Took: 0.09607362747192383 ms
{"c": 1, "d": "a"}
{"c": 1, "d": "f"}
{"c": 2, "d": "b"}
{"c": 4, "d": "e"}
{"c": 8, "d": "d"}
{"c": 3, "d": "c"}
{"c": 3, "d": "h"}
{"c": 9, "d": "z"}
{"c": 4, "d": "g"}
{"c": 3, "d": "l"}
{"c": 1, "d": "m"}
{"c": 0, "d": "k"}
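
To build intuition for what each navigation step does, the following Python sketch emulates object lookup and array unboxing on sequences. It is an illustration only: a sequence is modelled as a Python list of items, the function names `lookup` and `unbox` are made up, and we assume that unboxing skips items that are not arrays, in the same way that the dot syntax skips items that are not objects.

```python
def lookup(sequence, key):
    """Dot syntax: the value under `key` for every object; other items are dropped."""
    return [item[key] for item in sequence
            if isinstance(item, dict) and key in item]

def unbox(sequence):
    """Empty square brackets: the members of every array, concatenated (like a flatMap)."""
    return [member for item in sequence
            if isinstance(item, list)
            for member in item]

# A shortened version of file.json: two non-empty members and the empty one.
doc = {"o": [
    {"a": {"b": [{"c": 1, "d": "a"}]}},
    {"a": {"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}},
    {"a": {"b": []}},
]}

# json-doc("file.json").o[].a.b[]
result = unbox(lookup(lookup(unbox(lookup([doc], "o")), "a"), "b"))
for item in result:
    print(item)
```

Note how the object whose array is empty contributes nothing to the result, and how the members of the other arrays end up in one merged sequence.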


## Filtering with predicates (simple square bracket syntax)

It is possible to filter any sequence with a predicate, where $$ in the predicate refers to the current item being tested. Let us only keep those objects that associate c with 3:

In [22]:
%%jsoniq 

json-doc("file.json").o[].a.b[][$$.c = 3]

Took: 0.1009969711303711 ms
{"c": 3, "d": "c"}
{"c": 3, "d": "h"}
{"c": 3, "d": "l"}


It is also possible to access the item at position n in a sequence with this same notation: if what is inside the square brackets is a Boolean, then it acts as a filtering predicate; if it is an integer, it acts as a position:

In [23]:
%%jsoniq 

json-doc("file.json").o[].a.b[][5]

Took: 0.1009361743927002 ms
{"c": 8, "d": "d"}
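
The two uses of single square brackets can be sketched in Python as two separate helpers. This is only an emulation for illustration: the names `predicate` and `at_position` are made up, a sequence is modelled as a Python list, and `$$` becomes the argument of a lambda.

```python
def predicate(sequence, test):
    """[Boolean expression]: keep the items for which `test` holds ($$ is the item)."""
    return [item for item in sequence if test(item)]

def at_position(sequence, n):
    """[integer]: the item at 1-based position n, or nothing if out of range."""
    return sequence[n - 1:n] if n >= 1 else []

# The twelve objects returned by json-doc("file.json").o[].a.b[]
items = [
    {"c": 1, "d": "a"}, {"c": 1, "d": "f"}, {"c": 2, "d": "b"},
    {"c": 4, "d": "e"}, {"c": 8, "d": "d"}, {"c": 3, "d": "c"},
    {"c": 3, "d": "h"}, {"c": 9, "d": "z"}, {"c": 4, "d": "g"},
    {"c": 3, "d": "l"}, {"c": 1, "d": "m"}, {"c": 0, "d": "k"},
]

# ...[$$.c = 3]
threes = predicate(items, lambda item: item.get("c") == 3)
# ...[5]
fifth = at_position(items, 5)
print(threes)
print(fifth)
```

Both helpers return a sequence: the positional one yields either one item or, when the position does not exist, the empty sequence.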


## Array lookup (double square bracket syntax)

To access the n-th member of an array, you can use double square brackets, like so:

In [24]:
%%jsoniq 

json-doc("file.json").o[[2]].a

Took: 0.09721922874450684 ms
{"b": [{"c": 1, "d": "f"}, {"c": 2, "d": "b"}]}


Like dot object navigation and unboxing, double square brackets (array navigation) work with sequences as well. For any array that has fewer members than the requested position, as well as for items that are not arrays, no items are contributed to the output:

In [25]:
%%jsoniq 

json-doc("file.json").o[].a.b[[2]]

Took: 0.0914607048034668 ms
{"c": 2, "d": "b"}
{"c": 8, "d": "d"}
{"c": 9, "d": "z"}
{"c": 1, "d": "m"}


## A common pitfall: Array lookup vs. Sequence predicates

Do not confuse sequence positions (single square brackets) with array positions (double square brackets)! The difference is easy to see on a simple example involving a sequence of two arrays with two members each:

In [26]:
%%jsoniq 

([1, 2], [3, 4])[2]

Took: 0.04576563835144043 ms
[3, 4]


In [27]:
%%jsoniq 

([1, 2], [3, 4])[[2]]

Took: 0.05052518844604492 ms
2
4
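
The difference can also be spelled out in Python. In this sketch (an emulation with made-up function names, where the outer Python list stands for the sequence and the inner lists for the arrays), the first helper picks one item out of the sequence, while the second picks one member out of every array in the sequence.

```python
def sequence_position(sequence, n):
    """Single brackets with an integer: the n-th item of the sequence (1-based)."""
    return sequence[n - 1:n] if n >= 1 else []

def array_lookup(sequence, n):
    """Double brackets: the n-th member (1-based) of every array that is long enough."""
    return [item[n - 1] for item in sequence
            if isinstance(item, list) and 1 <= n <= len(item)]

two_arrays = [[1, 2], [3, 4]]   # the sequence ([1, 2], [3, 4])

print(sequence_position(two_arrays, 2))   # one item: the second array
print(array_lookup(two_arrays, 2))        # two items: the second member of each array
```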


# Schema discovery

We now move on to further simple querying functionality for discovering datasets with an unknown structure.

## Collections

While there exist files that contain a single JSON document (or a single XML document), many datasets are in fact found in the form of large collections of smaller objects (as in document stores). 

Such collections are accessed with a function call together with a name or (if reading from a data lake) a path. In RumbleDB, a JSON Lines dataset is accessed with the function *json-file*.

In [28]:
%%jsoniq 

json-file("https://www.rumbledb.org/samples/git-archive.json")

Took: 8.631576299667358 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}
{"id": "70451

It is a good idea to look at the first object of a collection to get a rough idea of what the layout looks like (although there is always the risk of heterogeneity, and there is no guarantee all objects look the same):


In [29]:
%%jsoniq 

json-file("https://www.rumbledb.org/samples/git-archive.json")[1]

Took: 10.593615531921387 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}


One can also look at the top N objects using the position function in a predicate, which returns the position in the sequence of the current item being tested by the predicate (similar to the LIMIT clause in SQL):

In [30]:
%%jsoniq 

json-file("https://www.rumbledb.org/samples/git-archive.json")[position() le 5]

Took: 9.783782243728638 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}
{"id": "70451

## Getting all top-level keys

The *keys* function retrieves all keys. It can be called on the entire sequence of objects and will return all unique keys found (at the top level) in that collection:

In [31]:
%%jsoniq 

keys(json-file("https://www.rumbledb.org/samples/git-archive.json"))

Took: 12.732788801193237 ms
"repo"
"org"
"actor"
"public"
"type"
"created_at"
"id"
"payload"


## Getting unique values associated with a key

With dot object lookup, we can look at all the values associated with a key like so:

In [32]:
%%jsoniq 

json-file("https://www.rumbledb.org/samples/git-archive.json").type

Took: 9.45344352722168 ms
"PushEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"GollumEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"PushEvent"
"IssuesEvent"
"PushEvent"
"PullRequestEvent"
"WatchEvent"
"PushEvent"
"WatchEvent"
"PullRequestEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PushEvent"
"IssueCommentEvent"
"CreateEvent"
"IssuesEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"ForkEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"WatchEvent"
"PushEvent"
"DeleteEvent"
"PushEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"CreateEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PullRequestEvent"
"IssuesEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"PushEvent"
"PushEvent"
"WatchEvent"
"CreateEvent"
"PushEvent"
"PushEvent

With *distinct-values*, it is then possible to eliminate duplicates and look at unique values:

In [33]:
%%jsoniq 

distinct-values(json-file("https://www.rumbledb.org/samples/git-archive.json").type)

Took: 16.265569925308228 ms
"PullRequestEvent"
"MemberEvent"
"PushEvent"
"IssuesEvent"
"PublicEvent"
"CommitCommentEvent"
"ReleaseEvent"
"IssueCommentEvent"
"ForkEvent"
"GollumEvent"
"WatchEvent"
"PullRequestReviewCommentEvent"
"CreateEvent"
"DeleteEvent"


## Aggregations

Aggregations can be made on entire sequences with a single function call (this is like a SQL aggregation without a GROUP BY clause). The five basic functions are *count*, *sum*, *avg*, *min*, *max*. The *sum* and *avg* functions require numeric values and will otherwise throw an error; *min* and *max* require values that can be compared with one another.

In [34]:
%%jsoniq 

count(distinct-values(json-file("https://www.rumbledb.org/samples/git-archive.json").type))

Took: 13.541920900344849 ms
14


In [35]:
%%jsoniq 

count(json-file("https://www.rumbledb.org/samples/git-archive.json"))

Took: 19.22917628288269 ms
36577
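
For readers who think in Python, the calls above can be mirrored on a list of dictionaries. This is only a rough sketch: the `events` list is a made-up, three-object stand-in for the dataset, and the helper functions are hypothetical names that imitate the JSONiq builtins.

```python
# Hypothetical miniature stand-in for the git-archive collection.
events = [
    {"id": "1", "type": "PushEvent", "public": True},
    {"id": "2", "type": "WatchEvent", "public": True},
    {"id": "3", "type": "PushEvent", "org": {"id": 7}},
]

def keys(objects):
    """Unique top-level keys across all objects, in order of first appearance."""
    seen = {}
    for obj in objects:
        for key in obj:
            seen.setdefault(key, None)
    return list(seen)

def distinct_values(values):
    """Values with duplicates removed, in order of first appearance."""
    return list(dict.fromkeys(values))

# Like .type on the whole sequence: objects without the key contribute nothing.
types = [event["type"] for event in events if "type" in event]

print(keys(events))                  # ['id', 'type', 'public', 'org']
print(distinct_values(types))        # ['PushEvent', 'WatchEvent']
print(len(distinct_values(types)))   # 2, like count(distinct-values(...))
print(len(events))                   # 3, like count(...)
```

Note that the order in which JSONiq returns keys or distinct values is not necessarily the order of first appearance, as the outputs above show.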


# Construction

Let us now look into how to construct new values with JSONiq.

## Construction of atomic values

Atomic values that are core to JSON can be constructed with exactly the same syntax as JSON.

In [36]:
%%jsoniq 

"foo"

Took: 0.03772377967834473 ms
"foo"


In [37]:
%%jsoniq 

"This is a line\nand this is a new line"

Took: 0.034624338150024414 ms
"This is a line\nand this is a new line"


In [38]:
%%jsoniq 

42

Took: 0.0469212532043457 ms
42


In [39]:
%%jsoniq 

3.1415926535897932384626433832795028

Took: 0.039398193359375 ms
3.141592653589793


In [40]:
%%jsoniq 

-6.022E23

Took: 0.03641533851623535 ms
-6.022e+23


In [41]:
%%jsoniq 

true

Took: 0.043489694595336914 ms
true


In [42]:
%%jsoniq 

false

Took: 0.03895068168640137 ms
false


In [43]:
%%jsoniq 

null

Took: 0.039531707763671875 ms
null


For more specific types, a cast is needed. This works with any of the atomic types we covered in Chapter 7. There are two syntaxes for this:


In [44]:
%%jsoniq 

nonNegativeInteger("42")

Took: 0.04388833045959473 ms
42


In [45]:
%%jsoniq 

"42" cast as nonNegativeInteger

Took: 0.04434490203857422 ms
42


## Construction of objects and arrays

Objects and arrays are constructed with the same syntax as JSON. In fact, one can copy-paste any JSON value, and it will always be recognized as a valid JSONiq query returning that value.

In [46]:
%%jsoniq 

[
  { "foo" : "bar" },
  { "bar" : [ 1, 2, true, null ] }
]

Took: 0.04552102088928223 ms
[{"foo": "bar"}, {"bar": [1, 2, true, null]}]


It is also possible to build objects and arrays dynamically (with computed values, not known at compile time), as will be shown shortly when we discuss composability of expressions.

## Construction of sequences

Sequences can be constructed (and concatenated) using commas:


In [47]:
%%jsoniq 

[ 2, 3 ], true, "foo", { "f" : 1 }

Took: 0.048543453216552734 ms
[2, 3]
true
"foo"
{"f": 1}


Increasing sequences of integers can also be built with the *to* keyword:

In [48]:
%%jsoniq 

1 to 10

Took: 0.05591607093811035 ms
1
2
3
4
5
6
7
8
9
10


Another way to build sequences is with FLWOR expressions, covered a bit further down.

# Scalar expressions

Sequences of items can have any number of items. A few JSONiq expressions (arithmetic, logic, value comparison...) work on the particular case that a sequence has zero or one items.

## Arithmetic

JSONiq supports basic arithmetic: addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulo (mod). Note that division is written div, not with a slash (/).

If both sides have exactly one item, the semantics is relatively natural.

In [49]:
%%jsoniq 

1+1

Took: 0.05556178092956543 ms
2


In [50]:
%%jsoniq 

42-10

Took: 0.05889725685119629 ms
32


In [51]:
%%jsoniq 

6*7

Took: 0.06521224975585938 ms
42


In [52]:
%%jsoniq 

42.3 div 7.2

Took: 0.06293535232543945 ms
5.875


In [53]:
%%jsoniq 

42 idiv 9

Took: 0.037085533142089844 ms
4


In [54]:
%%jsoniq 

42 mod 9

Took: 0.03970003128051758 ms
6


If the data types are different, then conversions are made automatically:

- If one side is a double and the other side a float, then the float is converted to a double and a double is returned.

- If one side is a double and the other side a decimal (or integer, etc), then the decimal is converted to a double and a double is returned.

- If one side is a float and the other side a decimal (or integer, etc), then the decimal is converted to a float and a float is returned.

Note that an integer and a decimal are not considered different here, because an integer is a special case of decimal. Adding a decimal to an integer returns a decimal.

The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns an empty sequence (no error):

In [55]:
%%jsoniq 

() + 1

Took: 0.03874969482421875 ms
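
The treatment of the empty sequence can be sketched in Python by modelling a sequence as a list. The `add` function below is a hypothetical helper, not part of any library, and the error for sequences of more than one item is an assumption that follows from the zero-or-one-item rule stated at the start of this section.

```python
def add(left, right):
    """Addition over sequences modelled as Python lists of numbers."""
    if len(left) == 0 or len(right) == 0:
        return []                      # empty in, empty out (no error)
    if len(left) > 1 or len(right) > 1:
        # Assumption: arithmetic accepts at most one item on each side.
        raise TypeError("arithmetic needs at most one item on each side")
    return [left[0] + right[0]]

print(add([1], [1]))   # [2]
print(add([], [1]))    # []
```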


Arithmetic expressions also work with dates, times, and durations in a natural fashion (e.g., a date + a duration = a date).

In [56]:
%%jsoniq 

date("2024-12-06") - date ("2023-11-06")

Took: 0.03772306442260742 ms
"P396D"


In [57]:
%%jsoniq 

date("2024-12-06") + dayTimeDuration("P31D")

Took: 0.040322065353393555 ms
"2025-01-06"


If one of the sides (or both) is not a number, a date, a time, a dateTime, a duration, or the empty sequence, or the involved types are inconsistent with each other, then a type error is thrown.
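
The two date computations shown earlier in this section can be cross-checked with Python's standard `datetime` module. This is only a sanity check in another language: here `timedelta(days=31)` plays the role of the duration P31D.

```python
from datetime import date, timedelta

# date("2024-12-06") - date("2023-11-06")
difference = date(2024, 12, 6) - date(2023, 11, 6)
print(difference.days)                          # 396, matching "P396D"

# date("2024-12-06") + dayTimeDuration("P31D")
print(date(2024, 12, 6) + timedelta(days=31))   # 2025-01-06
```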

## String manipulation

String concatenation is done with a double vertical bar:

In [58]:
%%jsoniq 

"foo" || "bar"

Took: 0.041346073150634766 ms
"foobar"


Most other string manipulation primitives are available from the rich JSONiq builtin function library (itself relying on a W3C standard called XPath and XQuery Functions and Operators). The complete list of functions in this standard is available at https://www.w3.org/TR/xpath-functions/, although XML-related functions should be ignored, and JSONiq supports additional JSON-related functions (such as keys(), etc).

In [59]:
%%jsoniq 

concat("foo", "bar")

Took: 0.03968024253845215 ms
"foobar"


In [60]:
%%jsoniq 

string-join(("foo", "bar", "foobar"), "-")

Took: 0.040155887603759766 ms
"foo-bar-foobar"


In [61]:
%%jsoniq 

substring("foobar", 4, 3)

Took: 0.041104793548583984 ms
"bar"


In [62]:
%%jsoniq 

string-length("foobar")

Took: 0.04467415809631348 ms
6
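
One detail worth noticing in the calls above is that substring counts positions from 1, not from 0. The following Python sketch mimics the four functions for comparison. The `substring` helper is hypothetical and only handles integer arguments; the builtin has additional rules for non-integer arguments that are not modelled here.

```python
def substring(text, start, length):
    """Characters at 1-based positions p with start <= p < start + length."""
    begin = max(start, 1)               # positions below 1 do not exist
    end = start + length                # exclusive, still 1-based
    return text[begin - 1:max(end - 1, begin - 1)]

print("foo" + "bar")                          # foobar, like concat or ||
print("-".join(["foo", "bar", "foobar"]))     # foo-bar-foobar, like string-join
print(substring("foobar", 4, 3))              # bar
print(len("foobar"))                          # 6, like string-length
```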


## Value comparison

Sequences of one atomic item can be compared with eq (equal), ne (not equal), le (lower or equal), ge (greater or equal), lt (lower than) and gt (greater than).


In [63]:
%%jsoniq 

1 + 1 eq 2 

Took: 0.04405927658081055 ms
true


In [64]:
%%jsoniq 

6 * 7 ne 21 * 2 

Took: 0.03878426551818848 ms
false


In [65]:
%%jsoniq 

234 gt 123

Took: 0.038590192794799805 ms
true


If one of the two sides is the empty sequence, then the value comparison expression returns an empty sequence as well.

In [66]:
%%jsoniq 

() le 2

Took: 0.05056357383728027 ms


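Unlike in SQL, null is an ordinary atomic value in JSONiq and can be compared with other atomic values. It is considered smaller than any other atomic value, so the following comparison returns true.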

In [67]:
%%jsoniq 

null le 2

Took: 0.05005288124084473 ms
true


## Logic

JSONiq supports the three basic logic expressions and, or, and not. not has the highest precedence, then and, then or.

In [68]:
%%jsoniq 

1 + 1 eq 2 and (2 + 2 eq 4 or not 100 mod 5 eq 0)

Took: 0.04051971435546875 ms
true


JSONiq also supports universal and existential quantification:

In [69]:
%%jsoniq 

every $i in 1 to 10
satisfies $i gt 0

Took: 0.04870748519897461 ms
true


In [70]:
%%jsoniq 

some $i in 1 to 10
satisfies $i gt 5

Took: 0.04550766944885254 ms
true


Note that unlike SQL, JSONiq logic expressions are two-valued and return either true or false.

If one of the two sides is not a sequence of a single Boolean item, then implicit conversions are made. This mechanism is called the Effective Boolean Value (EBV). For example, an empty sequence, or a sequence of one empty string, or a sequence of one zero integer, is considered false. A sequence of one non-empty string, or a sequence of one non-zero integer, or a sequence starting with one object (or array) is considered true.
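
The cases listed above can be collected into a small Python sketch, with sequences modelled as lists and objects and arrays modelled as dicts and lists. The function name is hypothetical. Raising an error for several atomic items and for types not mentioned above is an assumption of this sketch, not a full statement of the rules.

```python
def effective_boolean_value(sequence):
    """Sketch of the EBV for a sequence modelled as a Python list."""
    if len(sequence) == 0:
        return False                       # empty sequence
    first = sequence[0]
    if isinstance(first, (dict, list)):
        return True                        # starts with an object or array
    if len(sequence) > 1:
        # Assumption: several atomic items have no EBV.
        raise TypeError("no EBV for a sequence of several atomic items")
    if isinstance(first, bool):            # test bool first: True is an int in Python
        return first
    if isinstance(first, str):
        return len(first) > 0              # empty string is false
    if isinstance(first, int):
        return first != 0                  # zero is false
    raise TypeError("case not covered by this sketch")

print(effective_boolean_value([]))            # False
print(effective_boolean_value([""]))          # False
print(effective_boolean_value([42]))          # True
print(effective_boolean_value([{"a": 1}]))    # True
```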

## General comparison

JSONiq has a shortcut for existential quantification on value comparisons.
This is called general comparison.

For example, consider this query:

In [71]:
%%jsoniq 

some $i in (1, 2, 3, 4, 5)
satisfies $i eq 1

Took: 0.04189181327819824 ms
true


It can be abbreviated to the shorter:

In [72]:
%%jsoniq 

(1, 2, 3, 4, 5) = 1

Took: 0.041162729263305664 ms
true


More generally,

In [73]:
%%jsoniq 

some $i in 1 to 5, $j in 3 to 10
satisfies $i gt $j

Took: 0.0414731502532959 ms
true


can be abbreviated to:

In [74]:
%%jsoniq 

1 to 5 > 3 to 10

Took: 0.03508949279785156 ms
true


In other words, = (resp. !=, <, >, <=, >=) is a shortcut for an existential quantification on both input sequences on the value comparison eq (resp. ne, lt, gt, le, ge).

In particular, errors are thrown for incompatible types, and false is returned if any side is the empty sequence.
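
To make the existential reading concrete, here is a minimal Python sketch in which sequences are modelled as iterables and the value comparison is passed in as a function. The name `general_comparison` is hypothetical.

```python
from itertools import product
import operator

def general_comparison(left, right, value_comparison):
    """True if some pair (i, j) from left x right satisfies the comparison."""
    return any(value_comparison(i, j) for i, j in product(left, right))

print(general_comparison([1, 2, 3, 4, 5], [1], operator.eq))        # True
print(general_comparison(range(1, 6), range(3, 11), operator.gt))   # True
print(general_comparison([], [1], operator.eq))                     # False
```

The last call shows why an empty side yields false: there is no pair to satisfy the comparison.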

General comparison is very convenient when scanning datasets and
looking for matching values.

In [75]:
%%jsoniq 

json-doc("file.json").o[].a.b[].c = 1

Took: 0.07933855056762695 ms
true


# Composability

JSONiq, as a functional language, is modular. This means that expressions can be combined at will, exactly like one would combine addition, multiplication, etc, at will.

Any expression can appear as the operand of any other expression. Of course, if the output of an expression is not compatible with what the parent expression expects, an error is thrown.

We show below an example and the successive details of the evaluation.

```python 
(1 + (({"b":[{"a": 2+2}]}).b[].a)) to 10  # Query 
(1 + (({"b":[{"a": 4}]}).b[].a)) to 10    # Step 1 
(1 + ([{"a": 4}][].a)) to 10              # Step 2 
(1 + ({"a": 4}.a)) to 10                  # Step 3 
(1 + 4) to 10                             # Step 4
5 to 10                                   # Step 5 
```

We can execute the query to verify the final result:

In [76]:
%%jsoniq 

(1 + (({"b":[{"a": 2+2}]}).b[].a)) to 10

Took: 0.03081965446472168 ms
5
6
7
8
9
10


Here is another example:

In [77]:
%%jsoniq 

{
  "attr" : string-length("foobar"),
  "values" : [
    for $i in 1 to 10
    return $i
  ]
}

Took: 0.03732180595397949 ms
{"attr": 6, "values": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}


Just like arithmetic, where multiplication has precedence over addition, expressions have a precedence order. Precedence can be easily overridden with parentheses. In practice, it is not realistic to know all precedences by heart, so that when in doubt, it is always a good idea to add parentheses to be on the safe side.

We list the expressions in increasing order of precedence below. Beware: the low precedence of the comma is a common pitfall.

| Precedence (low first)                          |
|-------------------------------------------------|
| Comma                                           |
| Data Flow (FLWOR, if-then-else, switch...)      |
| Logic                                           |
| Comparison                                      |
| String concatenation                            |
| Range                                           |
| Arithmetic                                      |
| Path expressions                                |
| Filter predicates, dynamic function calls      |
| Literals, constructors and variables           |
| Function calls, named function references, inline functions |

## Data flow

A few expressions give some control over the data flow by picking the output of this or that expression based on the value of another expression. These expressions look quite close to their counterparts in imperative languages (Python, Java...), but it is important to understand that they have functional semantics here.

This includes conditional expressions. If the expression inside the if returns true (or if its Effective Boolean Value is true), the result of the expression in the then clause is taken, otherwise, the result of the expression in the else clause is taken.

In [78]:
%%jsoniq 

if(count(json-doc("file.json").o) gt 1000)
then "Large file!"
else "Small file."

Took: 0.10240960121154785 ms
"Small file."


This also includes switch expressions. The expression inside the switch is evaluated and an error is thrown if more than one item is returned. Then, the resulting item is compared for equality with each one of the candidate values. The result of the expression corresponding to the first match is taken, and if there are no matches, the result of the default expression is taken.

In [79]:
%%jsoniq 

switch(json-doc("file.json").o[[1]].a.b[[1]].c)
case 1 return "one"
case 2 return "two"
default return "other"

Took: 0.09586381912231445 ms
"one"


Note that we covered data types and cardinality indicators (*, +, ?) in Chapter 7.

This also includes try-catch expressions. If the expression in the try clause is successfully evaluated, then its results are taken. If there is an error, then the results of the expression in the first catch clause matching the error are taken (* being the wildcard that matches any error).

In [80]:
%%jsoniq 

try {
   date(json-doc("file.json").o[[1]].a.b[[1]].c)
} catch * {
  "This is not a date!"
}

Took: 0.09519338607788086 ms
"This is not a date!"


# Binding variables with cascades of let clauses

Let us go back to this example:

In [81]:
%%jsoniq 

json-doc("file.json").o[].a.b[].c = 1

Took: 0.11407852172851562 ms
true


We can rewrite it by explicitly binding intermediate variables, like so:

In [82]:
%%jsoniq 

let $a := json-doc("file.json")
let $b := $a.o
let $c := $b[]
let $d := $c.a
let $e := $d.b
let $f := $e[]
let $g := $f.c
return $g = 1

Took: 0.10903334617614746 ms
true


Variables in JSONiq start with a dollar sign. This way of successively binding variables to compute intermediate results is typical of functional languages: OCaml, F#, Haskell, and others have a similar feature. It is important to understand that this is not a variable assignment that would *change* the value of a variable. This is only a declarative binding.

It is possible to reuse the same variable name, in which case the previous binding is hidden. Again, this is not an assignment or a change of value.

In [83]:
%%jsoniq 

let $a := json-doc("file.json")
let $a := $a.o
let $a := $a[]
let $a := $a.a
let $a := $a.b
let $a := $a[]
let $a := $a.c
return $a = 1

Took: 0.0991218090057373 ms
true


Each variable is visible in all subsequent let clauses, as well as in the final return clause (unless/until it is hidden by another let clause with the same variable name). It is not visible to any parent expressions. In particular, this query returns an error because the last reference to variable  $\$$a is not within the scope of the $\$$a binding:

In [84]:
%%jsoniq 

(let $a := json-doc("file.json")
let $b := $a.o
let $c := $b[]
let $d := $c.a
let $e := $d.b
let $f := $e[]
let $g := $f.c
return $g = 1
) + $a

Took: 0.07114148139953613 ms
There was an error on line 10 in file:/home/:

) + $a
    ^

Code: [XPST0008]
Message: Uninitialized variable reference: a
Metadata: file:/home/:LINE:10:COLUMN:4:
This code can also be looked up in the documentation and specifications for more information.



# FLWOR expressions

One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL; however, it is considerably more expressive and generic in several respects:

- In SQL, the clauses must appear in a specific order (SELECT, then FROM, then WHERE, then GROUP BY, then HAVING, then ORDER BY, then OFFSET, then LIMIT), although most are optional. In JSONiq the clauses can appear in any order with the exception of the first and last clause.

- JSONiq supports a let clause, which does not exist in SQL. Let clauses make it very convenient to write and organize more complex queries.

- In SQL, when iterating over multiple tables in the FROM clause, they “do not see each other”, i.e., the semantics is (logically) that of a Cartesian product. In JSONiq, for clauses (which correspond to FROM clauses in SQL) do see each other, meaning that it is possible to iterate over deeper and deeper levels of nesting by referring to a previous for variable. This is both easier to write and read than lateral views, and it is also more expressive.

- The semantics of FLWOR clauses is simple, clean, and inherently functional; it is based on tuple streams containing variable bindings, which flow from clause to clause. There is no “spooky action at a distance” such as the explode() function, which indirectly causes a duplication of rows in Spark SQL.

## Simple dataset

For the purpose of illustration, we will use a very simple dataset consisting of two JSON Lines files:

In [85]:
import json

# Data for products.json
products_data = [
    {"pid": 1, "type": "tv", "store": 1},
    {"pid": 2, "type": "tv", "store": 2},
    {"pid": 3, "type": "phone", "store": 2},
    {"pid": 4, "type": "tv", "store": 3},
    {"pid": 5, "type": "teapot", "store": 2},
    {"pid": 6, "type": "tv", "store": 1},
    {"pid": 7, "type": "teapot", "store": 2},
    {"pid": 8, "type": "phone", "store": 4}
]

# Data for stores.json
stores_data = [
    {"sid": 1, "country": "Switzerland"},
    {"sid": 2, "country": "Germany"},
    {"sid": 3, "country": "United States"}
]

# Create products.json in JSON Lines format
with open("products.json", "w") as products_file:
    for product in products_data:
        products_file.write(json.dumps(product) + "\n")

# Create stores.json in JSON Lines format
with open("stores.json", "w") as stores_file:
    for store in stores_data:
        stores_file.write(json.dumps(store) + "\n")


Note that store ID 4, which is referenced by one of the products, is intentionally missing from stores.json. This is for the purpose of showing what happens if there are no matches.

## For clauses

For clauses bind their variable in turn to each item of the provided expression. Here is an example:

In [86]:
%%jsoniq 

for $x in 1 to 10
return
  {
    "number": $x,
    "square": $x * $x
  }

Took: 0.0618894100189209 ms
{"number": 1, "square": 1}
{"number": 2, "square": 4}
{"number": 3, "square": 9}
{"number": 4, "square": 16}
{"number": 5, "square": 25}
{"number": 6, "square": 36}
{"number": 7, "square": 49}
{"number": 8, "square": 64}
{"number": 9, "square": 81}
{"number": 10, "square": 100}


In the above query, the variable $x is bound with 1, then with 2, then with 3, etc, and finally with 10. It is always bound with a sequence of exactly one item. It is, however, possible to bind it with an empty sequence if the expression returns no items. This is done with “allowing empty”.

In [87]:
%%jsoniq 

for $x allowing empty in ()
return count($x)

Took: 0.05500984191894531 ms
0


Note that, without “allowing empty”, if the expression in the for clause evaluates to an empty sequence, the variable would not bind to anything at all and the FLWOR expression would simply return an empty sequence.

In [88]:
%%jsoniq 

for $x in ()
return count($x)

Took: 0.07542657852172852 ms


Each variable binding is also more generally called a tuple. In the above examples, there is only one variable binding in each tuple ($\$$x), but it is possible to build larger tuples with more clauses. For example, this FLWOR expression involves two for clauses. The tuples after the first for clause and before the second one only bind variable $\$$x (to 1, then to 2, then to 3), but the tuples after the second for clause and before the return clause bind variables $\$$x and $\$$y. There are six tuples in total, because the second for clause expands each incoming tuple to zero, one or more tuples (think of a flatMap transformation in Spark for an analogy).

In [89]:
%%jsoniq 

for $x in 1 to 3
for $y in 1 to $x
return [ $x, $y ]

Took: 0.04943656921386719 ms
[1, 1]
[2, 1]
[2, 2]
[3, 1]
[3, 2]
[3, 3]
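
The flatMap analogy can be made concrete with a minimal Python sketch in which each tuple is a dictionary from variable names to values. The `for_clause` helper is hypothetical and only illustrates the semantics; it says nothing about how an engine actually evaluates the query.

```python
def for_clause(tuples, variable, expression):
    """Expand each incoming tuple into one tuple per item of the expression."""
    for incoming in tuples:
        for item in expression(incoming):
            yield {**incoming, variable: item}

initial = [{}]                                              # a single empty tuple
after_x = for_clause(initial, "x", lambda t: range(1, 4))   # for $x in 1 to 3
after_y = for_clause(after_x, "y",
                     lambda t: range(1, t["x"] + 1))        # for $y in 1 to $x
result = [[t["x"], t["y"]] for t in after_y]                # return [ $x, $y ]

print(result)
# [[1, 1], [2, 1], [2, 2], [3, 1], [3, 2], [3, 3]]
```

The second expression receives the incoming tuple as its argument, which is how the second for clause can "see" the variable bound by the first one.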


Now if we use our small example dataset, we can iterate on all objects, say, products:

In [90]:
%%jsoniq 

for $product in json-file("products.json")
return $product.type

Took: 0.44768404960632324 ms
"tv"
"tv"
"phone"
"tv"
"teapot"
"tv"
"teapot"
"phone"


It can thus be seen that the for clause is akin to the FROM clause in SQL, and the return is akin to the SELECT clause.

Projection in JSONiq can be made with a project() function call, with the keys to keep:

In [91]:
%%jsoniq 

for $product in json-file("products.json")
return project($product, ("type", "store"))

Took: 0.46694183349609375 ms
{"type": "tv", "store": 1}
{"type": "tv", "store": 2}
{"type": "phone", "store": 2}
{"type": "tv", "store": 3}
{"type": "teapot", "store": 2}
{"type": "tv", "store": 1}
{"type": "teapot", "store": 2}
{"type": "phone", "store": 4}


It is possible to implement a join with a sequence of two for clauses and a predicate (note that newlines in JSONiq are irrelevant, so we spread the for clause on two lines in order to fit the query on this page):

In [92]:
%%jsoniq 

for $product in json-file("products.json")
for $store in json-file("stores.json")[$$.sid eq $product.store]
return {
  "product" : $product.type,
  "country" : $store.country
}

Took: 0.7625768184661865 ms
{"product": "tv", "country": "Switzerland"}
{"product": "tv", "country": "Switzerland"}
{"product": "tv", "country": "Germany"}
{"product": "phone", "country": "Germany"}
{"product": "teapot", "country": "Germany"}
{"product": "teapot", "country": "Germany"}
{"product": "tv", "country": "United States"}


Note that allowing empty can be used to perform a left outer join, i.e., to account for the case when there are no matching records in the second collection:

In [93]:
%%jsoniq 

for $product in json-file("products.json")
for $store allowing empty in json-file("stores.json")[$$.sid eq $product.store]
return {
  "product" : $product.type,
  "country" : $store.country
}

Took: 0.7564096450805664 ms
{"product": "tv", "country": "Switzerland"}
{"product": "tv", "country": "Germany"}
{"product": "phone", "country": "Germany"}
{"product": "tv", "country": "United States"}
{"product": "teapot", "country": "Germany"}
{"product": "tv", "country": "Switzerland"}
{"product": "teapot", "country": "Germany"}
{"product": "phone", "country": null}


In the case of the last product, no matching record in stores.json is found and $\$$store is bound to the empty sequence for that tuple. When constructing the object in the return clause’s expression, the empty sequence obtained from $\$$store.country is automatically replaced with a null value (because an object value cannot be empty). But if we add an array constructor around the country, we will notice the empty sequence:

In [94]:
%%jsoniq 

for $product in json-file("products.json")
for $store allowing empty in json-file("stores.json")[$$.sid eq $product.store]
return {
  "product" : $product.type,
  "country" : [ $store.country ]
}

Took: 0.6417806148529053 ms
{"product": "tv", "country": ["Switzerland"]}
{"product": "tv", "country": ["Germany"]}
{"product": "phone", "country": ["Germany"]}
{"product": "tv", "country": ["United States"]}
{"product": "teapot", "country": ["Germany"]}
{"product": "tv", "country": ["Switzerland"]}
{"product": "teapot", "country": ["Germany"]}
{"product": "phone", "country": []}
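
The difference between the plain join and the variant with allowing empty can be sketched in Python on the same data. This is a naive nested-loop illustration of the semantics only; `matching_stores` is a hypothetical helper, and `None` stands in for both the empty sequence and the resulting null.

```python
products = [
    {"pid": 1, "type": "tv", "store": 1}, {"pid": 2, "type": "tv", "store": 2},
    {"pid": 3, "type": "phone", "store": 2}, {"pid": 4, "type": "tv", "store": 3},
    {"pid": 5, "type": "teapot", "store": 2}, {"pid": 6, "type": "tv", "store": 1},
    {"pid": 7, "type": "teapot", "store": 2}, {"pid": 8, "type": "phone", "store": 4},
]
stores = [
    {"sid": 1, "country": "Switzerland"},
    {"sid": 2, "country": "Germany"},
    {"sid": 3, "country": "United States"},
]

def matching_stores(product):
    """Stores whose sid equals the product's store (the predicate in the query)."""
    return [s for s in stores if s["sid"] == product["store"]]

# Plain join: a product without a matching store produces no tuple at all.
inner = [{"product": p["type"], "country": s["country"]}
         for p in products for s in matching_stores(p)]

# "allowing empty": a product without a match still produces one tuple.
left_outer = []
for p in products:
    for s in matching_stores(p) or [None]:
        country = s["country"] if s is not None else None
        left_outer.append({"product": p["type"], "country": country})

print(len(inner))        # 7
print(len(left_outer))   # 8
print(left_outer[-1])    # {'product': 'phone', 'country': None}
```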


## Let clauses

As seen before, the let clause can be used to bind a variable with any sequence of items, including sequences of more than one item. FLWOR expressions with just a cascade of let clauses are quite popular.

In [95]:
%%jsoniq 

let $x := 2
return $x * $x

Took: 0.04840493202209473 ms
4


However, let clauses can also appear after other clauses, for example, after a for clause. Then, they will bind a sequence of items for each previous binding (tuple), like so:

In [96]:
%%jsoniq 

for $x in 1 to 10
let $square := $x * $x
return
{
  "number": $x,
  "square": $square
}

Took: 0.05125927925109863 ms
{"number": 1, "square": 1}
{"number": 2, "square": 4}
{"number": 3, "square": 9}
{"number": 4, "square": 16}
{"number": 5, "square": 25}
{"number": 6, "square": 36}
{"number": 7, "square": 49}
{"number": 8, "square": 64}
{"number": 9, "square": 81}
{"number": 10, "square": 100}


In the above example, $square is only bound with one item. Here is another example where it is bound with more than one:

In [97]:
%%jsoniq 

for $x in 1 to 10
let $square-and-cube := ($x * $x, $x * $x * $x)
return
  {
    "number": $x,
    "square": $square-and-cube[1],
    "cube": $square-and-cube[2]
  }

Took: 0.03974437713623047 ms
{"number": 1, "square": 1, "cube": 1}
{"number": 2, "square": 4, "cube": 8}
{"number": 3, "square": 9, "cube": 27}
{"number": 4, "square": 16, "cube": 64}
{"number": 5, "square": 25, "cube": 125}
{"number": 6, "square": 36, "cube": 216}
{"number": 7, "square": 49, "cube": 343}
{"number": 8, "square": 64, "cube": 512}
{"number": 9, "square": 81, "cube": 729}
{"number": 10, "square": 100, "cube": 1000}


Note the difference with a for clause:

In [98]:
%%jsoniq 

for $x in 1 to 10
for $square-or-cube in ($x * $x, $x * $x * $x)
return
  {
    "number": $x,
    "square or cube": $square-or-cube
  }

Took: 0.06417059898376465 ms
{"number": 1, "square or cube": 1}
{"number": 1, "square or cube": 1}
{"number": 2, "square or cube": 4}
{"number": 2, "square or cube": 8}
{"number": 3, "square or cube": 9}
{"number": 3, "square or cube": 27}
{"number": 4, "square or cube": 16}
{"number": 4, "square or cube": 64}
{"number": 5, "square or cube": 25}
{"number": 5, "square or cube": 125}
{"number": 6, "square or cube": 36}
{"number": 6, "square or cube": 216}
{"number": 7, "square or cube": 49}
{"number": 7, "square or cube": 343}
{"number": 8, "square or cube": 64}
{"number": 8, "square or cube": 512}
{"number": 9, "square or cube": 81}
{"number": 9, "square or cube": 729}
{"number": 10, "square or cube": 100}
{"number": 10, "square or cube": 1000}


A let clause outputs exactly one outgoing tuple for each incoming tuple (think of a map transformation in Spark). Unlike the for clause, it does not modify the number of tuples. 
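
The contrast between the two clauses can be sketched in Python with tuples modelled as dictionaries. The variable name `sc` is a made-up abbreviation for the square-and-cube bindings of the two queries above.

```python
incoming = [{"x": x} for x in range(1, 11)]     # tuples after: for $x in 1 to 10

# let: one outgoing tuple per incoming tuple, bound to the whole sequence (a map)
after_let = [{**t, "sc": [t["x"] ** 2, t["x"] ** 3]} for t in incoming]

# for: one outgoing tuple per item of the sequence (a flatMap)
after_for = [{**t, "sc": item}
             for t in incoming
             for item in (t["x"] ** 2, t["x"] ** 3)]

print(len(after_let))   # 10
print(len(after_for))   # 20
print(after_let[1])     # {'x': 2, 'sc': [4, 8]}
print(after_for[2:4])   # [{'x': 2, 'sc': 4}, {'x': 2, 'sc': 8}]
```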

Let us now showcase the use of a let clause with our example dataset, binding the type of each product to a variable:

In [99]:
%%jsoniq 

for $product in json-file("products.json")
let $type := $product.type
return $type

Took: 0.45885252952575684 ms
"tv"
"tv"
"phone"
"tv"
"teapot"
"tv"
"teapot"
"phone"


Let clauses also allow for joining the two datasets and denormalizing them by nesting the products into the stores. This would be considerably more difficult to do with (Spark) SQL, even with extensions.

In [100]:
%%jsoniq 

for $store in json-file("stores.json")
let $product := json-file("products.json")[$store.sid eq $$.store]
return {
  "store" : $store.country,
  "available products" : [distinct-values($product.type)]
}

Took: 0.8935444355010986 ms
{"store": "Switzerland", "available products": ["tv"]}
{"store": "Germany", "available products": ["tv", "phone", "teapot"]}
{"store": "United States", "available products": ["tv"]}


## Where clauses

Where clauses are used to filter variable bindings (tuples) based on a predicate on these variables. They are the equivalent of a WHERE clause in SQL.

This is a simple example of its use in conjunction with a for clause:

In [101]:
%%jsoniq 

for $x in 1 to 10
where $x gt 7
return {
  "number": $x,
  "square": $x * $x
}

Took: 0.0701608657836914 ms
{"number": 8, "square": 64}
{"number": 9, "square": 81}
{"number": 10, "square": 100}


A where clause can appear anywhere in a FLWOR expression, except that it cannot be the first clause (always for or let) or the last clause (always return).

In [102]:
%%jsoniq 

for $x in 1 to 10
let $square := $x * $x
where $square gt 60
for $y in $square to $square + 1
return {
  "number": $x,
  "y": $y
}

Took: 0.05547904968261719 ms
{"number": 8, "y": 64}
{"number": 8, "y": 65}
{"number": 9, "y": 81}
{"number": 9, "y": 82}
{"number": 10, "y": 100}
{"number": 10, "y": 101}


A where clause always outputs a subset (or all) of its incoming tuples, without any alteration. In the case that the predicate always evaluates to true, it forwards all tuples, as if there had been no where clause at all. In the case that the predicate always evaluates to false, it outputs no tuple and the FLWOR expression will then return the empty sequence, with no need to further evaluate any of the remaining clauses.
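
As an illustration, the query with the four clauses above can be traced in Python as a chain of generator expressions, one per clause, with tuples modelled as dictionaries. This is a sketch of the tuple stream semantics only, not of how an engine executes the query.

```python
tuples = ({"x": x} for x in range(1, 11))                      # for $x in 1 to 10
tuples = ({**t, "square": t["x"] * t["x"]} for t in tuples)    # let $square := ...
tuples = (t for t in tuples if t["square"] > 60)               # where $square gt 60
tuples = ({**t, "y": y} for t in tuples
          for y in range(t["square"], t["square"] + 2))        # for $y in ...
result = [{"number": t["x"], "y": t["y"]} for t in tuples]     # return

print(result)
# [{'number': 8, 'y': 64}, {'number': 8, 'y': 65}, {'number': 9, 'y': 81},
#  {'number': 9, 'y': 82}, {'number': 10, 'y': 100}, {'number': 10, 'y': 101}]
```

The where step neither adds nor changes a tuple: it only decides which ones flow on to the next clause.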

Here is another example of use of the where clause with our datasets:

In [103]:
%%jsoniq 

for $product in json-file("products.json")
let $store := json-file("stores.json")[$$.sid eq $product.store]
where $store.country = "Germany"
return $product.type

Took: 0.9756145477294922 ms
"tv"
"phone"
"teapot"
"teapot"


## Order by clauses

Order by clauses are used to reorganize the order of the tuples, but without altering them. They are the same as ORDER BY clauses in SQL.

In [104]:
%%jsoniq 

for $x in -2 to 2
let $square := $x * $x
order by $square
return {
  "number": $x,
  "square": $square
}

Took: 0.05294680595397949 ms
{"number": 0, "square": 0}
{"number": -1, "square": 1}
{"number": 1, "square": 1}
{"number": -2, "square": 4}
{"number": 2, "square": 4}


It is also possible, like in SQL, to specify an ascending or a descending order. By default, the order is ascending.

In [105]:
%%jsoniq 

for $x in -2 to 2
let $square := $x * $x
order by $square ascending
return {
  "number": $x,
  "square": $square
}

Took: 0.05420327186584473 ms
{"number": 0, "square": 0}
{"number": -1, "square": 1}
{"number": 1, "square": 1}
{"number": -2, "square": 4}
{"number": 2, "square": 4}


In [106]:
%%jsoniq 

for $x in -2 to 2
let $square := $x * $x
order by $square descending
return {
  "number": $x,
  "square": $square
}

Took: 0.05096888542175293 ms
{"number": -2, "square": 4}
{"number": 2, "square": 4}
{"number": -1, "square": 1}
{"number": 1, "square": 1}
{"number": 0, "square": 0}


In case of ties between tuples, the order is arbitrary. But it is possible to sort on another variable in case there is a tie with the first one (compound sorting keys):

In [107]:
%%jsoniq 

for $x in -2 to 2
let $square := $x * $x
order by $square descending, $x ascending
return {
  "number": $x,
  "square": $square
}

Took: 0.04914379119873047 ms
{"number": -2, "square": 4}
{"number": 2, "square": 4}
{"number": -1, "square": 1}
{"number": 1, "square": 1}
{"number": 0, "square": 0}
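
A compound sorting key corresponds to sorting on a tuple of keys. The following Python sketch reproduces the ordering above, negating the first key to obtain a descending order (a trick that only works for numeric keys).

```python
tuples = [{"x": x, "square": x * x} for x in range(-2, 3)]

# order by $square descending, $x ascending
ordered = sorted(tuples, key=lambda t: (-t["square"], t["x"]))

print([t["x"] for t in ordered])   # [-2, 2, -1, 1, 0]
```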


It is possible to control what to do with empty sequences: they can be considered smallest or greatest.

In [108]:
%%jsoniq 

for $x in 1 to 5
let $y := $x[$$ mod 2 = 1]
order by $y ascending empty greatest
return [ $y ]

Took: 0.053754329681396484 ms
[1]
[3]
[5]
[]
[]


In [109]:
%%jsoniq 

for $x in 1 to 5
let $y := $x[$$ mod 2 = 1]
order by $y ascending empty least
return [ $y ]

Took: 0.06838321685791016 ms
[]
[]
[1]
[3]
[5]
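
The placement of empty sequences can be mimicked in Python by adding a leading component to the sorting key. In this sketch, sequences are modelled as lists and `sort_key` is a hypothetical helper.

```python
# let $y := $x[$$ mod 2 = 1], for $x in 1 to 5
tuples = [{"y": [x] if x % 2 == 1 else []} for x in range(1, 6)]

def sort_key(empty_greatest):
    """Key function placing empty sequences last (True) or first (False)."""
    def key(t):
        is_empty = len(t["y"]) == 0
        # False sorts before True, so this flag decides where empties go.
        rank = is_empty if empty_greatest else not is_empty
        return (rank, t["y"])
    return key

print([t["y"] for t in sorted(tuples, key=sort_key(True))])
# [[1], [3], [5], [], []]
print([t["y"] for t in sorted(tuples, key=sort_key(False))])
# [[], [], [1], [3], [5]]
```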


Here is another example of use of the order by clause with our datasets:

In [110]:
%%jsoniq 

for $product in json-file("products.json")
let $store := json-file("stores.json")[$$.sid eq $product.store]
group by $t := $product.type
order by count($store) descending,
string-length($t) ascending
return $t

Took: 5.172745227813721 ms
"tv"
"teapot"
"phone"


## Group by clauses

Group by clauses organize tuples in groups based on matching keys, and then output only one tuple for each group, aggregating other variables (count, sum, max, min...). This is similar to GROUP BY clauses in SQL.

In [111]:
%%jsoniq 

for $x in 1 to 5
let $y := $x mod 2
group by $y
return {
  "grouping key" : $y,
  "count of x" : count($x)
}

Took: 0.06380772590637207 ms
{"grouping key": 0, "count of x": 2}
{"grouping key": 1, "count of x": 3}


However, JSONiq’s group by clauses are more powerful and expressive than SQL GROUP BY clauses: it is also possible to opt out of aggregating other (non-grouping-key) variables. In that case, for a non-aggregated variable, the sequence of all its values within a group is rebound to this same variable as a single binding in the outgoing tuple. It is thus possible to write many more queries than SQL would allow, which is one of the reasons why a language like JSONiq should be preferred for nested datasets.

In [112]:
%%jsoniq 

for $x in 1 to 5
let $y := $x mod 2
group by $y
return {
  "grouping key" : $y,
  "grouped x values" : [ $x ]
}

Took: 0.06590604782104492 ms
{"grouping key": 0, "grouped x values": [2, 4]}
{"grouping key": 1, "grouped x values": [1, 3, 5]}
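
The rebinding of non-grouping variables can be sketched in plain Python as follows. This models the semantics only and is not how RumbleDB is implemented; the dictionary `groups` is an illustrative assumption. Each group keeps the full sequence of values of x, which can then either be aggregated (count) or nested as is.

```python
from collections import defaultdict

# One entry per grouping key; the value is the sequence of all x in that group.
groups = defaultdict(list)
for x in range(1, 6):
    y = x % 2
    groups[y].append(x)

# After grouping, x is bound to a whole sequence: it can be aggregated
# or kept unaggregated inside an array.
result = [
    {"grouping key": y, "count of x": len(xs), "grouped x values": xs}
    for y, xs in sorted(groups.items())
]

for obj in result:
    print(obj)
```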


Finally, here is an example of use of a group by clause with our example dataset.

In [113]:
%%jsoniq 

for $product in json-file("products.json")
group by $sid := $product.store
let $store := json-file("stores.json")[$$.sid eq $sid]
order by $sid 
return {|
  $store,
  { "products" : [ distinct-values($product.type) ] }
|}

Took: 3.4598042964935303 ms
{"sid": 1, "country": "Switzerland", "products": ["tv"]}
{"sid": 2, "country": "Germany", "products": ["tv", "phone", "teapot"]}
{"sid": 3, "country": "United States", "products": ["tv"]}
{"products": ["phone"]}


## Tuple stream visualization

Although it is not needed for writing simple FLWOR expressions, a visualization can help in understanding how more complex FLWOR expressions are evaluated. We give below a few examples of how tuple streams within a FLWOR expression can be seen as tables (or DataFrames) in which each bound variable is represented by a column:

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/283.png" alt="image" style="width: 60%; height: auto;"> <br>
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/284.png" alt="image" style="width: 60%; height: auto;"> <br>
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/285.png" alt="image" style="width: 60%; height: auto;">
</div> 

Note, however, that these tuple streams are not sequences of items, because clauses are not expressions; tuple streams are only a formal description of the semantics of FLWOR expressions and their visualization as DataFrames is pedagogical. Having said that, the reader may have guessed that tuple streams can be internally implemented as Spark DataFrames, and in fact, RumbleDB does just that (but it hides it from the user).
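
To make the tabular view concrete, here is a minimal Python sketch in which a tuple stream is a list of dictionaries (one dictionary per tuple, one key per variable) and each clause is a function from tuple stream to tuple stream. The function names are hypothetical and the sketch says nothing about RumbleDB's actual internals.

```python
def for_clause(stream, var, expr):
    # One outgoing tuple per item of the sequence returned by expr.
    return [{**t, var: item} for t in stream for item in expr(t)]

def let_clause(stream, var, expr):
    # Adds one column; the number of tuples is unchanged.
    return [{**t, var: expr(t)} for t in stream]

def where_clause(stream, predicate):
    # Keeps only the tuples for which the predicate holds.
    return [t for t in stream if predicate(t)]

def return_clause(stream, expr):
    # Leaves the world of tuples: the output is a flat sequence of items.
    return [item for t in stream for item in expr(t)]

initial = [{}]  # a single tuple with no variables
s1 = for_clause(initial, "x", lambda t: range(1, 6))
s2 = let_clause(s1, "y", lambda t: t["x"] % 2)
s3 = where_clause(s2, lambda t: t["y"] == 1)
result = return_clause(s3, lambda t: [t["x"] * 10])

print(s2)      # five tuples with columns x and y
print(result)  # a sequence of items, not a tuple stream
```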

# Types

The type system in JSONiq is consistent with what was covered in Chapter 7.

In this section we are going to mainly use a bigger git-archive dataset, [git-archive-big.json](https://www.rumbledb.org/samples/git-archive-big.json). You can already check that you get the correct number of records. The dataset should contain 206978 records. You can either use wget to download and read the dataset locally or simply read with json-file from the URI.

We recommend running the cell below to download the data (reading it directly from the URL is slow and hard to debug using the notebook interface).

In [114]:
# Download the big git-archive dataset 
!wget https://www.rumbledb.org/samples/git-archive-big.json

# Download a smaller git-archive dataset 
!wget https://www.rumbledb.org/samples/git-archive.json

--2024-12-20 18:38:24--  https://www.rumbledb.org/samples/git-archive-big.json
Resolving www.rumbledb.org (www.rumbledb.org)... 52.85.223.97, 52.85.223.31, 52.85.223.120, ...
Connecting to www.rumbledb.org (www.rumbledb.org)|52.85.223.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 532404791 (508M) [application/json]
Saving to: ‘git-archive-big.json’


2024-12-20 18:39:41 (6.67 MB/s) - ‘git-archive-big.json’ saved [532404791/532404791]

--2024-12-20 18:39:41--  https://www.rumbledb.org/samples/git-archive.json
Resolving www.rumbledb.org (www.rumbledb.org)... 52.85.223.120, 52.85.223.31, 52.85.223.39, ...
Connecting to www.rumbledb.org (www.rumbledb.org)|52.85.223.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84381884 (80M) [application/json]
Saving to: ‘git-archive.json’


2024-12-20 18:39:52 (7.45 MB/s) - ‘git-archive.json’ saved [84381884/84381884]



## Variable types

It is possible to annotate any FLWOR variable with an expected type as shown below.

In [115]:
%%jsoniq 

let $path as string := "git-archive-big.json"
let $events as object* := json-file($path)
let $actors as object* := $events.actor
let $logins as string* := $actors.login
let $distinct-logins as string* :=
distinct-values($logins)
let $count as integer := count($distinct-logins)
return $count

Took: 3.1215648651123047 ms
53744


Since every value in JSONiq is a sequence of items, a sequence type consists of two parts: an item type and a cardinality.

Item types can be any of the builtin atomic types (JSound) covered in Chapter 7, as well as “object”, “array” and the most generic item type, “item”. Cardinality can be one of the following four:

- Any number of items (suffix *); for example object\*
  
- One or more items (suffix +); for example array+
  
- Zero or one item (suffix ?); for example boolean?
  
- Exactly one item (no suffix); for example integer

If it is detected, at runtime, that a sequence of items is bound to a variable but does not match the expected sequence type, either because one of the items does not match the expected item type, or because the cardinality of the sequence does not match the expected cardinality, then a type error is thrown and the query is not evaluated.

It is also possible to annotate variables in for clauses. However, the cardinality of the sequence type of a for variable will logically be either one (no suffix), or zero-or-one (?) in the case that “allowing empty” is specified.
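
The two-part structure of a sequence type (item type plus cardinality) can be sketched in Python as below. This is an illustrative model only: sequences are represented as Python tuples, only a handful of item types are covered, and the function `matches` is a hypothetical name, not a RumbleDB API.

```python
# Item type checks for a small subset of the builtin types.
ITEM_TYPES = {
    "item": lambda i: True,
    "object": lambda i: isinstance(i, dict),
    "array": lambda i: isinstance(i, list),
    "string": lambda i: isinstance(i, str),
    "boolean": lambda i: isinstance(i, bool),
    # In Python, bool is a subclass of int, so it must be excluded explicitly.
    "integer": lambda i: isinstance(i, int) and not isinstance(i, bool),
}

# Cardinality checks, keyed by suffix (empty suffix: exactly one item).
CARDINALITIES = {
    "*": lambda n: True,
    "+": lambda n: n >= 1,
    "?": lambda n: n <= 1,
    "": lambda n: n == 1,
}

def matches(sequence, sequence_type):
    """Check a sequence (a Python tuple) against a type such as 'array+'."""
    suffix = sequence_type[-1] if sequence_type[-1] in "*+?" else ""
    item_type = sequence_type[: len(sequence_type) - len(suffix)]
    return CARDINALITIES[suffix](len(sequence)) and all(
        ITEM_TYPES[item_type](item) for item in sequence
    )

print(matches((3.14, "foo"), "integer*"))
print(matches(([1], [2, 3]), "array+"))
print(matches((), "boolean?"))
print(matches((), "integer"))
```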

## Type expressions

JSONiq has a few expressions related to types.

An instance of expression checks whether a sequence matches a sequence type, and returns true or false. This is similar to the instanceof operator in Java.

In [116]:
%%jsoniq 

(3.14, "foo") instance of integer*,
([1], [ 2, 3 ]) instance of array+

Took: 0.03475332260131836 ms
false
true


A cast as expression casts single items to an expected item type.

In [117]:
%%jsoniq 

"3.14" cast as decimal

Took: 0.02958369255065918 ms
3.14


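A cast as expression can also deal with an empty sequence: it supports the zero-or-one cardinality (?) in the expected resulting type. But it will throw an error if the sequence has more than one item: you need to use a FLWOR expression if you want to cast every item in a sequence. Note that, in the cell below, the predicate is applied to the array itself rather than to its members, which is why the engine reports a comparison error on an array; unboxing the array first ([1, 2, 3, 4][][$$ > 4]) yields the empty sequence, which the cast then accepts thanks to the ? suffix.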

In [118]:
%%jsoniq 

[1, 2, 3, 4][$$ > 4] cast as string?

Took: 0.05576586723327637 ms
There was an error on line 2 in file:/home/:

[1, 2, 3, 4][$$ > 4] cast as string?
             ^

Code: [JNTY0004]
Message: Invalid args. Comparison can't be performed on array type
Metadata: file:/home/:LINE:2:COLUMN:13:
This code can also be looked up in the documentation and specifications for more information.



A castable as expression tests whether a cast would succeed (in which case it returns true) or not (false).

In [119]:
%%jsoniq 

"3.14" castable as decimal

Took: 0.031230688095092773 ms
true
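
The relationship between cast as and castable as can be sketched in Python: the second is simply a test of whether the first would succeed. The function names and the `optional` flag (standing for the ? suffix) are assumptions for illustration, and Python's `Decimal` is only an approximation of the JSONiq decimal type.

```python
from decimal import Decimal, InvalidOperation

def cast_as_decimal(sequence, optional=False):
    """Cast a sequence (a Python tuple) of at most one item to decimal."""
    if len(sequence) > 1:
        raise TypeError("cannot cast a sequence of more than one item")
    if not sequence:
        if optional:
            return ()  # decimal? accepts the empty sequence
        raise TypeError("empty sequence not allowed without '?'")
    return (Decimal(str(sequence[0])),)

def castable_as_decimal(sequence, optional=False):
    """Return True if the cast would succeed, False otherwise."""
    try:
        cast_as_decimal(sequence, optional)
        return True
    except (TypeError, InvalidOperation):
        return False

print(cast_as_decimal(("3.14",)))
print(castable_as_decimal(("3.14",)))
print(castable_as_decimal(("foo",)))
print(castable_as_decimal((), optional=True))
```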


A treat as expression checks whether its input sequence matches an expected type (like a type on a variable); if it does, the input sequence is returned unchanged. If not, an error is raised. This is useful in complex queries and for debugging purposes.

In [120]:
%%jsoniq 

[ 1, 2, 3, 4][] treat as integer+

Took: 0.030903339385986328 ms
1
2
3
4


There are also typeswitch expressions. The expression inside the typeswitch is evaluated. Then, the resulting sequence is type-matched against each one of the sequence types, in order. The result of the expression corresponding to the first match is taken, and if there are no matches, the result of the default expression is taken.

In [121]:
%%jsoniq 

typeswitch(json-doc("file.json").o[[1]].a.b[[1]].c)
case integer+ return "integer"
case string return "string"
default return "other"

Took: 0.08123111724853516 ms
"integer"
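
The first-match rule can be sketched in Python as a loop over the cases. The helper names are hypothetical, sequences are modelled as Python tuples, and each case's sequence type is hand-coded as a predicate for brevity.

```python
def is_integer(item):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(item, int) and not isinstance(item, bool)

def typeswitch(sequence, cases, default):
    """Return the result of the first case whose type test matches."""
    for type_test, result in cases:
        if type_test(sequence):
            return result
    return default

cases = [
    # case integer+ : one or more items, all of them integers
    (lambda s: len(s) >= 1 and all(is_integer(i) for i in s), "integer"),
    # case string : exactly one item, which is a string
    (lambda s: len(s) == 1 and isinstance(s[0], str), "string"),
]

print(typeswitch((1, 2), cases, "other"))
print(typeswitch(("foo",), cases, "other"))
print(typeswitch((), cases, "other"))
```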


## Types in user-defined functions

JSONiq supports user-defined functions. Parameter types can be optionally specified, and a return type can also be optionally specified.

In [122]:
%%jsoniq 

declare function is-big-data(
  $threshold as integer,
  $objects as object*
) as boolean
{
  count($objects) gt $threshold
};

is-big-data(1000, json-file("git-archive.json"))

Took: 6.575667858123779 ms
true


The type annotations can also be omitted entirely:

In [123]:
%%jsoniq 

declare function is-big-data(
  $threshold,
  $objects
)
{
  count($objects) gt $threshold
};

is-big-data(1000, json-file("git-archive.json"))

Took: 1.5746407508850098 ms
true


## Validating against a schema

It is possible to declare a schema, associating it with a user-defined type, and to validate a sequence of items against this user-defined type.

In [124]:
%%jsoniq 

declare type local:histogram as {
  "commits" : "short",
  "count" : "long"
};

validate type local:histogram* {
  for $event in json-file("git-archive-big.json")
  group by $nb-commits := (size($event.payload.commits), 0)[1]
  order by $nb-commits
  return {
    "commits" : $nb-commits,
    "count" : count($event)
  }
}

Took: 5.272742033004761 ms
{"commits": 0, "count": 94554}
{"commits": 1, "count": 92094}
{"commits": 2, "count": 9951}
{"commits": 3, "count": 3211}
{"commits": 4, "count": 1525}
{"commits": 5, "count": 877}
{"commits": 6, "count": 688}
{"commits": 7, "count": 426}
{"commits": 8, "count": 383}
{"commits": 9, "count": 259}
{"commits": 10, "count": 274}
{"commits": 11, "count": 193}
{"commits": 12, "count": 146}
{"commits": 13, "count": 104}
{"commits": 14, "count": 119}
{"commits": 15, "count": 89}
{"commits": 16, "count": 76}
{"commits": 17, "count": 70}
{"commits": 18, "count": 67}
{"commits": 19, "count": 46}
{"commits": 20, "count": 1826}


If the results of a JSONiq query have been validated against a JSound schema, under specific conditions (the same covered in Chapter 7 for a schema to be DataFrame compatible), then it is possible to save the output of the query in other formats than JSON, such as Parquet, Avro, or (if there is no nestedness) CSV.

# Architecture of a query engine

We now cover the physical architecture and implementation of a query engine such as RumbleDB.

## Static phase

When a query is received by an engine, it is text that needs to be parsed. The theory and techniques for doing this (context-free grammars, EBNF...) are covered in compiler design courses. The output of this is a tree structure called an Abstract Syntax Tree.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/286.png" alt="image" style="width: 60%; height: auto;">
</div> 

An Abstract Syntax Tree, even though it already has the structure of a tree, is tightly tied to the original syntax. Thus, it needs to be converted into a more abstract Intermediate Representation called an expression tree. Every node in this tree corresponds to either an expression or a clause in the JSONiq language, making the design modular.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/287.png" alt="image" style="width: 60%; height: auto;">
</div> 

At this point, static typing takes place, meaning that the engine infers the static type of each expression, that is, the most specific type possible expected at runtime (but without actually running the program). User-specified types are also taken into account for this step. Inferring static types facilitates the optimization step.

Engines like RumbleDB perform their optimization round on this Intermediate Representation. Optimizations consist in changing the tree to another one that will evaluate faster, but without changing the semantics of the query (i.e., it should produce the same output). An example is that, if RumbleDB detects that both sides of a general comparison are single items, then the comparison is rewritten as a more efficient value comparison. Another example is that user-defined function calls are “inlined”, meaning that the body of the function is copied over in place of the function call, as if the user had written it manually there.
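
The comparison rewrite can be illustrated with a toy expression tree in Python. This is a sketch under strong simplifying assumptions: the node classes are hypothetical, and the only static type information used is that a literal is always a single item. RumbleDB's actual optimizer is written in Java and relies on full static type inference.

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: object

@dataclass
class Range:  # e.g. 1 to 5: may produce many items
    start: object
    end: object

@dataclass
class Comparison:
    op: str
    left: object
    right: object

# General comparison operators and their value comparison counterparts.
GENERAL_TO_VALUE = {"=": "eq", "!=": "ne", "<": "lt", "<=": "le", ">": "gt", ">=": "ge"}

def is_single_item(node):
    # Toy static typing: only literals are known to be exactly one item.
    return isinstance(node, Literal)

def optimize(node):
    """Rewrite general comparisons into value comparisons where it is safe."""
    if not isinstance(node, Comparison):
        return node
    left, right = optimize(node.left), optimize(node.right)
    op = node.op
    if op in GENERAL_TO_VALUE and is_single_item(left) and is_single_item(right):
        op = GENERAL_TO_VALUE[op]
    return Comparison(op, left, right)

single = optimize(Comparison("=", Literal(1), Literal(1)))
multi = optimize(Comparison("=", Range(Literal(1), Literal(5)), Literal(3)))
print(single.op)  # rewritten
print(multi.op)   # left untouched: the left side may have several items
```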

Once optimizations have been done, RumbleDB decides the mode with which each expression and clause will be evaluated (locally, sequentially, in parallel, in DataFrames, etc). The resulting expression tree is then converted to a runtime iterator tree; this is the query plan that will actually be evaluated by the engine.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/288.png" alt="image" style="width: 60%; height: auto;">
</div>

Every node in a runtime iterator tree outputs either a sequence of items (if it corresponds to an expression) or a tuple stream (if it corresponds to a clause other than the return clause).

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/289.png" alt="image" style="width: 40%; height: auto;">
</div>

## Dynamic phase

During the dynamic phase, the root of the tree is asked to produce a sequence of items, which is to be the final output of the query as a whole.

Then, recursively, each node in the tree will ask its children to produce sequences of items (or tuple streams). Each node then combines the sequences of items (or tuple streams) it receives from its children in order to produce its own sequence of items according to its semantics, and pass it to its parent. That way, the data flows all the way from the bottom of the tree to its root, and the final results are obtained and presented to the user or written to persistent storage (drive or data lake). 
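
The recursive flow of data from the leaves to the root can be sketched in Python with a tiny tree of runtime iterators. For simplicity, every sequence here is a fully materialized Python list; the class names are illustrative and do not correspond to RumbleDB's Java classes.

```python
class IntegerLiteral:
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return [self.value]  # a sequence of one item

class Range:
    def __init__(self, start, end):
        self.start, self.end = start, end

    def evaluate(self):
        # Ask both children for their (single-item) sequences.
        [low] = self.start.evaluate()
        [high] = self.end.evaluate()
        return list(range(low, high + 1))

class Comma:
    def __init__(self, *children):
        self.children = children

    def evaluate(self):
        # Concatenate the sequences received from all children.
        return [item for child in self.children for item in child.evaluate()]

class Count:
    def __init__(self, child):
        self.child = child

    def evaluate(self):
        return [len(self.child.evaluate())]

# Runtime iterator tree for: count((1 to 3, 10))
root = Count(Comma(Range(IntegerLiteral(1), IntegerLiteral(3)), IntegerLiteral(10)))
print(root.evaluate())
```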

There are many different ways for a runtime iterator to produce an output sequence of items (or tuple stream) and pass it to its parent runtime iterator in the tree:

- By materializing sequences of items (or tuple streams) completely in local computer memory.

- By locally iterating over each item in a sequence, one after the other (or over each tuple in a tuple stream, one after the other).

- By working in parallel over the sequence of items, internally stored as a Spark RDD.

- By working in parallel over the sequence of items (or tuple stream), internally stored as a Spark DataFrame.

- By compiling the semantics of the iterator directly to native Spark SQL.

### Materialization

When a sequence of items is materialized, it means that an actual List (or Array, or Vector), native to the language of implementation (in this case Java) is stored in local memory, filled with the items. This is, of course, only possible if the sequence is small enough that it fits.

The parent runtime iterator then directly processes this List in place, in order to produce its output. 

A special case is when an expression is statically known to return either zero or one item (e.g., an addition, or a logical expression), but not more. Then no List structure is needed, and a single Item can be returned via a simple method call in the language of implementation (Java).

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/290.png" alt="image" style="width: 60%; height: auto;">
</div> 


### Streaming 

With larger sequences of items, it becomes impractical to materialize them because the memory footprint becomes too large, and the size of the sequences that can be manipulated is strictly limited by the total memory available.

Thus, another technique is used instead: streaming. When a sequence of items (or tuple stream) is produced and consumed in a streaming fashion, it means that the items (or tuples) are produced and consumed one by one, iteratively. But the whole sequence of items (or tuple stream) is never stored anywhere.

The classical pattern for doing so is known as the Volcano iterator architecture. It consists in first calling a method called open() to initialize the iterator, then hasNext() to check if there exists a next item (or tuple), and if so, then next() to consume it; and then hasNext() and next() are called again and repeatedly as long as hasNext() does not return false. When it finally does, close() is called to clean up the iterator.
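
The open/hasNext/next/close protocol can be sketched in Python as follows. The classes are hypothetical, the method names follow Python conventions (`has_next` instead of hasNext), and the final `consume` helper collects results into a list only so that they can be printed; a real consumer would process each item as it arrives.

```python
class RangeIterator:
    """Produces the integers from start to end, one at a time."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.current = None

    def open(self):
        self.current = self.start

    def has_next(self):
        return self.current <= self.end

    def next(self):
        value = self.current
        self.current += 1
        return value

    def close(self):
        self.current = None

class FilterIterator:
    """Streams over its child and only lets matching items through."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
        self.pending, self.has_pending = None, False

    def open(self):
        self.child.open()
        self._advance()

    def _advance(self):
        # Pull from the child until a matching item is found (or none is left).
        self.pending, self.has_pending = None, False
        while self.child.has_next():
            candidate = self.child.next()
            if self.predicate(candidate):
                self.pending, self.has_pending = candidate, True
                return

    def has_next(self):
        return self.has_pending

    def next(self):
        value = self.pending
        self._advance()
        return value

    def close(self):
        self.child.close()

def consume(iterator):
    iterator.open()
    results = []
    while iterator.has_next():
        results.append(iterator.next())
    iterator.close()
    return results

print(consume(FilterIterator(RangeIterator(1, 10), lambda x: x % 2 == 1)))
```

At any point in time, the filter holds at most one pending item, regardless of how long the underlying range is.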

With this technique, it is possible to process sequences that are much larger than memory, because the actual sequence is never fully stored. However, there are two problems with this: first, it can take a lot of time to go through the entire sequence (imagine doing so with billions or trillions of items). Second, there are expressions or clauses that are not compatible with streaming (consider, for example, the group by or order by clause, which cannot be implemented without materializing their full input).

### Parallel execution (with RDDs)

When a sequence becomes unreasonably large, RumbleDB switches to parallel execution, leveraging Spark capabilities: the sequences of items are passed and processed as RDDs of Item objects. Each runtime iterator then calls Spark transformations on these RDDs to produce an output RDD, or in some cases (e.g., count()) calls a Spark action to produce a single, local, materialized Item.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/291.png" alt="image" style="width: 60%; height: auto;">
</div> 

<br>

A Spark transformation or action often needs to be supplied with an additional function (e.g., a map function, a filter function), called a Spark UDF (for “User-Defined Function”). RumbleDB then squeezes an entire runtime iterator subtree into a UDF, so that this subtree can be recursively evaluated on each node of the cluster, as a local execution (materialized or streaming).

For example, imagine a filter expression, with a specific predicate, on a sequence of a billion items. If the input sequence is physically available as an RDD, RumbleDB squeezes the predicate’s runtime iterator tree into a UDF, and invokes the filter() transformation with this UDF, resulting in a smaller RDD that contains the filtered sequence of items. Physically, the predicate’s runtime iterator tree will be evaluated on items, in parallel, across thousands of machines in the cluster; relative to each one of these machines, this is a local execution, where the predicate iterator streams over each batch.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/292.png" alt="image" style="width: 50%; height: auto;"> <br> <br> 
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/293.png" style="width: 50%; height: auto;">
</div> 
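
The idea of wrapping a predicate's iterator tree into a function that is shipped to each partition can be sketched in plain Python. This is not Spark code: the partitions are ordinary lists processed one after the other, standing in for batches that a real cluster would process in parallel, and all class and function names are hypothetical.

```python
class ContextField:
    """Looks up a key in the context item (the item being filtered)."""
    def __init__(self, key):
        self.key = key

    def evaluate(self, item):
        return item.get(self.key)

class Constant:
    def __init__(self, value):
        self.value = value

    def evaluate(self, item):
        return self.value

class Equals:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, item):
        return self.left.evaluate(item) == self.right.evaluate(item)

def to_udf(predicate_tree):
    # The whole subtree is captured in a closure that takes a single item.
    return lambda item: predicate_tree.evaluate(item)

def filter_partitions(partitions, udf):
    # Each partition is filtered independently, as a local execution.
    return [[item for item in partition if udf(item)] for partition in partitions]

# Predicate tree for something like: $$.bar eq "foobar"
udf = to_udf(Equals(ContextField("bar"), Constant("foobar")))

partitions = [
    [{"foo": 1, "bar": "foobar"}, {"foo": 2, "bar": "other"}],
    [{"foo": 3, "bar": "foobar"}],
]
filtered = filter_partitions(partitions, udf)
print(filtered)
```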

<br> 

The use of RDDs is specific to sequences of items and does not exist for tuple streams.

### Parallel execution (with DataFrames)

The RDD implementation supports heterogeneous sequences by leveraging the polymorphism of Item objects. However, this is not efficient in the case that Items in the same sequence happen to have a regular structure.

Thus, if the Items in a sequence are valid against a specific schema, or even against an array type or an atomic type, the underlying physical storage in memory relies on Spark DataFrames instead of RDDs. Homogeneous sequences of arrays or of atomics (e.g., a sequence of integers) are physically implemented as a one-column DataFrame with the corresponding type.

Thus, there exists a mapping from JSONiq types to Spark SQL types. In the case that there is no corresponding Spark SQL type, the implementation falls back to RDDs.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/294.png" alt="image" style="width: 50%; height: auto;"> 
</div> 

<br>

To summarize, homogeneous sequences of the most common types are stored in DataFrames, and RDDs are used in all other cases.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/295.png" alt="image" style="width: 60%; height: auto;"> 
</div> 

<br>

DataFrames are also consistently used for storing tuple streams and parallelizing the execution of FLWOR clauses. In FLWOR DataFrames, every column corresponds to one FLWOR variable, which is similar to the visuals provided earlier for FLWOR expressions in this chapter. The column type is native if the variable type can be mapped seamlessly to a Spark SQL type. Otherwise, the column type is binary, and Items are serialized to sequences of bytes and deserialized back on demand.

### Parallel execution (with Native SQL)

In some cases (more in every release), RumbleDB is able to evaluate the query using only Spark SQL, compiling JSONiq to SQL directly instead of packing Java runtime iterators in UDFs. This leads to faster execution, because UDFs are slower than a native execution in SQL. This is because, to a SQL optimizer, UDFs are opaque and prevent automatic optimizations.

RumbleDB switches seamlessly between all execution modes, even within the same query, as shown in the following diagram.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/RumbleDB/bigdata-exercises/master/Big_Data/JSONiq-Jupyter/Textbook_images/296.png" alt="image" style="width: 60%; height: auto;"> 
</div> 

# Learning objectives

The following is a checklist that students can use during their learning in order to self-assess their mastery of the material.

1. Can you explain why a language such as JSONiq provides, in the context of denormalized data, a similar functionality as SQL in a relational database? Do you understand how it generalizes querying to nested, heterogeneous data models, and is thus more powerful than SQL?

2. Can you name and describe the first-class citizen of the JSONiq Data Model: a sequence of items?

3. Can you name various kinds of items in the JDM?

4. Can you name a few query languages in the XML/JSON ecosystem?

5. Do you understand how to navigate nested structures in JSONiq (object lookup, array lookup, array unboxing, filtering predicates)?

6. Are you able, in JSONiq, to construct items (atomic items, objects, arrays, etc.)?

7. Are you able, in JSONiq, to perform logical operations? Do you understand what the Effective Boolean Value of a sequence is, and how it fits in the context of logical operations?

8. Are you able, in JSONiq, to perform arithmetic operations (addition, etc.)? Do you understand the constraints on the input sequences of such operations? Can you explain the behavior of these operations on empty sequences? Can you explain what happens if one of the two operands is an object or an array and not an atomic item?

9. Are you able, in JSONiq, to perform comparisons (lt, ge, etc.)? Do you understand the constraints on the input sequences of such operations? Can you explain the behavior of these operations on empty sequences? Can you explain what happens if one of the two operands is an object or an array and not an atomic item?

10. Do you understand how general comparisons (<, >=, etc.) work on sequences with more than one item, and implicitly use an existential quantifier? Can you explain why, in the special case of single items on both sides, they are equivalent to value comparisons?

11. Do you understand how FLWOR expressions work, and can you describe what they return? (for clause, let clause, where clause, order by clause, etc.)

12. Are you able to use further expressions (if-then-else, switch, ...)?

13. Do you understand how to dynamically build JSON content with object and array constructors?

14. Do you understand that expressions can be combined at will, as any expression takes and returns sequences of items? Do you know how to use parentheses to make precedence clear, like you did in primary school with addition and multiplication?

15. Do you know the JSONiq type syntax (atomic types taken from XML Schema, the object, array and item types, as well as cardinality symbols), and how to use type checking (instance of, cast as, etc.)?

16. Given a collection of JSON objects (for example JSON Lines on HDFS), are you able to write JSONiq queries (FLWOR) that do projection? selection? grouping? ordering? joins? sorting?

17. Do you understand how, given a SQL query, you can write something equivalent in JSONiq?

18. Do you understand that this is not true the other way round (rewriting JSONiq as an equivalent SQL query)? Can you characterize examples of when this is not true or very difficult (hint: denormalized data)?

# Literature and recommended readings

The following is a list of recommended material for further reading and study.

- Müller, I, Fourny, G, Irimescu, S, Cikis, C, Alonso, G. (2021). Rumble: Data Independence for Large, Messy Data Sets. In: PVLDB 14(4). 10-minute presentation of RumbleDB (above paper) at VLDB 2021. [Watch on YouTube](https://www.youtube.com/watch?v=q3IxXnYZ8UM)

- JSONiq language reference, and online sandbox. [Visit JSONiq.org](https://www.jsoniq.org/)

- RumbleDB engine, free and open source. [Visit RumbleDB.org](https://www.rumbledb.org/)

- Graur, D, Müller, I, Proffitt, M, Fourny, G, Watts, G, Alonso, G. Evaluating Query Languages and Systems for High-Energy Physics Data. Joint interdisciplinary work ETH Zurich - University of Washington. In: PVLDB 15(2).