# Big Data HS 2025

## JSONiq tutorial - week 6

Every week, you will get a small tutorial notebook that introduces you to the JSONiq language with the RumbleDB engine. You can simply copy this notebook to the "notebooks" subfolder in your Exam MagicBox Docker environment (the same environment that contains past exams, PostgreSQL, Spark, RumbleDB, etc.).

The instructions are in week 1's tutorial.


Like last week, just run the cell below to connect the Jupyter notebook with RumbleDB.

In [None]:
%load_ext jsoniqmagic

## Navigating an existing JSON dataset

Let us look at an existing dataset on the Web. We picked a [GitHub archive file](https://gharchive.org)
that we stored for convenience at this location: https://www.rumbledb.org/samples/git-archive.json.

Accessing a JSON dataset can be done in two ways depending on the exact format:

- If this is a file that contains a single JSON object spread over multiple lines, use json-doc(URL).
- If this is a file that contains one JSON object per line (JSON Lines), use json-file(URL).

The GitHub archive dataset is in the JSON Lines format, so we open it with json-file.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")

This is a large file, and the previous query output 500 JSON objects. To take a closer look, let us start with just the first object, which we select with a number predicate.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1]

We can see that there are nested objects and arrays. This is perfect for JSONiq. Let us now figure out all the keys used in this dataset with the keys() function.

In [None]:
%%jsoniq
keys(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Let us look closer at the key called "type". What values does it take? We can use dot-based navigation to navigate down to these values. This will work nicely on the entire dataset.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").type

It looks like there are a lot of duplicates in there. Let us use distinct-values() to figure out all unique values.

In [None]:
%%jsoniq
distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type)

So we see that for the key "type", all values are strings and there are only... how many, by the way? Let us use count().

In [None]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type))

So there are 13. Note that count() works just as well on the entire dataset, to know how many objects there are.

In [None]:
%%jsoniq
count(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Let us now look at nested objects. It seems the key "actor" has these, so let us use the dot object lookup to find all these values.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor

We can chain dot object lookups to navigate further down, for example to logins. Let us figure out how many distinct logins there are.

In [None]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.login))

The id field inside the actor object seems to be an integer. What is the highest value? The max() function also works at large scales, just like count() and also min(), avg() and sum().

In [None]:
%%jsoniq
max(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.id)

Alright, let us now look for nested arrays. There do not seem to be any inside the actor object, so let us try the key "payload". Let us just look at the first one.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload

Here we see that there is a nested array associated with key "commits".

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload.commits

In this case, there is only one object in this array. Is there, by any chance, any one of these arrays that has more than one commit? For this, we can use a Boolean predicate. Let us evaluate the predicate

size($$) gt 1

which uses the size function and the gt (greater than) comparison and where $$ is the current array being tested.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1]

Let us just take the first one to have more visibility.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1]
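
To see what the two kinds of predicates do in isolation, here is a minimal sketch on a small made-up sequence of arrays (the data is invented for illustration and is not taken from the GitHub archive). The Boolean predicate keeps the arrays with more than one member, and the number predicate then picks the first of the survivors.

In [None]:
%%jsoniq
let $arrays := ([ "a" ], [ "b", "c" ], [ "d", "e", "f" ])
return {
  "filtered" : [ $arrays[size($$) gt 1] ],
  "first-match" : $arrays[size($$) gt 1][1]
}

The "filtered" value should contain the two longer arrays, and "first-match" should be the array [ "b", "c" ].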

We can expand it to a sequence of objects using the [] array unboxing syntax.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][]

We can also look up a specific position, say, the second object, with the [[ ]] array lookup syntax.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][[2]]
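
The difference between unboxing and lookup is easier to see on a tiny array. The following sketch uses a made-up array of three objects (the "sha" values are placeholders): size() counts the members, [[2]] returns only the second member, and [] returns all members as a sequence.

In [None]:
%%jsoniq
let $commits := [ { "sha" : "a1" }, { "sha" : "b2" }, { "sha" : "c3" } ]
return (
  size($commits),
  $commits[[2]].sha,
  $commits[].sha
)

This should return 3, then "b2", then the three strings "a1", "b2" and "c3".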

And now, please hold for something awesome. We can unbox all arrays of the entire collection at the same time by just using the [] syntax on the entire dataset.

In [None]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[]

These are objects. It is all too tempting to navigate further down with more dot object-lookup syntax. All at the same time, obviously. Let us figure out how many unique emails there are in all commits of all events.

In [None]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[].author.email))

Now, how many unique emails are there if we only consider the first commit of each event?

In [None]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[[1]].author.email))
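
As a sanity check of the difference between the last two queries, here is a sketch on three made-up events (the email addresses are invented). The third event has no commits at all, which is common in the real dataset; the navigation simply returns nothing for it.

In [None]:
%%jsoniq
let $events := (
  { "payload" : { "commits" : [
      { "author" : { "email" : "a@example.com" } },
      { "author" : { "email" : "b@example.com" } }
  ] } },
  { "payload" : { "commits" : [
      { "author" : { "email" : "a@example.com" } }
  ] } },
  { "payload" : { } }
)
return {
  "all-commits" : count(distinct-values($events.payload.commits[].author.email)),
  "first-commits" : count(distinct-values($events.payload.commits[[1]].author.email))
}

With this data, there are two distinct emails across all commits, but only one among the first commits.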

You have now learned how to navigate large JSON datasets with the dot object lookup syntax, the [] array unboxing syntax, the [[ ]] array lookup syntax, number predicates, and Boolean predicates.

All of these work nicely on very large sequences, and you can chain them arbitrarily. In fact, this will all happen in parallel on the cores of your machine or even on a large cluster.

You also saw how to aggregate large sequences of values with min, max, count, avg and sum.

Finally, you saw how to eliminate duplicates with distinct-values.

## Iteration

In the previous tutorial, we looked at let and return clauses.
It is possible to iterate on the elements in a sequence with another clause: the for clause, like so:

In [None]:
%%jsoniq
for $i in 1 to 10
return $i * 2

The sequence to iterate on can itself come from a dataset, such as the one we were using previously:

In [None]:
%%jsoniq
for $event in json-file("http://www.rumbledb.org/samples/git-archive-small.json")
return size($event.payload.commits)

For clauses can be mixed with let clauses:

In [None]:
%%jsoniq
let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-file($path)
let $commits := $event.payload.commits
return size($commits)

And the results can also be nested in a more complex query: for example, let us compute the max of all these array sizes.

In [None]:
%%jsoniq
max(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  let $commits := $event.payload.commits
  return size($commits)
)

A third kind of clause is the where clause: it allows you to filter events. Let us only keep those with more than 10 commits, and count them.

In [None]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  let $commits := $event.payload.commits
  where size($commits) gt 10
  return $event
)
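
The interplay of for, let and where can also be tried on a few made-up events (the ids and commit names are invented). Note that the third event has no commits: in that case size() has nothing to measure, the comparison does not hold, and the event is filtered out rather than causing an error.

In [None]:
%%jsoniq
for $event in (
  { "id" : 1, "payload" : { "commits" : [ "c1", "c2", "c3" ] } },
  { "id" : 2, "payload" : { "commits" : [ "c4" ] } },
  { "id" : 3, "payload" : { } }
)
let $commits := $event.payload.commits
where size($commits) gt 1
return $event.id

Only the first event has more than one commit, so this should return 1.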

## Schema validation

JSound schemas can be declared as named types, and used to validate and annotate objects, as follows.
Note that the payload field is very heterogeneous and thus marked as a generic object; you can try as an exercise to specify the object layout further! Other systems (such as BigQuery) cannot handle the heterogeneity and have to store it as a string containing the serialized object.

In [None]:
%%jsoniq
declare type local:event as {
    "id" : "long",
    "type" : "string",
    "actor" : {
        "id" : "long",
        "login" : "string",
        "display_login" : "string",
        "url" : "string",
        "avatar_url" : "string",
        "gravatar_id" : "string"
    },
    "repo" : {
        "id" : "long",
        "name" : "string",
        "url" : "string"
    },
    "payload" : "object",
    "public" : "string",
    "created_at" : "dateTimeStamp",
    "org" : "object"
};

let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-file($path)
return validate type local:event { $event }
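
To experiment with type declarations without waiting for the whole dataset, the same syntax can be tried on a single inline object. This is only a sketch: the type local:person and its fields are made up for illustration and have nothing to do with the GitHub archive.

In [None]:
%%jsoniq
declare type local:person as {
    "name" : "string",
    "age" : "integer"
};

validate type local:person { { "name" : "Ada", "age" : 36 } }

You can then change the value of "age" to something that is not an integer and see how the validation reacts.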


As a convenience, the jsoniq magic allows you to show the output as a DataFrame.

In [None]:
%%jsoniq -df
declare type local:event as {
    "id" : "long",
    "type" : "string",
    "actor" : {
        "id" : "long",
        "login" : "string",
        "display_login" : "string",
        "url" : "string",
        "avatar_url" : "string",
        "gravatar_id" : "string"
    },
    "repo" : {
        "id" : "long",
        "name" : "string",
        "url" : "string"
    },
    "payload" : "object",
    "public" : "string",
    "created_at" : "dateTimeStamp",
    "org" : "object"
};

let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-file($path)
return validate type local:event { $event }


Or as a pandas DataFrame!

In [None]:
%%jsoniq -pdf
declare type local:event as {
    "id" : "long",
    "type" : "string",
    "actor" : {
        "id" : "long",
        "login" : "string",
        "display_login" : "string",
        "url" : "string",
        "avatar_url" : "string",
        "gravatar_id" : "string"
    },
    "repo" : {
        "id" : "long",
        "name" : "string",
        "url" : "string"
    },
    "payload" : "object",
    "public" : "string",
    "created_at" : "dateTimeStamp",
    "org" : "object"
};

let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-file($path)
return validate type local:event { $event }


It is also possible to validate and annotate atomic values in a more lightweight fashion. This is called a cast.

In [None]:
%%jsoniq -pdf
let $date := date("-1234-12-31")
return {
  "year" : year-from-date($date),
  "month" : month-from-date($date),
  "day" : day-from-date($date)
}

This is an alternate syntax:

In [None]:
%%jsoniq -pdf
let $date := "-1234-12-31" cast as date
return {
  "year" : year-from-date($date),
  "month" : month-from-date($date),
  "day" : day-from-date($date)
}

It is possible to catch an unsuccessful cast with a try-catch expression:

In [None]:
%%jsoniq
try {
    let $date := "This is not a date" cast as date
    return {
      "year" : year-from-date($date),
      "month" : month-from-date($date),
      "day" : day-from-date($date)
    }
}
catch * {
    "The cast did not succeed"
}

One can also test whether a cast will succeed or not instead of waiting for an error:

In [None]:
%%jsoniq
"-1234-12-31" castable as date,
"This is not a date" castable as date

One can also test whether a value is already an instance of a specific type, without casting it.

In [None]:
%%jsoniq
let $date := "-1234-12-31" cast as date
return $date instance of date
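
The difference between castable as and instance of is worth spelling out: a string that looks like a date is castable as a date, but it is still a string until it is actually cast. The following sketch checks both on two made-up strings.

In [None]:
%%jsoniq
for $value in ("2025-10-27", "not a date")
return {
  "value" : $value,
  "castable" : $value castable as date,
  "already-a-date" : $value instance of date
}

The first string should be reported as castable and the second as not castable, while "already-a-date" should be false for both.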

By the way, dates and durations can be added like numbers:

In [None]:
%%jsoniq
declare type local:date-and-duration as {
  "da" : "date",
  "du" : "yearMonthDuration"
};

let $my-date := validate type local:date-and-duration {
  {
    "da" : "-1234-12-31",
    "du" : "P3000Y4M"
  }
}
return $my-date.da + $my-date.du

## Try your own queries!

This notebook is interactive. You can edit all queries above and also execute your own! We will show you more features every week.

In [None]:
%%jsoniq
1+1

In [None]:
%%jsoniq
1+1

In [None]:
%%jsoniq
1+1

In [None]:
%%jsoniq
1+1