# Big Data HS 2023

## JSONiq tutorial - week 6

This is the JSONiq tutorial for week 6.

Remember to open the notebook at localhost:8888 so that it is served from Docker. If that does not work, delete all containers, images, and volumes, then try again with:

````
docker-compose up
````

Like last week, just run the cell below to connect the Jupyter notebook with RumbleDB.

In [1]:
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://localhost:9090/jsoniq

env: RUMBLEDB_SERVER=http://localhost:9090/jsoniq


## Built-in functions
https://rumble.readthedocs.io/en/latest/Function%20library/

In [223]:
%%jsoniq
let $players := json-file("players.jsonl")

for $player in $players
return {
"nationality": $player.nationality
}

Took: 4.0657958984375 ms
{"nationality": "USA"}
{"nationality": "Brazil"}
{"nationality": "Spain"}
{"nationality": "USA"}
{"nationality": "France"}
{"nationality": "Spain"}
{"nationality": "Brazil"}
{"nationality": "USA"}


In [188]:
%%jsoniq
let $players := json-file("players.jsonl")
for $player in $players
let $nat := $player.nationality
group by $nat
return {
  "nationality": $nat,
  "count": count($player[[]][$$.nationality = $nat])
}

Took: 0.225020170211792 ms
{"nationality": "USA", "count": 3}
{"nationality": "Spain", "count": 2}
{"nationality": "France", "count": 1}
{"nationality": "Brazil", "count": 2}


In [207]:
%%jsoniq
let $players := json-file("players.jsonl")
for $player in $players
let $nat := $player.nationality
group by $nat
return {
  "nationality": $nat,
  "count": count($player)
}

Took: 0.12901711463928223 ms
{"nationality": "Brazil", "count": 2}
{"nationality": "France", "count": 1}
{"nationality": "Spain", "count": 2}
{"nationality": "USA", "count": 3}


In [3]:
%%jsoniq
let $fruits := ["apple", "banana", "apple", "orange", "banana", "apple", "orange", "pear"]
for $fruit in distinct-values($fruits[])
group by $fruit
return {
    $fruit: count($fruits[][$$ = $fruit])
}

Took: 0.9580099582672119 ms
{"apple": 3}
{"pear": 1}
{"banana": 2}
{"orange": 2}


In [4]:
%%jsoniq
let $fruits := ["apple", "banana", "apple", "orange", "banana", "apple", "orange", "pear"]
for $fruit in $fruits[]
group by $fruit
return {
    $fruit: count($fruits)
}

Took: 0.04123997688293457 ms
{"apple": 3}
{"pear": 1}
{"banana": 2}
{"orange": 2}


In [24]:
%%jsoniq
let $fruits := ["apple", "banana", "apple", "orange", "banana", "apple"]
for $fruit in $fruits[]
return {
 $fruit: $fruits
}

Took: 0.03415489196777344 ms
{"apple": ["apple", "banana", "apple", "orange", "banana", "apple"]}
{"banana": ["apple", "banana", "apple", "orange", "banana", "apple"]}
{"apple": ["apple", "banana", "apple", "orange", "banana", "apple"]}
{"orange": ["apple", "banana", "apple", "orange", "banana", "apple"]}
{"banana": ["apple", "banana", "apple", "orange", "banana", "apple"]}
{"apple": ["apple", "banana", "apple", "orange", "banana", "apple"]}


In [26]:
%%jsoniq
let $fruits := ["apple", "banana", "apple", "orange", "banana", "apple"]
for $fruit in $fruits[]
group by $fruit
return {
    $fruit: count($fruits)
}

Took: 0.03659391403198242 ms
{"apple": 3}
{"banana": 2}
{"orange": 1}


In [17]:
%%jsoniq
let $fruits := [{"kind": "apple"}, { "kind" : "banana"}, { "kind" :  "apple"}, { "kind" :  "orange"}, { "kind" :  "banana"}, { "kind" :  "apple" } ]

for $fruit in $fruits[]
    let $kind := $fruit.kind
    group by $kind
return {
    $kind: count($fruit)
}

Took: 0.03167891502380371 ms
{"apple": 3}
{"banana": 2}
{"orange": 1}


In [None]:
# Assumes PySpark is installed; reuse the active SparkContext or create one
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Suppose you have an RDD called 'data' consisting of key-value pairs
data = sc.parallelize([(1, 'apple'), (2, 'banana'), (1, 'orange'), (2, 'apple')])

# Group by key, then map each group of values to its size
grouped_data = data.groupByKey().mapValues(len)
grouped_data.collect()

In [66]:
%%jsoniq
keys(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Took: 0.47971582412719727 ms
"type"
"payload"
"public"
"id"
"created_at"
"actor"
"repo"
"org"


In [67]:
%%jsoniq
distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type)

Took: 0.3426690101623535 ms
"CommitCommentEvent"
"MemberEvent"
"PushEvent"
"ForkEvent"
"PullRequestReviewCommentEvent"
"PullRequestEvent"
"CreateEvent"
"DeleteEvent"
"WatchEvent"
"IssueCommentEvent"
"IssuesEvent"
"GollumEvent"
"ReleaseEvent"


## Element access

In [68]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1]

Took: 0.24642610549926758 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}


In [69]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].repo.name

Took: 0.32953691482543945 ms
"lainrose/Python-Grammar"


## Filters

In [70]:
%%jsoniq
count(
    let $url := "http://www.rumbledb.org/samples/git-archive-small.json"
    return json-file($url).payload.commits[size($$) gt 10]
)

Took: 0.2909409999847412 ms
6


In [72]:
%%jsoniq
for $i in 1 to 10
return ($i * 2)[$$ gt 10]

Took: 0.035987138748168945 ms
12
14
16
18
20


In [73]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  let $commits := $event.payload.commits
  where size($commits) gt 10
  return $event
)

Took: 0.36256909370422363 ms
6


In [74]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  return $event.payload.commits[size($$) gt 10]
)

Took: 0.33940601348876953 ms
6


## Unpacking

In [80]:
%%jsoniq
count(json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) eq 0])

Took: 0.23618197441101074 ms
31


In [76]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][[]]

Took: 0.4584798812866211 ms
[{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}, {"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}, {"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}, {

In [77]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][[2]]

Took: 0.36048316955566406 ms
{"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}


## Navigating an existing JSON dataset

Let us look at an existing dataset on the Web. We picked a [GitHub archive file](https://gharchive.org)
that we stored for convenience at this location: https://www.rumbledb.org/samples/git-archive.json.

Accessing a JSON dataset can be done in two ways depending on the exact format:

- If this is a file that contains a single JSON object spread over multiple lines, use json-doc(URL).
- If this is a file that contains one JSON object per line (JSON Lines), use json-file(URL).
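
The difference between the two layouts can be sketched in plain Python. This is only an analogy for what the two functions expect as input, not how RumbleDB reads files; the helper names and the sample strings are made up for illustration.

```python
import json

def parse_json_doc(text):
    """Parse text holding a single JSON value, possibly spread over several lines."""
    return json.loads(text)

def parse_json_lines(text):
    """Parse text holding one JSON value per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical sample inputs in each layout.
single_document = """{
  "id": 1,
  "type": "PushEvent"
}"""
json_lines = """{"id": 1, "type": "PushEvent"}
{"id": 2, "type": "WatchEvent"}
"""

print(parse_json_doc(single_document))   # one object
print(len(parse_json_lines(json_lines)))  # 2 objects, one per line
```

Parsing the multi-line document line by line would fail, because no single line of it is valid JSON on its own.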

The GitHub archive dataset is in the JSON Lines format, so we open it with json-file.

In [2]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")

Took: 3.137354850769043 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}
{"id": "70451

This is a large file: the previous query returned 500 JSON objects. To take a closer look, let us select just the first object with a number predicate.

In [6]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1]

Took: 0.9896159172058105 ms
{"id": "7045118886", "type": "PushEvent", "actor": {"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}, "repo": {"id": 115387592, "name": "lainrose/Python-Grammar", "url": "https://api.github.com/repos/lainrose/Python-Grammar"}, "payload": {"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}, "public": true, "created_at": "2018-01-01T15:00:00Z"}


We can see that there are nested objects and arrays. This is perfect for JSONiq. Let us now figure out all the keys used in this dataset with the keys() function.

In [7]:
%%jsoniq
keys(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Took: 1.315551996231079 ms
"type"
"payload"
"public"
"id"
"created_at"
"actor"
"repo"
"org"


Let us look closer at the key called "type". What values does it take? We can use dot-based navigation to navigate down to these values. This will work nicely on the entire dataset.

In [3]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").type

Took: 0.9869658946990967 ms
"PushEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"GollumEvent"
"PushEvent"
"PullRequestEvent"
"PushEvent"
"PushEvent"
"IssuesEvent"
"PushEvent"
"PullRequestEvent"
"WatchEvent"
"PushEvent"
"WatchEvent"
"PullRequestEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PushEvent"
"IssueCommentEvent"
"CreateEvent"
"IssuesEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"ForkEvent"
"CreateEvent"
"CreateEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"WatchEvent"
"PushEvent"
"DeleteEvent"
"PushEvent"
"PushEvent"
"IssueCommentEvent"
"PushEvent"
"CreateEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"ForkEvent"
"PullRequestEvent"
"IssuesEvent"
"PushEvent"
"WatchEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"PushEvent"
"CreateEvent"
"PushEvent"
"PushEvent"
"WatchEvent"
"CreateEvent"
"PushEvent"
"PushEve

It looks like there are a lot of duplicates in there. Let us use distinct-values() to figure out all unique values.

In [8]:
%%jsoniq
distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type)

Took: 1.3639898300170898 ms
"CommitCommentEvent"
"MemberEvent"
"PushEvent"
"ForkEvent"
"PullRequestReviewCommentEvent"
"PullRequestEvent"
"CreateEvent"
"DeleteEvent"
"WatchEvent"
"IssueCommentEvent"
"IssuesEvent"
"GollumEvent"
"ReleaseEvent"


So we see that for the key "type", all values are strings and there are only... how many, by the way? Let us use count().

In [9]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").type))

Took: 1.0597796440124512 ms
13


So there are 13. Note that count() works just as well on the entire dataset, to know how many objects there are.

In [10]:
%%jsoniq
count(json-file("http://www.rumbledb.org/samples/git-archive-small.json"))

Took: 0.7698550224304199 ms
500
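
In plain Python terms, distinct-values() followed by count() amounts to deduplicating a list and taking its length. The sketch below uses a short made-up list of event types instead of the real dataset.

```python
# Hypothetical stand-in for the sequence produced by `.type`.
types = ["PushEvent", "PushEvent", "WatchEvent", "PushEvent", "ForkEvent"]

# distinct-values(...): drop duplicates (dict.fromkeys keeps first-seen order).
distinct_types = list(dict.fromkeys(types))

print(distinct_types)       # ['PushEvent', 'WatchEvent', 'ForkEvent']
print(len(distinct_types))  # count(distinct-values(...)) -> 3
print(len(types))           # count(...) on the whole sequence -> 5
```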


Let us now look at nested objects. It seems the key "actor" has these, so let us use the dot object lookup to find all these values.

In [11]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor

Took: 0.8339860439300537 ms
{"id": 20090775, "login": "lainrose", "display_login": "lainrose", "gravatar_id": "", "url": "https://api.github.com/users/lainrose", "avatar_url": "https://avatars.githubusercontent.com/u/20090775?"}
{"id": 17426563, "login": "tumhopaasmere", "display_login": "tumhopaasmere", "gravatar_id": "", "url": "https://api.github.com/users/tumhopaasmere", "avatar_url": "https://avatars.githubusercontent.com/u/17426563?"}
{"id": 1449578, "login": "daa84", "display_login": "daa84", "gravatar_id": "", "url": "https://api.github.com/users/daa84", "avatar_url": "https://avatars.githubusercontent.com/u/1449578?"}
{"id": 22536460, "login": "thautwarm", "display_login": "thautwarm", "gravatar_id": "", "url": "https://api.github.com/users/thautwarm", "avatar_url": "https://avatars.githubusercontent.com/u/22536460?"}
{"id": 18603467, "login": "markstachowski", "display_login": "markstachowski", "gravatar_id": "", "url": "https://api.github.com/users/markstachowski", "avatar_u

We can chain dot object lookups to navigate further down, for example to logins. Let us figure out how many distinct logins there are.

In [12]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.login))

Took: 0.9090051651000977 ms
374


The id field inside the actor object seems to be an integer. What is the highest value? The max() function also works at large scales, just like count(), min(), avg() and sum().

In [13]:
%%jsoniq
max(json-file("http://www.rumbledb.org/samples/git-archive-small.json").actor.id)

Took: 0.9605281352996826 ms
35003609
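
Chained dot lookups on a sequence can be pictured with the Python sketch below: each lookup keeps the value from every object that has the key and silently skips everything else. The `lookup` helper and the tiny `events` list are made up for illustration.

```python
def lookup(items, key):
    """Mimic `.key` on a sequence: keep the value from each object that has the key."""
    return [item[key] for item in items if isinstance(item, dict) and key in item]

# Hypothetical events; the last one has no "actor" key.
events = [
    {"actor": {"id": 7, "login": "alice"}},
    {"actor": {"id": 42, "login": "bob"}},
    {"type": "WatchEvent"},
]

ids = lookup(lookup(events, "actor"), "id")   # .actor.id
print(ids)       # [7, 42]
print(max(ids))  # max(...) -> 42
```

Because missing keys simply produce nothing, the chain never fails on heterogeneous data.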


Alright, let us now look for nested arrays. There do not seem to be any inside the actor object, so let us try the key "payload". Let us just look at the first one.

In [14]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload

Took: 0.6818640232086182 ms
{"push_id": 2226161589, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "before": "d6fce97b8de28a31d021c9a9f7ac939baa14d208", "commits": [{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]}


Here we see that there is a nested array associated with key "commits".

In [15]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json")[1].payload.commits

Took: 0.6369149684906006 ms
[{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}]


In this case, there is only one object in this array. Is there, by any chance, any one of these arrays that has more than one commit? For this, we can use a Boolean predicate. Let us evaluate the predicate

size($$) gt 1

which uses the size function and the gt (greater than) comparison and where $$ is the current array being tested.

In [16]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1]

Took: 0.8224687576293945 ms
[{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}, {"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}, {"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}, {

Let us just take the first one to have more visibility.

In [17]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1]

Took: 1.2692430019378662 ms
[{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}, {"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}, {"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}, {

We can expand it to a sequence of objects using the [] array unboxing syntax.

In [20]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][]

Took: 0.8981659412384033 ms
{"sha": "95e600df9a5a669f53dc7de28147814678d12e81", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Get days/tasks via JSONAPI", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/95e600df9a5a669f53dc7de28147814678d12e81"}
{"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}
{"sha": "9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Migrate to unified List model", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/9227c61c103ec1ee7b6dc8e126d14bc85fdf3dfd"}
{"sha

We can also look up a specific position, say, the second object, with the [[ ]] array lookup syntax.

In [23]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[size($$) gt 1][1][[2]]

Took: 0.8969278335571289 ms
{"sha": "d348f84df64c5473ba6a95a108e7c0263a434add", "author": {"name": "Phil Gengler", "email": "e888d2bd6f13f82caa51a37c03d034c76f661ba3@pgengler.net"}, "message": "Update tests", "distinct": true, "url": "https://api.github.com/repos/pgengler/todolist-client/commits/d348f84df64c5473ba6a95a108e7c0263a434add"}


And now, please hold for something awesome. We can unbox all arrays of the entire collection at the same time by just using the [] syntax on the entire dataset.

In [24]:
%%jsoniq
json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[]

Took: 1.4888331890106201 ms
{"sha": "27a01fbdbec8e26daa40fc8faa052dd0be23836a", "author": {"name": "lainrose", "email": "fb4676bf30682e2ece361fd363a69ad11779c42e@Naver.com"}, "message": "Update Study Contents", "distinct": true, "url": "https://api.github.com/repos/lainrose/Python-Grammar/commits/27a01fbdbec8e26daa40fc8faa052dd0be23836a"}
{"sha": "45b2f857540d7d4286d1abef204aef167190be0f", "author": {"name": "tumhopaasmere", "email": "bcc6c59276ad7bbcd0b972dd58baaef7cccc22d4@mailinator.com"}, "message": "GIT CloneShare Commit", "distinct": true, "url": "https://api.github.com/repos/tumhopaasmere/tumhopaasmere/commits/45b2f857540d7d4286d1abef204aef167190be0f"}
{"sha": "ea291a9baea441ea815e822bba5e8c9f330542f7", "author": {"name": "thautwarm", "email": "820a7b45b87f3c40f5e1c273015816c9c19a8401@outlook.com"}, "message": "API overview and example", "distinct": true, "url": "https://api.github.com/repos/thautwarm/EBNFParser/commits/ea291a9baea441ea815e822bba5e8c9f330542f7"}
{"sha": "95e600d

These are objects. It is all too tempting to navigate further down with more dot object-lookup syntax. All at the same time, obviously. Let us figure out how many unique emails there are in all commits of all events.

In [25]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[].author.email))

Took: 1.7867281436920166 ms
256


Now, how many unique emails are there in first commits?

In [26]:
%%jsoniq
count(distinct-values(json-file("http://www.rumbledb.org/samples/git-archive-small.json").payload.commits[[1]].author.email))

Took: 0.8590991497039795 ms
233
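
The difference between the last two queries can be sketched in Python: unboxing with [] visits every commit of every event, while [[1]] only visits the first commit of each array. The events and the `commit_arrays` helper below are hypothetical.

```python
# Hypothetical events with made-up email addresses.
events = [
    {"payload": {"commits": [
        {"author": {"email": "a@example.com"}},
        {"author": {"email": "b@example.com"}},
    ]}},
    {"payload": {"commits": [{"author": {"email": "a@example.com"}}]}},
    {"payload": {"commits": []}},   # an event with an empty commits array
    {"payload": {}},                # an event without any commits key
]

def commit_arrays(events):
    """Yield each commits array, skipping events that have none (.payload.commits)."""
    for event in events:
        commits = event.get("payload", {}).get("commits")
        if isinstance(commits, list):
            yield commits

# .payload.commits[].author.email: unbox every array, then navigate down.
all_emails = {c["author"]["email"] for arr in commit_arrays(events) for c in arr}

# .payload.commits[[1]].author.email: only the first member of each array.
first_emails = {arr[0]["author"]["email"] for arr in commit_arrays(events) if arr}

print(len(all_emails))    # 2
print(len(first_emails))  # 1
```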


You have now learned how to navigate large JSON datasets with the dot object lookup syntax, the [] array unboxing syntax, the [[ ]] array lookup syntax, number predicates, and Boolean predicates.

All of these work nicely on very large sequences, and you can chain them arbitrarily. In fact, this will all happen in parallel on the cores of your machine or even on a large cluster.

You also saw how to aggregate large sequences of values with min, max, count, avg and sum.

Finally, you saw how to eliminate duplicates with distinct-values.

## Iteration

In the previous tutorial, we looked at let and return clauses.
It is possible to iterate on the elements in a sequence with another clause: the for clause, like so:

In [52]:
%%jsoniq
for $i in 1 to 10
return ($i * 2)[$$ gt 10]

Took: 0.04629993438720703 ms
12
14
16
18
20
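
For readers who think in Python, the query above is close to a comprehension with a condition. One detail to keep in mind: `1 to 10` includes both ends, so the matching Python range is `range(1, 11)`.

```python
# for $i in 1 to 10 return ($i * 2)
doubled = [i * 2 for i in range(1, 11)]

# [$$ gt 10]: keep only the values greater than 10.
result = [value for value in doubled if value > 10]

print(result)  # [12, 14, 16, 18, 20]
```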


The sequence to iterate over can itself come from a dataset, such as the one we were using previously:

In [29]:
%%jsoniq
for $event in json-file("http://www.rumbledb.org/samples/git-archive-small.json")
return size($event.payload.commits)[$$ gt 10]

Took: 1.6578338146209717 ms
20
16
20
17
20
20


For clauses can be mixed with let clauses:

In [30]:
%%jsoniq
let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
for $event in json-file($path)
let $commits := $event.payload.commits
return size($commits)

Took: 2.2407660484313965 ms
1
1
1
0
4
2
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
1
1
1
1
1
1
1
1
1
0
2
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
20
1
1
1
1
1
1
1
1
1
1
1
1
16
1
20
2
1
0
1
1
1
1
2
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
0
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
17
0
1
1
1
1
1
0
1
1
1
2
20
1
0
1
1
1
2
2
1
1
0
1
1
1
1
0
1
1


And the results can also be nested in a more complex query: for example, let us compute the max of all these array sizes.

In [31]:
%%jsoniq
max(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  let $commits := $event.payload.commits
  return size($commits)
)

Took: 1.7732129096984863 ms
20
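
The for, let and return clauses map quite naturally onto a Python generator, as in the sketch below. The sample events are made up, and the sketch assumes that an event without a commits array contributes nothing to the result, which is what the shorter-than-500 output of the earlier query suggests.

```python
def commit_counts(events):
    """Yield the size of each commits array, one value per event that has one."""
    for event in events:                                   # for $event in ...
        commits = event.get("payload", {}).get("commits")  # let $commits := ...
        if isinstance(commits, list):
            yield len(commits)                             # return size($commits)

# Hypothetical events (strings stand in for commit objects).
events = [
    {"payload": {"commits": ["c1"]}},
    {"payload": {"commits": []}},
    {"payload": {"commits": ["c2", "c3", "c4"]}},
    {"payload": {"ref": "refs/heads/master"}},  # no commits array at all
]

sizes = list(commit_counts(events))
print(sizes)       # [1, 0, 3]
print(max(sizes))  # 3
```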


A third kind of clause is the where clause: it allows you to filter events. Let us only keep those with more than 10 commits, and count them.

In [32]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  let $commits := $event.payload.commits
  where size($commits) gt 10
  return $event
)

Took: 0.978795051574707 ms
6


In [58]:
%%jsoniq
count(
  let $path := "http://www.rumbledb.org/samples/git-archive-small.json"
  for $event in json-file($path)
  return $event.payload.commits[size($$) gt 10]
)

Took: 1.1853728294372559 ms
6
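
The last two queries give the same count because a where clause on the event and a Boolean predicate on the commits array test the same condition. Here is a rough Python sketch of that equivalence, with made-up events in which integers stand in for commit objects.

```python
def commits_of(event):
    """Return the commits array of an event, or None if it has none."""
    commits = event.get("payload", {}).get("commits")
    return commits if isinstance(commits, list) else None

# Hypothetical events with 12, 3, 11 and no commits.
events = [
    {"id": 1, "payload": {"commits": list(range(12))}},
    {"id": 2, "payload": {"commits": list(range(3))}},
    {"id": 3, "payload": {"commits": list(range(11))}},
    {"id": 4, "payload": {}},
]

# where size($commits) gt 10 return $event: keeps whole events.
kept_events = [
    e for e in events
    if commits_of(e) is not None and len(commits_of(e)) > 10
]

# return $event.payload.commits[size($$) gt 10]: keeps the arrays themselves.
kept_arrays = [
    c for c in map(commits_of, events)
    if c is not None and len(c) > 10
]

print(len(kept_events), len(kept_arrays))  # 2 2
```

The counts agree, but the items differ: the first form returns events, the second returns commits arrays.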
