# Big Data HS 2024

## JSONiq tutorial - week 7

This is the JSONiq tutorial for week 7.

Do not forget to use localhost:8888 as the URL so that the notebook is accessed via Docker. If it does not work, delete all containers, images, and volumes, then try again with:

````
docker-compose up
````

Like last week, just run the cell below to connect the Jupyter notebook with RumbleDB.

In [None]:
%load_ext rumbledb
%env RUMBLEDB_SERVER=http://rumble:9090/jsoniq

## Navigating an existing JSON dataset

We continue with an existing dataset on the Web. Recall the following query, which opens the textual dataset as a sequence of strings.

In [None]:
%%jsoniq
unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")

We are now going to logically simulate a MapReduce-like job that counts the number of occurrences of each word.
First, we can add a for clause to iterate over the lines (and do nothing else, so the query is equivalent to the previous one).

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
return $line

Next, we can tokenize the lines. For simplicity, we will split on spaces. The built-in tokenize() function splits a string into several strings and, by default, does so on whitespace.

Tokenizing each string in this way would correspond to the mapping phase of MapReduce (the value associated with each one of the words is implicitly 1).

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
return tokenize($line)
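
For readers who think more readily in Python, the mapping phase can be sketched in plain Python. This is only an analogy and says nothing about how RumbleDB executes the query. The two sample lines and the helper name `map_tokens` are made up for illustration, and `str.split()` is assumed to be a close enough stand-in for the one-argument `tokenize()`.

```python
def map_tokens(lines):
    """Yield every whitespace-separated token of every line, in order."""
    for line in lines:
        # str.split() with no argument splits on runs of whitespace
        # and ignores leading and trailing whitespace.
        for token in line.split():
            yield token


# Hypothetical sample input standing in for the lines of the text file.
sample_lines = ["To be, or not to be,", "that is the question:"]
tokens = list(map_tokens(sample_lines))
print(tokens)
```

Note that punctuation stays attached to the words ("be," is one token), which is also why the counts computed later are per token rather than per dictionary word.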

We can also bind an intermediate variable to each token for convenience.

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
return $token

We can make the intermediate key-value pairs explicit:

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
let $pair := { $token : 1 }
return $pair
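
As a rough Python analogy (a sketch with made-up sample lines, not the notebook's data), each token can be wrapped in a single-key dictionary, mirroring the object `{ $token : 1 }` built by the let clause:

```python
# Hypothetical sample input standing in for the lines of the text file.
sample_lines = ["to be or not to be", "to sleep"]

# One single-key dict per token, mirroring { $token : 1 } in the query.
pairs = [{token: 1} for line in sample_lines for token in line.split()]
print(pairs[:3])
```

Every pair carries the value 1, so nothing has been counted yet; the pairs only make the implicit value of the map phase visible.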

Next, we can use a group by clause, which essentially handles the shuffling and groups identical words together.

After the group by clause, in each group, \$t will be bound to the current token, and \$pair (which precedes the group by) will now contain the *sequence* of all pairs with the current token as a key.

Thus, a JSONiq group by clause is similar to a SQL GROUP BY clause, but it is more generic because of its ability to bind each non-key variable to the sequence of all its values within a group, with no obligation to aggregate.

Note how we dynamically navigate to all the values in the sequence of pairs with \$pair.\$t, where \$t is the current token and \$pair contains all pairs with that token.

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
let $pair := { $token : 1 }
group by $t := keys($pair)[1]
return 
{
    $t : sum($pair.$t)
}
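
The shuffle and reduce steps can be sketched in plain Python as follows. This is an illustrative analogy on a hardcoded list of pairs (an assumption, not the notebook's data): the dictionary of lists plays the role of the groups, and the final sum plays the role of `sum($pair.$t)`.

```python
from collections import defaultdict

# Hypothetical intermediate pairs, as the map phase might produce them.
pairs = [{"to": 1}, {"be": 1}, {"or": 1}, {"not": 1}, {"to": 1}, {"be": 1}]

# Shuffle: collect all pairs sharing a key, like group by $t := keys($pair)[1].
groups = defaultdict(list)
for pair in pairs:
    key = next(iter(pair))  # the single key of the pair
    groups[key].append(pair)

# At this point each key maps to the full list of its pairs; nothing has
# been aggregated yet, just as $pair is bound to a sequence within a group.
# Reduce: sum the values of each group, like sum($pair.$t).
counts = {key: sum(p[key] for p in group) for key, group in groups.items()}
print(counts)
```

Keeping the grouped pairs in `groups` before summing mirrors the point made above: grouping and aggregating are separate steps, and the second one is optional.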

We can clean up a bit by binding the count to an intermediate variable like so:

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
let $pair := { $token : 1 }
group by $t := keys($pair)[1]
let $count := sum($pair.$t)
return 
{
    $t : $count
}

This allows us to sort by descending count and spot the most common tokens. The order by clause is similar to the SQL ORDER BY clause and also offers the choice between ascending and descending.

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
let $pair := { $token : 1 }
group by $t := keys($pair)[1]
let $count := sum($pair.$t)
order by $count descending
return 
{
    $t : $count
}
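
In Python terms, ordering by descending count might look like the sketch below. The counts are invented for illustration; `sorted` with `reverse=True` stands in for `order by $count descending`.

```python
# Hypothetical token counts, as the grouping step might produce them.
counts = {"to": 2, "be": 2, "or": 1, "not": 1, "the": 3}

# Sort the (token, count) items by count, largest first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Rebuild one single-key object per token, like the return clause.
result = [{token: count} for token, count in ranked]
print(result)
```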

Note that we can simplify the query a bit. This is possible because JSONiq is more high-level than MapReduce and does not force the use of keys!

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
group by $t := $token
order by count($token) descending
return 
{
    $t : count($token)
}
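
The same simplification exists in Python, where the explicit pairs can be skipped and the tokens counted directly. This sketch uses `collections.Counter` on made-up sample lines; it is an analogy for the query above, not a description of its execution.

```python
from collections import Counter

# Hypothetical sample input standing in for the lines of the text file.
sample_lines = ["to be or not to be", "to sleep"]

# Count tokens directly, with no intermediate key-value pairs.
counts = Counter(token for line in sample_lines for token in line.split())

# most_common() returns (token, count) tuples, largest count first.
most_common = counts.most_common()
print(most_common)
```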

We can also limit the size of the output with a count clause, which binds a variable to the position of each tuple. Combined with a where clause, this is similar to the LIMIT and OFFSET clauses in SQL, but more general, because the where clause can filter on any condition over the position.

This is also an opportunity to point out that the order of the clauses in JSONiq is very flexible, whereas in SQL the clauses have to appear in the order SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET. In JSONiq, the only requirement is that the first clause is either a for or a let, and that the last clause is a return clause.

In [None]:
%%jsoniq
for $line in unparsed-text-lines("https://www.rumbledb.org/samples/hamlet.txt")
for $token in tokenize($line)
group by $t := $token
order by count($token) descending
count $c
where $c le 10
return 
{
    $t : count($token)
}
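
A plain-Python sketch of the count and where clauses, on an invented ranked list: `enumerate(..., start=1)` plays the role of `count $c`, and the list comprehension conditions play the role of the where clause. The last two filters illustrate why this is more general than LIMIT and OFFSET.

```python
# Hypothetical ranked (token, count) tuples, already sorted by count.
ranked = [("the", 5), ("and", 4), ("to", 3), ("of", 2), ("a", 1)]

# Number the tuples starting at 1, like `count $c`.
numbered = list(enumerate(ranked, start=1))

# Like `where $c le 3`: keep the first three tuples (LIMIT 3).
top3 = [item for c, item in numbered if c <= 3]

# A window on the position, comparable to LIMIT 2 OFFSET 2.
page = [item for c, item in numbered if 3 <= c <= 4]

# Any predicate on the position works, for example every other tuple.
odd_rows = [item for c, item in numbered if c % 2 == 1]

print(top3, page, odd_rows)
```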

## Try your own queries!

This notebook is interactive. You can edit all queries above and also execute your own! We will show you more features every week.

In [None]:
%%jsoniq
1+1
